
Michael 82d7fef21e feat(gate): CSV-normalization gate with confidence-tiered findings

Adds a Review & Normalize page that sits between upload and every tool
page. The analyzer now tags each finding with confidence (high/medium/low)
and a fix_action; the gate auto-applies high-confidence fixes, surfaces
medium/low ones for user review, and blocks tool pages on error-level
findings until resolved or waived.

Core (src/core/):
  - analyze.py: Finding gains confidence, fix_action, pre_applied; new
    detectors for encoding_uncertain, encoding_decode_failed; new top-
    level encoding_override parameter.
  - fixes.py: registry of fix algorithms keyed by fix_action id.
  - normalize.py: auto_fix(), apply_decisions(), is_normalized(), and
    the NormalizationResult / Decision dataclasses the gate consumes.
  - io.py: detect_encoding tries strict UTF-8 first; repair_bytes now
    transcodes UTF-16/32 to UTF-8 before NUL-strip (fixes UTF-16 corruption)
    and normalizes line endings (fixes bare-CR parser crash); empty file
    handled gracefully instead of EmptyDataError traceback.

GUI (src/gui/):
  - pages/0_Review.py: gate page with per-finding decision controls,
    encoding override picker (16 codepages + custom), and Advanced output
    options (encoding, delimiter, line terminator) on the download.
  - components.py: require_normalization_gate() helper.
  - pages/1-9: gate guard wired on every tool page.

Test corpora:
  - test-cases/encodings-corpus/: 31 encoded CSV fixtures + 9 reference
    UTF-8 files + manifest, synced from Business/DataTools.
  - test-cases/text-cleaner-corpus/test_data/17: synced malformed input
    (unquoted $1,500.00) for the unquoted-delimiter detector.

Tests (94 new):
  - test_normalize.py (48): finding fields, fix registry, auto_fix scope,
    decision paths, gate idempotency, output-options helper.
  - test_encodings_corpus.py (90, 16 xfailed): parametric detection +
    decode + analyzer-no-crash sweep against the manifest.
  - test_analyze.py: encoding override + encoding_uncertain detectors.
  - test_corpus.py: pre-parse repair in the strict reader.

run_tests.py: new aliases --tool normalize, --tool encodings, --tool gate;
encodings corpus added to --fixtures category.

Docs: USER-GUIDE §3.3 covers the gate workflow, encoding override, and
output options; TECHNICAL §10.2.1-10.2.4 documents the analyzer schema,
gate API, Review page, and pre-parse repair pipeline; CLI-REFERENCE adds
the analyzer JSON schema with the new fields; README links to all of it.

Suite: 765 passed, 17 xfailed (was 458 passed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-29 20:35:27 +00:00


USER-GUIDE.md - Excel & CSV Data Cleaning Mastery Bundle

Version: 1.6 Last updated: April 28, 2026

Thank you for purchasing the Data Cleaning Mastery Bundle. This guide covers installation and every script included.

1. Installation

The bundle is fully self-contained. You do not need to install Python.

Windows

Download BundleName-Setup-1.0.exe from your purchase email.
Double-click the installer.
Follow the wizard. The installer creates a desktop shortcut named "Launch Bundle" and an entry in Start Menu.
Launch via the desktop shortcut. Your default browser will open to a local page (typically http://localhost:8501) where the data tool runs.

macOS

Download BundleName-1.0.dmg from your purchase email.
Double-click the .dmg to mount it.
Drag the Bundle app into the Applications folder.
Launch from Applications, Spotlight, or Launchpad. Your default browser will open to a local page where the data tool runs.

The app is code-signed with a Developer ID and notarized by Apple, so it opens cleanly with no security warnings.

Linux

Download BundleName-1.0.AppImage from your purchase email.
Make it executable: chmod +x BundleName-1.0.AppImage
Double-click to run, or execute from a terminal. Your default browser will open to a local page where the data tool runs.

If AppImage doesn't work on your distribution, a .tar.gz fallback is available in your purchase email. Extract it and run ./run.sh from the extracted folder.

How the GUI works (important to know)

This tool runs in your browser locally on your computer. When you launch it, a small program starts a local server on your machine and opens your default browser to view it. This is normal and expected.

No internet is required.
Your data is never uploaded anywhere. All processing happens on your machine.
The browser is just the display surface. Closing the browser closes the GUI; the underlying program also stops.

If you prefer the command line, every script also ships as a CLI tool. See Section 3.2.

Requirements

Windows: Windows 10 or 11 (64-bit).
macOS: macOS 11 Big Sur or later (Apple Silicon or Intel).
Linux: any modern 64-bit distribution from 2020 onward.
A modern default browser (Chrome, Edge, Firefox, or Safari from the last 3 years).
~400-500 MB free disk space.
Internet connection: not required.

2. What's Included

Scripts (in the scripts/ folder):

| # | Script | Purpose | Status |
|---|--------|---------|--------|
| 01 | `01_deduplicator.py` | Smart duplicate removal: exact match + basic fuzzy, configurable subset columns, full logs | Working |
| 02 | `02_text_cleaner.py` | Character-level hygiene: trim leading/trailing whitespace, collapse internal multi-spaces, strip non-printable characters, Unicode normalization (smart quotes, em-dashes, accents), remove zero-width characters, BOM handling, line-ending normalization, case operations | Working |
| 03 | `03_format_standardizer.py` | Standardize dates, currencies, names, phone numbers, addresses | Skeleton |
| 04 | `04_missing_value_handler.py` | Detect and handle missing values: disguised nulls (`N/A`, `-`, blanks, sentinel codes), imputation (mean/median/mode/forward-fill), required-field enforcement, drop-by-threshold | Skeleton |
| 05 | `05_column_mapper_enforcer.py` | Rename columns, enforce a target schema | Skeleton |
| 06 | `06_outlier_detector.py` | Detect and flag statistical outliers (z-score, IQR, modified z-score), multivariate detection, domain-rule violations, optional winsorization | Skeleton |
| 07 | `07_multi_file_merger.py` | Merge multiple CSV or Excel files into one | Skeleton |
| 08 | `08_validator_reporter.py` | Validate data against rules, output PDF or Excel report | Skeleton |
| 09 | `09_master_orchestrator.py` | One-click launcher menu, calls any other script | Skeleton |

Sample data (in the samples/ folder):

messy_sales.csv - intentionally dirty sales data for testing.
bank_export.xlsx - sample bank export for testing missing-value handling and outlier detection.

3. Usage

You have two ways to use the bundle: the GUI (recommended for most users) or the CLI (for power users and automation).

3.1 GUI usage (recommended)

Launch the bundle via the desktop shortcut, app icon, or AppImage.
Your browser opens to the bundle's home page.
Select the script you want to use from the sidebar (Deduplicator, Format Standardizer, etc.).
Drop your file into the file uploader, or select from the included samples.
Sensible defaults are pre-filled. Click "Run" to see a preview of what the script will do.
Review the preview. If it looks right, click "Save Output" to write the cleaned file.

The GUI is designed to work out of the box with zero configuration. Advanced options are tucked into expandable "Advanced" panes for users who want them.

3.2 CLI usage

All scripts are also CLI tools with --help output.

Basic usage (from a terminal):

Windows (the bundle adds CLI tools to your PATH):

deduplicator samples\messy_sales.csv

macOS / Linux:

deduplicator samples/messy_sales.csv

With options:

deduplicator samples/messy_sales.csv --output cleaned.csv --subset email,phone

Get help on any script:

deduplicator --help

Recommended run order: If you are running scripts individually, run 02_text_cleaner first to normalize whitespace and special characters, then 04_missing_value_handler before 06_outlier_detector. Outlier detection on data that still contains blanks or sentinel codes (like -999) is unreliable because the placeholders distort the statistics: the mean is dragged toward the sentinel, the IQR widens, and genuine outliers slip through as false negatives. The Master Orchestrator (script 09) runs the scripts in the correct order automatically.
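To see why the order matters, here is a small standalone illustration. It is not part of the bundle: the sample values, the SENTINELS set, and both helper functions are made up for this example, and the fences are the standard 1.5 × IQR rule computed with Python's statistics module.

```python
import statistics

# Hypothetical order amounts; -999 stands in for "missing", 400 is a real outlier.
raw = [120, 135, 128, 142, -999, 131, 138, -999, 125, 400]
SENTINELS = {-999}


def iqr_bounds(values):
    """Return the (low, high) fences of the 1.5 * IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def outliers(values):
    """Return the values that fall outside the IQR fences, in input order."""
    low, high = iqr_bounds(values)
    return [v for v in values if v < low or v > high]


# Handling missing values first removes the placeholders from the statistics.
cleaned = [v for v in raw if v not in SENTINELS]

print("mean with sentinels:       ", statistics.mean(raw))
print("mean without sentinels:    ", statistics.mean(cleaned))
print("outliers with sentinels:   ", outliers(raw))
print("outliers without sentinels:", outliers(cleaned))
```

With the sentinels left in, only the placeholders are flagged and the genuine outlier (400) is missed. Once they are removed, 400 is the only value flagged.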

3.3 Review & Normalize gate

Before any tool page accepts a file, the file passes through a CSV-normalization gate. The gate scans every uploaded file, surfaces every data-quality issue our analyzer can detect, and lets you choose how to handle each one before downstream tools see the data.

How it works

Upload a file on the home page. The analyzer scans it and counts findings by confidence tier.
Click any tool. If the file hasn't been normalized yet, you're redirected to the Review & Normalize page.
The page shows every finding grouped by severity and confidence, with a per-finding decision control.
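The flow above could be sketched roughly as follows. This is an illustration only, not the bundle's code: the Finding fields, the triage() helper, and the gate_is_open() rule are assumptions modeled on the severity and confidence vocabulary used in this guide.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    """Illustrative finding record (hypothetical field names)."""
    code: str
    severity: str       # "info", "warning", or "error"
    confidence: str     # "high", "medium", or "low"
    waived: bool = False


def triage(findings):
    """Split findings into auto-fixable, needs-review, and blocking groups."""
    auto, review, blocking = [], [], []
    for f in findings:
        if f.severity == "error":
            if not f.waived:          # a waived error no longer blocks
                blocking.append(f)
        elif f.confidence == "high":
            auto.append(f)            # safe to apply without asking
        else:
            review.append(f)          # medium/low: the user decides
    return auto, review, blocking


def gate_is_open(findings):
    """Tool pages are reachable only when no unwaived error remains."""
    _, _, blocking = triage(findings)
    return not blocking


findings = [
    Finding("bom_present", "warning", "high"),
    Finding("null_like_sentinel", "warning", "medium"),
    Finding("replacement_char", "error", "low"),
]
auto, review, blocking = triage(findings)
print([f.code for f in auto], [f.code for f in review], [f.code for f in blocking])
print("gate open:", gate_is_open(findings))
```

In this sketch, waiving the error (waived=True) is enough to open the gate; the real page may also require decisions on the review items.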

Confidence tiers

High: round-trip-safe algorithmic fixes (BOM strip, whitespace trim, NBSP / zero-width strip, smart-quote fold, line-ending normalization, header cleanup). One-click "Auto-fix high-confidence" applies them all.
Medium: the right call in the common case, but with known false-positive shapes. Examples: lowercasing the email column, replacing null-like sentinels (N/A, -, nan), repairing unquoted-currency rows. Preview the change before applying.
Low: heuristic fixes that can corrupt data when wrong, such as mojibake repair (cafÃ© → café) and mixed-encoding detection. Off by default; you opt in per finding.
Error: blocking. Examples: empty file, unrepairable rows, U+FFFD replacement characters. You cannot open the tool pages until each error is resolved or explicitly waived.
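To make the High tier concrete, here is a minimal sketch of that kind of fix applied to a single cell. clean_cell() and its constants are hypothetical names, and turning NBSP into a regular space (rather than deleting it) is a choice made for this example. The bundle's own fix registry may behave differently.

```python
# Characters treated as invisible noise in this sketch.
ZERO_WIDTH = "\u200b\u200c\u200d\u2060"
# Curly quotes folded to their ASCII equivalents.
SMART_QUOTES = {"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'}


def clean_cell(text: str) -> str:
    """Apply deterministic, repeatable text fixes to one cell value."""
    text = text.lstrip("\ufeff")                                 # BOM strip
    text = text.replace("\u00a0", " ")                           # NBSP -> space
    text = text.translate({ord(c): None for c in ZERO_WIDTH})    # zero-width strip
    text = text.translate(str.maketrans(SMART_QUOTES))           # smart-quote fold
    text = text.replace("\r\n", "\n").replace("\r", "\n")        # line endings
    return text.strip()                                          # whitespace trim


dirty = "\ufeff  \u201cAcme\u200b\u00a0Corp\u201d \r\n"
print(repr(clean_cell(dirty)))
```

Running clean_cell() a second time on its own output changes nothing, which is the property that makes fixes of this kind safe to apply automatically.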

Encoding override

When the analyzer reports encoding_uncertain or you spot mojibake (Ã©) or <EFBFBD> characters in the findings list, use the File encoding picker at the top of the Review page. Pick the right code page (cp1252 for Western Excel exports, KOI8-R for older Russian data, Big5 for traditional Chinese, etc.) or type a custom one, then click Re-analyze. Findings refresh against the corrected decode.

The picker is hidden for .xlsx files since Excel stores text as Unicode internally.
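The snippet below shows where mojibake comes from and why choosing the right code page fixes it. It uses only Python's built-in codecs. decode_with_override() is a hypothetical helper written for this guide, and its cp1252 fallback is an assumption, not the bundle's documented behavior.

```python
# How mojibake arises: UTF-8 bytes read back as cp1252.
original = "café"
utf8_bytes = original.encode("utf-8")
garbled = utf8_bytes.decode("cp1252")                 # 'cafÃ©'
repaired = garbled.encode("cp1252").decode("utf-8")   # back to 'café'
print(garbled, "->", repaired)


def decode_with_override(data: bytes, override=None):
    """Decode bytes, honoring an explicit code page when one is given.

    Without an override, strict UTF-8 is tried first; if that fails,
    this sketch falls back to cp1252 (an assumption for the example).
    """
    if override:
        return data.decode(override), override
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return data.decode("cp1252", errors="replace"), "cp1252"


# A Western Excel export: not valid UTF-8, so the fallback kicks in.
print(decode_with_override("café".encode("cp1252")))

# Older Russian data: the cp1252 fallback gives nonsense, the override fixes it.
koi8_bytes = "Привет".encode("koi8-r")
print(decode_with_override(koi8_bytes)[0])
print(decode_with_override(koi8_bytes, override="koi8-r")[0])
```

The KOI8-R case is the one the picker exists for: no automatic guess recovers the text, but naming the code page does.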

Advanced output options

After you apply your decisions, an ⚙️ Advanced output options expander appears with the download. Its three dropdowns let you tune the output file format:

Encoding (code page) — UTF-8 (default), UTF-8 with BOM (Excel-friendly), Windows-1252, Latin-1, Latin-9, cp1250, ISO-8859-2, cp1251, Shift_JIS, GB18030, Big5, EUC-KR, UTF-16 LE.
Delimiter — comma (default), tab, semicolon, pipe.
Line terminator — LF (default), CRLF (Windows), CR.

The download filename auto-adjusts the extension (.tsv for tab, otherwise .csv). When the chosen encoding can't represent a character (Cyrillic content into cp1252, Asian script into Latin-1), the page shows a warning naming the offending character and falls back to ? replacement so the download still works.
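The output options map naturally onto Python's csv module and codec error handlers. The sketch below is an assumption about how such a step could work, not the bundle's implementation: render_csv() and encode_for_download() are hypothetical names.

```python
import csv
import io


def render_csv(rows, delimiter=",", lineterminator="\n"):
    """Serialize rows to CSV text with the chosen delimiter and line ending."""
    buf = io.StringIO(newline="")   # newline="" keeps our terminator untouched
    writer = csv.writer(buf, delimiter=delimiter, lineterminator=lineterminator)
    writer.writerows(rows)
    return buf.getvalue()


def encode_for_download(text: str, encoding: str):
    """Encode text; unencodable characters become '?' and are reported."""
    offenders = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            if ch not in offenders:
                offenders.append(ch)
    return text.encode(encoding, errors="replace"), offenders


rows = [["name", "city"], ["Zoë", "Иван"]]
text = render_csv(rows, delimiter="\t", lineterminator="\r\n")
data, offenders = encode_for_download(text, "cp1252")
print(data)
print("could not be represented:", offenders)
```

The offenders list is what a warning would name: here the Cyrillic letters have no cp1252 equivalent, so each one is written as ? while the rest of the file is preserved.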

4. Output

Every script writes:

A cleaned output file next to the input (or wherever you specify).
A timestamped log file in the logs/ folder showing what changed and why.

Reports from validator_reporter go to the reports/ folder as PDF or Excel.

The GUI also displays the output preview in-browser before any file is written. The original input file is never modified.

5. Troubleshooting

The GUI won't launch / browser doesn't open:

Wait 10-15 seconds after double-clicking. The local server takes a moment to start the first time.
If the browser doesn't open automatically, manually visit http://localhost:8501 in your browser.
If you see a "port in use" error, another program is using port 8501. Close other instances of the bundle and try again.

"Why is my browser opening?" / "Why does this need internet?": This tool runs as a local web app. The browser is just the display; nothing is uploaded, nothing leaves your computer. No internet connection is used after install. This is the same approach used by many modern data tools (Jupyter notebooks, RStudio, etc.).

Windows: "Windows protected your PC" SmartScreen warning: Click "More info" then "Run anyway." This is a standard warning for software without an extended-validation Windows code signing certificate.

macOS: "App is damaged and cannot be opened": This usually indicates the download was corrupted. Re-download from the link in your purchase email.

Linux: AppImage will not run: Make sure it is executable: chmod +x BundleName-1.0.AppImage. If it still fails, your distribution may be missing FUSE; install with sudo apt install libfuse2 (Debian/Ubuntu) or use the .tar.gz fallback.

Script throws an error about a file: Check the log file in the logs/ folder. The log explains exactly what went wrong and which row of input data triggered it.

The GUI feels slow on a large file: Files over ~100,000 rows take longer to process. The GUI shows a progress bar. If you have very large files (millions of rows) consider using the CLI directly, which is faster for batch jobs.

Need help: Email the address on your purchase receipt.

6. License

Single-user license. Do not redistribute. See LICENSE.txt in the install folder.
