# USER-GUIDE.md - Excel & CSV Data Cleaning Mastery Bundle

**Version**: 1.6
**Last updated**: April 28, 2026

Thank you for purchasing the Data Cleaning Mastery Bundle. This guide covers installation and every script included.

---

## 1. Installation

The bundle is fully self-contained. **You do not need to install Python.**

### Windows

1. Download `BundleName-Setup-1.0.exe` from your purchase email.
2. Double-click the installer.
3. Follow the wizard. The installer creates a desktop shortcut named "Launch Bundle" and a Start Menu entry.
4. Launch via the desktop shortcut. Your default browser will open to a local page (typically `http://localhost:8501`) where the data tool runs.

### macOS

1. Download `BundleName-1.0.dmg` from your purchase email.
2. Double-click the `.dmg` to mount it.
3. Drag the Bundle app into the Applications folder.
4. Launch from Applications, Spotlight, or Launchpad. Your default browser will open to a local page where the data tool runs.

The app is signed and notarized, so macOS opens it without security warnings.

### Linux

1. Download `BundleName-1.0.AppImage` from your purchase email.
2. Make it executable: `chmod +x BundleName-1.0.AppImage`
3. Double-click to run, or execute from a terminal. Your default browser will open to a local page where the data tool runs.

If the AppImage doesn't work on your distribution, a `.tar.gz` fallback is available in your purchase email. Extract it and run `./run.sh` from the extracted folder.

### How the GUI works (important to know)

This tool runs in your browser **locally on your computer**. When you launch it, a small program starts a local server on your machine and opens your default browser to view it. This is normal and expected.

- **No internet is required.** Your data never leaves your computer.
- **Your data is not uploaded anywhere.** All processing happens on your machine.
- The browser is just the display surface. Closing the browser closes the GUI; the underlying program also stops.

If you prefer the command line, every script also ships as a CLI tool. See Section 3.2.

### Requirements

- Windows: Windows 10 or 11 (64-bit).
- macOS: macOS 11 Big Sur or later (Apple Silicon or Intel).
- Linux: any modern 64-bit distribution from 2020 onward.
- A modern default browser (Chrome, Edge, Firefox, or Safari from the last 3 years).
- ~400-500 MB free disk space.
- Internet connection: not required.

For the full numbered list of what's supported (file sizes, code pages, delimiters, performance targets, detector list, etc.), see [REQUIREMENTS.md](REQUIREMENTS.md).

---

## 2. What's Included

**Scripts (in the `scripts/` folder)**:

| # | Script | Purpose | Status |
|---|---|---|---|
| 01 | `01_deduplicator.py` | Smart duplicate removal: exact match + basic fuzzy, configurable subset columns, full logs | Working |
| 02 | `02_text_cleaner.py` | Character-level hygiene: trim leading/trailing whitespace, collapse internal multi-spaces, strip non-printable characters, Unicode normalization (smart quotes, em-dashes, accents), remove zero-width characters, BOM handling, line-ending normalization, case operations | Working |
| 03 | `03_format_standardizer.py` | Standardize dates, currencies, names, phone numbers, addresses | Skeleton |
| 04 | `04_missing_value_handler.py` | Detect and handle missing values: disguised nulls (`N/A`, `-`, blanks, sentinel codes), imputation (mean/median/mode/forward-fill), required-field enforcement, drop-by-threshold | Skeleton |
| 05 | `05_column_mapper_enforcer.py` | Rename columns, enforce a target schema | Skeleton |
| 06 | `06_outlier_detector.py` | Detect and flag statistical outliers (z-score, IQR, modified z-score), multivariate detection, domain-rule violations, optional winsorization | Skeleton |
| 07 | `07_multi_file_merger.py` | Merge multiple CSV or Excel files into one | Skeleton |
| 08 | `08_validator_reporter.py` | Validate data against rules, output PDF or Excel report | Skeleton |
| 09 | `09_master_orchestrator.py` | One-click launcher menu, calls any other script | Skeleton |

**Sample data (in the `samples/` folder)**:
- `messy_sales.csv` - intentionally dirty sales data for testing.
- `bank_export.xlsx` - sample bank export for testing missing-value handling and outlier detection.

---

## 3. Usage

You have two ways to use the bundle: the GUI (recommended for most users) or the CLI (for power users and automation).

### 3.1 GUI usage (recommended)

1. Launch the bundle via the desktop shortcut, app icon, or AppImage.
2. Your browser opens to the bundle's home page.
3. Select the script you want to use from the sidebar (Deduplicator, Format Standardizer, etc.).
4. Drop your file into the file uploader, or select from the included samples.
5. Sensible defaults are pre-filled. Click "Run" to see a preview of what the script will do.
6. Review the preview. If it looks right, click "Save Output" to write the cleaned file.

The GUI is designed to work out of the box with zero configuration. Advanced options are tucked into expandable "Advanced" panes for users who want them.

### 3.2 CLI usage

All scripts are also CLI tools with `--help` output.

**Basic usage** (from a terminal):

Windows (the bundle adds CLI tools to your PATH):
```
deduplicator samples\messy_sales.csv
```

macOS / Linux:
```
deduplicator samples/messy_sales.csv
```

**With options**:

```
deduplicator samples/messy_sales.csv --output cleaned.csv --subset email,phone
```

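To make the `--subset email,phone` option concrete, here is a minimal Python sketch of exact-match de-duplication keyed on a subset of columns. It is an illustration only, not the bundle's implementation: the function name `dedupe_rows`, the sample rows, and the choice to keep the first occurrence are all assumptions, and the fuzzy matching mentioned in Section 2 is not covered.

```python
import csv
import io

def dedupe_rows(rows, subset):
    """Keep the first row seen for each distinct combination of the subset columns."""
    seen = set()
    kept = []
    for row in rows:
        key = tuple(row[col] for col in subset)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept

# Hypothetical sample data; the real messy_sales.csv columns may differ.
sample = io.StringIO(
    "name,email,phone\n"
    "Ann,ann@example.com,555-0100\n"
    "Ann B.,ann@example.com,555-0100\n"
    "Bob,bob@example.com,555-0101\n"
)
rows = list(csv.DictReader(sample))
unique = dedupe_rows(rows, subset=["email", "phone"])
print([r["name"] for r in unique])  # ['Ann', 'Bob']
```

Rows that differ only in columns outside the subset (here, `name`) count as duplicates, which is why "Ann B." is dropped.
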
**Get help on any script**:

```
deduplicator --help
```

**Recommended run order**: If you run scripts individually, run `02_text_cleaner` first to normalize whitespace and special characters, then `04_missing_value_handler` *before* `06_outlier_detector`. Outlier detection on data that still contains blanks or sentinel codes (like `-999`) is unreliable because the placeholders distort the statistics: means are dragged toward the sentinel, the IQR widens, and genuine outliers go undetected. The Master Orchestrator (script 09) runs the scripts in the correct order automatically.

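The effect of leftover sentinels can be seen in a few lines of Python. This is a sketch using only the standard library and the common 1.5 × IQR fence rule; the sample values and the helper `iqr_bounds` are made up for illustration and are not taken from `06_outlier_detector.py`.

```python
import statistics

def iqr_bounds(values):
    """Return the (lower, upper) fences of the 1.5 * IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Hypothetical column: ten ordinary values plus one genuine outlier (95).
clean = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 95]
# The same column with four missing values stored as the sentinel -999.
with_sentinels = clean + [-999, -999, -999, -999]

lo_clean, hi_clean = iqr_bounds(clean)
lo_dirty, hi_dirty = iqr_bounds(with_sentinels)

print(95 > hi_clean)    # True: the outlier is flagged on clean data
print(95 > hi_dirty)    # False: the widened fence hides it
print(-999 < lo_dirty)  # False: even the sentinels are not flagged
print(statistics.mean(clean), statistics.mean(with_sentinels))
```

With the sentinels present, the first quartile lands on `-999`, the fences stretch far past the real data, and the mean turns negative. Handling missing values first avoids all three effects.
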
---

## 3.3 Review & Normalize gate

Before any tool page accepts a file, the file passes through a **CSV-normalization gate**. The gate scans every uploaded file, surfaces every data-quality issue the analyzer can detect, and lets you choose how to handle each one before downstream tools see the data.

### How it works

1. Upload a file on the home page. The analyzer scans it and counts findings by confidence tier.
2. Click any tool. If the file hasn't been normalized yet, you're redirected to the **Review & Normalize** page.
3. The page shows every finding grouped by severity and confidence, with a per-finding decision control.

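The flow above can be pictured as a small data model. The sketch below is purely illustrative: the `Finding` class, the finding codes, and the decision states are hypothetical names chosen for this example, and the real gate may track more than this. It only shows the two ideas from the steps, counting findings by tier and blocking while an error-tier finding is undecided.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Finding:
    code: str                  # hypothetical finding identifier
    tier: str                  # "high", "medium", "low", or "error"
    decision: str = "pending"  # hypothetical states: "pending", "fixed", "waived"

def count_by_tier(findings):
    """Count findings per confidence tier, as shown after upload."""
    return Counter(f.tier for f in findings)

def blocks_tools(findings):
    """In this sketch, only undecided error-tier findings block the tool pages."""
    return any(f.tier == "error" and f.decision == "pending" for f in findings)

findings = [
    Finding("bom_present", "high"),
    Finding("null_like_sentinel", "medium"),
    Finding("mojibake_suspected", "low"),
    Finding("replacement_character", "error"),
]
print(count_by_tier(findings))
print(blocks_tools(findings))   # True: the error finding is still pending

findings[3].decision = "waived"
print(blocks_tools(findings))   # False: the error was explicitly waived
```
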
### Confidence tiers

- **High**: round-trip-safe algorithmic fixes (BOM strip, whitespace trim, NBSP / zero-width strip, smart-quote fold, line-ending normalization, header cleanup). One click on "Auto-fix high-confidence" applies them all.
- **Medium**: fixes that are right in the common case but have known false-positive shapes. Examples: lowercasing the email column, replacing null-like sentinels (`N/A`, `-`, `nan`), repairing unquoted-currency rows. Preview the change before applying.
- **Low**: heuristic fixes that can corrupt data when wrong. Examples: mojibake repair (`cafÃ©` → `café`), mixed-encoding detection. Off by default; you opt in per finding.
- **Error**: blocking. Examples: empty file, unrepairable rows, U+FFFD replacement characters. You cannot enter the tool pages until these are resolved or explicitly waived.

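As a rough idea of what a high-confidence fix looks like at the cell level, here is a minimal Python sketch covering a subset of the list above (BOM strip, zero-width strip, NBSP handling, smart-quote fold, whitespace trim). The function name `clean_cell` and the exact character sets are assumptions for illustration; in particular, turning NBSP into a regular space rather than deleting it is a choice made for this sketch.

```python
# Characters removed outright: zero-width space and joiners, word joiner, and U+FEFF (BOM).
INVISIBLE = "\u200b\u200c\u200d\u2060\ufeff"

# Characters folded to a plain ASCII equivalent: curly quotes and NBSP.
FOLD = {
    "\u2018": "'", "\u2019": "'",
    "\u201c": '"', "\u201d": '"',
    "\u00a0": " ",
}

TABLE = str.maketrans({**FOLD, **{ch: None for ch in INVISIBLE}})

def clean_cell(value: str) -> str:
    """Apply the character-level fixes, then trim leading and trailing whitespace."""
    return value.translate(TABLE).strip()

dirty = "\ufeff  \u201cAcme\u201d\u00a0Ltd\u200b "
print(repr(clean_cell(dirty)))  # '"Acme" Ltd'
```

Each step is a fixed character mapping with no guessing involved, which is what separates this tier from the heuristic repairs in the Low tier.
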
### Encoding override
|
||
|
||
When the analyzer reports `encoding_uncertain` or you spot mojibake (`é`) or `<60>` characters in the findings list, use the **File encoding** picker at the top of the Review page. Pick the right code page (cp1252 for Western Excel exports, KOI8-R for older Russian data, Big5 for traditional Chinese, etc.) or type a custom one, then click **Re-analyze**. Findings refresh against the corrected decode.
|
||
|
||
The picker is hidden for `.xlsx` files since Excel stores text as Unicode internally.
|
||
|
||
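The two symptoms named above come from decoding bytes with the wrong code page. The following Python snippet reproduces both with the standard library only; it illustrates the general mechanism and makes no claim about how the analyzer itself detects encodings.

```python
# Case 1: a cp1252 file read as UTF-8 yields U+FFFD replacement characters.
cp1252_bytes = "café".encode("cp1252")            # b'caf\xe9'
wrong = cp1252_bytes.decode("utf-8", errors="replace")
right = cp1252_bytes.decode("cp1252")
print(repr(wrong))   # 'caf\ufffd'
print(right)         # café

# Case 2: a UTF-8 file read as cp1252 yields mojibake.
utf8_bytes = "café".encode("utf-8")               # b'caf\xc3\xa9'
mojibake = utf8_bytes.decode("cp1252")
print(mojibake)      # cafÃ©
```

In both cases the bytes on disk are fine; only the decode is wrong, which is why picking the correct code page and re-analyzing clears the findings without changing the file.
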
### Advanced output options

After you apply decisions, an `⚙️ Advanced output options` expander appears alongside the download. Three dropdowns let you tune the output file format:

- **Encoding (code page)**: UTF-8 (default), UTF-8 with BOM (Excel-friendly), Windows-1252, Latin-1, Latin-9, cp1250, ISO-8859-2, cp1251, Shift_JIS, GB18030, Big5, EUC-KR, UTF-16 LE.
- **Delimiter**: comma (default), tab, semicolon, pipe.
- **Line terminator**: LF (default), CRLF (Windows), CR.

The download filename auto-adjusts the extension (`.tsv` for tab, otherwise `.csv`). When the chosen encoding can't represent a character (Cyrillic content into cp1252, Asian script into Latin-1), the page shows a warning naming the offending character and falls back to `?` replacement so the download still works.

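The three options and the `?` fallback map closely onto Python's `csv` module and codec error handlers. The sketch below shows one way such an export could behave; `render_csv` is a hypothetical helper written for this guide, not the bundle's code.

```python
import csv
import io

def render_csv(rows, encoding="utf-8", delimiter=",", lineterminator="\n"):
    """Serialize rows, then encode; return (bytes, first unencodable character or None)."""
    buf = io.StringIO(newline="")
    writer = csv.writer(buf, delimiter=delimiter, lineterminator=lineterminator)
    writer.writerows(rows)
    text = buf.getvalue()
    try:
        return text.encode(encoding), None
    except UnicodeEncodeError as exc:
        offending = exc.object[exc.start]
        # Fall back to '?' replacement so the export still succeeds.
        return text.encode(encoding, errors="replace"), offending

rows = [["name", "city"], ["Иван", "Москва"]]

data, bad = render_csv(rows, encoding="cp1252", delimiter=";", lineterminator="\r\n")
print(bad)    # И
print(data)   # b'name;city\r\n????;??????\r\n'

data, bad = render_csv(rows, encoding="cp1251", delimiter=";", lineterminator="\r\n")
print(bad)    # None
```

In Python, the "UTF-8 with BOM" option corresponds to the `utf-8-sig` codec, which writes the BOM that Excel uses to recognize UTF-8.
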
---

## 4. Output

Every script writes:
- A cleaned output file next to the input (or wherever you specify).
- A timestamped log file in the `logs/` folder showing what changed and why.

Reports from `validator_reporter` go to the `reports/` folder as PDF or Excel.

The GUI also displays the output preview in-browser before any file is written. The original input file is never modified.

---

## 5. Troubleshooting

**The GUI won't launch / browser doesn't open**:
1. Wait 10-15 seconds after double-clicking. The local server takes a moment to start the first time.
2. If the browser doesn't open automatically, manually visit `http://localhost:8501` in your browser.
3. If you see a "port in use" error, another program is using port 8501. Close other instances of the bundle and try again.

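If you have Python available and want to confirm whether something is already listening on port 8501, a short standard-library check like the one below can help. It is an optional diagnostic sketch, not part of the bundle; the helper name `port_in_use` is made up for this example.

```python
import socket

def port_in_use(port=8501, host="127.0.0.1"):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0

print(port_in_use(8501))
```

`True` while the bundle is closed means another program holds the port; `False` while the bundle is supposedly running suggests the local server never started.
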
**"Why is my browser opening?" / "Why does this need internet?"**:
This tool runs as a local web app. The browser is just the display; nothing is uploaded, nothing leaves your computer. No internet connection is used after install. This is the same approach used by many modern data tools (Jupyter notebooks, RStudio, etc.).

**Windows: "Windows protected your PC" SmartScreen warning**:
Click "More info" then "Run anyway." This is a standard warning for software without an extended-validation Windows code signing certificate.

**macOS: "App is damaged and cannot be opened"**:
This usually indicates the download was corrupted. Re-download from the link in your purchase email.

**Linux: AppImage will not run**:
Make sure it is executable: `chmod +x BundleName-1.0.AppImage`. If it still fails, your distribution may be missing FUSE; install with `sudo apt install libfuse2` (Debian/Ubuntu) or use the `.tar.gz` fallback.

**Script throws an error about a file**:
Check the log file in the `logs/` folder. The log explains exactly what went wrong and which row of input data triggered it.

**The GUI feels slow on a large file**:
Files over ~100,000 rows take longer to process. The GUI shows a progress bar. If you have very large files (millions of rows), consider using the CLI directly, which is faster for batch jobs.

**Need help**: Email the address on your purchase receipt.

---

## 6. License

Single-user license. Do not redistribute. See `LICENSE.txt` in the install folder.