# USER-GUIDE.md - Excel & CSV Data Cleaning Mastery Bundle

**Version**: 1.6
**Last updated**: April 28, 2026

Thank you for purchasing the Data Cleaning Mastery Bundle. This guide covers installation and every script included.

---

## 1. Installation

The bundle is fully self-contained. **You do not need to install Python.**

### Windows

1. Download `BundleName-Setup-1.0.exe` from your purchase email.
2. Double-click the installer.
3. Follow the wizard. The installer creates a desktop shortcut named "Launch Bundle" and an entry in the Start Menu.
4. Launch via the desktop shortcut. Your default browser will open to a local page (typically `http://localhost:8501`) where the data tool runs.

### macOS

1. Download `BundleName-1.0.dmg` from your purchase email.
2. Double-click the `.dmg` to mount it.
3. Drag the Bundle app into the Applications folder.
4. Launch from Applications, Spotlight, or Launchpad. Your default browser will open to a local page where the data tool runs.

The app is signed and has been notarized by Apple, so Gatekeeper should open it without a security block. On first launch, macOS may still ask you to confirm that you want to open an app downloaded from the internet.

### Linux

1. Download `BundleName-1.0.AppImage` from your purchase email.
2. Make it executable: `chmod +x BundleName-1.0.AppImage`
3. Double-click to run, or execute from a terminal. Your default browser will open to a local page where the data tool runs.

If the AppImage doesn't work on your distribution, a `.tar.gz` fallback is available in your purchase email. Extract it and run `./run.sh` from the extracted folder.

### How the GUI works (important to know)

This tool runs in your browser **locally on your computer**. When you launch it, a small program starts a local server on your machine and opens your default browser to view it. This is normal and expected.

- **No internet is required.** Your data never leaves your computer.
- **Your data is not uploaded anywhere.** All processing happens on your machine.
- The browser is just the display surface. Closing the browser closes the GUI; the underlying program also stops.

If you prefer the command line, every script also ships as a CLI tool. See Section 3.

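The launch sequence described above (start a local server, then point the default browser at it) can be sketched with Python's standard library. This is only an illustration of the pattern, not the bundle's actual launcher: the handler class, the function names `start_local_server` and `launch`, and the placeholder page are all hypothetical.

```python
import http.server
import threading
import webbrowser


class _Page(http.server.BaseHTTPRequestHandler):
    """Serves one placeholder page; a real tool would render its UI here."""

    def do_GET(self):
        body = b"<h1>Local data tool (placeholder)</h1>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the console quiet


def start_local_server(port=0):
    """Start a server bound to 127.0.0.1 only and return (server, url).

    port=0 asks the OS for any free port (an assumption for this sketch;
    the guide mentions 8501 as the typical port).
    """
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), _Page)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = f"http://127.0.0.1:{server.server_address[1]}"
    return server, url


def launch(open_browser=True):
    """Start the server, then hand its URL to the default browser."""
    server, url = start_local_server()
    if open_browser:
        webbrowser.open(url)
    return server, url
```

Because the server binds to `127.0.0.1`, only programs on the same computer can reach it, which is why no internet connection is involved. Calling `launch(open_browser=False)` starts the server without opening a browser tab.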
### Requirements

- Windows: Windows 10 or 11 (64-bit).
- macOS: macOS 11 Big Sur or later (Apple Silicon or Intel).
- Linux: any modern 64-bit distribution from 2020 onward.
- A modern default browser (Chrome, Edge, Firefox, or Safari from the last 3 years).
- ~400-500 MB free disk space.
- Internet connection: not required.

---

## 2. What's Included

**Scripts (in the `scripts/` folder)**:

| # | Script | Purpose | Status |
|---|---|---|---|
| 01 | `01_deduplicator.py` | Smart duplicate removal: exact match + basic fuzzy, configurable subset columns, full logs | Working |
| 02 | `02_text_cleaner.py` | Character-level hygiene: trim leading/trailing whitespace, collapse internal multi-spaces, strip non-printable characters, Unicode normalization (smart quotes, em-dashes, accents), remove zero-width characters, BOM handling, line-ending normalization, case operations | Skeleton |
| 03 | `03_format_standardizer.py` | Standardize dates, currencies, names, phone numbers, addresses | Skeleton |
| 04 | `04_missing_value_handler.py` | Detect and handle missing values: disguised nulls (`N/A`, `-`, blanks, sentinel codes), imputation (mean/median/mode/forward-fill), required-field enforcement, drop-by-threshold | Skeleton |
| 05 | `05_column_mapper_enforcer.py` | Rename columns, enforce a target schema | Skeleton |
| 06 | `06_outlier_detector.py` | Detect and flag statistical outliers (z-score, IQR, modified z-score), multivariate detection, domain-rule violations, optional winsorization | Skeleton |
| 07 | `07_multi_file_merger.py` | Merge multiple CSV or Excel files into one | Skeleton |
| 08 | `08_validator_reporter.py` | Validate data against rules, output PDF or Excel report | Skeleton |
| 09 | `09_master_orchestrator.py` | One-click launcher menu, calls any other script | Skeleton |

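To make the character-level hygiene listed for `02_text_cleaner.py` more concrete, here is a minimal sketch of several of those steps using only the standard library. It is an assumption about how such cleaning could be done, not the script's actual code; the function name `clean_text` and the choice of NFC normalization are hypothetical.

```python
import re
import unicodedata

# Zero-width characters and the BOM, mapped to None so translate() drops them.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))

# Typographic punctuation mapped to plain ASCII equivalents.
PUNCTUATION = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # single smart quotes
    "\u201c": '"', "\u201d": '"',   # double smart quotes
    "\u2013": "-", "\u2014": "-",   # en dash, em dash
})


def clean_text(value: str) -> str:
    # Normalize line endings to "\n".
    text = value.replace("\r\n", "\n").replace("\r", "\n")
    # Remove zero-width characters and the BOM, then simplify punctuation.
    text = text.translate(ZERO_WIDTH).translate(PUNCTUATION)
    # Compose accents so "e" + combining acute becomes a single character.
    text = unicodedata.normalize("NFC", text)
    # Drop remaining control characters, keeping newlines and tabs.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    # Collapse runs of spaces or tabs, then trim both ends.
    text = re.sub(r"[ \t]{2,}", " ", text)
    return text.strip()
```

The order matters in this sketch: invisible characters are removed before whitespace is collapsed, so the gaps they leave behind are collapsed too.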

**Sample data (in the `samples/` folder)**:
- `messy_sales.csv` - intentionally dirty sales data for testing.
- `bank_export.xlsx` - sample bank export for testing missing-value handling and outlier detection.

---

## 3. Usage

You have two ways to use the bundle: the GUI (recommended for most users) or the CLI (for power users and automation).

### 3.1 GUI usage (recommended)

1. Launch the bundle via the desktop shortcut, app icon, or AppImage.
2. Your browser opens to the bundle's home page.
3. Select the script you want to use from the sidebar (Deduplicator, Format Standardizer, etc.).
4. Drop your file into the file uploader, or select from the included samples.
5. Sensible defaults are pre-filled. Click "Run" to see a preview of what the script will do.
6. Review the preview. If it looks right, click "Save Output" to write the cleaned file.

The GUI is designed to work out of the box with zero configuration. Advanced options are tucked into expandable "Advanced" panes for users who want them.

### 3.2 CLI usage

Every script is also available as a CLI tool, and each one supports `--help`.

**Basic usage** (from a terminal):

Windows (the bundle adds CLI tools to your PATH):
```
deduplicator samples\messy_sales.csv
```

macOS / Linux:
```
deduplicator samples/messy_sales.csv
```

**With options**:

```
deduplicator samples/messy_sales.csv --output cleaned.csv --subset email,phone
```

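The `--subset email,phone` option above implies that duplicates are judged on the selected columns only. The sketch below shows one way that idea, plus a basic fuzzy comparison, could work. It is not the deduplicator's actual algorithm: the function name `dedupe`, the key normalization, and the `fuzzy_threshold` parameter are assumptions made for illustration.

```python
import difflib


def dedupe(rows, subset, fuzzy_threshold=None):
    """Keep the first row for each distinct key built from the subset columns.

    rows: list of dicts (for example from csv.DictReader).
    subset: column names that define "the same record".
    fuzzy_threshold: optional ratio in (0, 1]; a key at least this similar
    to an already-kept key is treated as a duplicate too.
    """
    kept, kept_keys, removed = [], [], []
    for row in rows:
        # Normalize lightly so "A@x.com " and "a@x.com" compare equal.
        key = "|".join(str(row.get(col, "")).strip().lower() for col in subset)
        is_dup = key in kept_keys
        if not is_dup and fuzzy_threshold is not None:
            is_dup = any(
                difflib.SequenceMatcher(None, key, seen).ratio() >= fuzzy_threshold
                for seen in kept_keys
            )
        if is_dup:
            removed.append(row)
        else:
            kept.append(row)
            kept_keys.append(key)
    return kept, removed


rows = [
    {"name": "Ann Lee", "email": "ann@example.com", "phone": "555-0101"},
    {"name": "Ann L.", "email": "ANN@example.com ", "phone": "555-0101"},
    {"name": "Bob Ray", "email": "bob@example.com", "phone": "555-0102"},
]
kept, removed = dedupe(rows, subset=["email", "phone"])
print(len(kept), "kept,", len(removed), "removed")  # 2 kept, 1 removed
```

The two "Ann" rows differ in `name` but share the same email and phone after normalization, so only the first is kept. Returning the removed rows separately mirrors the guide's promise of full logs: nothing is discarded silently.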
**Get help on any script**:

```
deduplicator --help
```

**Recommended run order**: If you run scripts individually, run `02_text_cleaner` first to normalize whitespace and special characters, then run `04_missing_value_handler` *before* `06_outlier_detector`. Outlier detection on data that still contains blanks or sentinel codes (like `-999`) is unreliable because missing-value placeholders distort the statistics: the mean is dragged toward the placeholder, the standard deviation and IQR widen, and genuine outliers are missed (false negatives). The Master Orchestrator (script 09) runs the scripts in the correct order automatically.

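The effect of leftover sentinel codes on outlier detection can be seen in a small worked example. The data, the `-999` sentinel, and the helper `zscore_outliers` are made up for illustration; this is a plain z-score check under those assumptions, not the code in `06_outlier_detector.py`.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []
    return [v for v in values if abs(v - mean) / stdev > threshold]


SENTINEL = -999  # placeholder meaning "missing" in this hypothetical export

# 18 ordinary amounts, one genuine outlier (250), three missing-value codes.
amounts = [48, 52, 50, 49, 51, 47, 53, 50, 49, 51,
           50, 48, 52, 50, 49, 51, 50, 50, 250,
           SENTINEL, SENTINEL, SENTINEL]

raw = zscore_outliers(amounts)
cleaned = zscore_outliers([v for v in amounts if v != SENTINEL])
print("with sentinels:", raw)
print("sentinels removed:", cleaned)
```

With the sentinels still present, the standard deviation is so inflated that nothing crosses the threshold, including 250. Once the sentinels are filtered out, 250 is the only value flagged.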

---

## 4. Output

Every script writes:
- A cleaned output file next to the input (or wherever you specify).
- A timestamped log file in the `logs/` folder showing what changed and why.

Reports from `validator_reporter` go to the `reports/` folder as PDF or Excel.

The GUI also displays the output preview in-browser before any file is written. The original input file is never modified.

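The output rules above (cleaned file next to the input, timestamped log in `logs/`, input never overwritten) can be expressed as a small path-planning helper. The naming scheme here is an assumption: the `_cleaned` suffix, the timestamp format, and the function name `plan_outputs` are hypothetical and may not match what the scripts actually write.

```python
from datetime import datetime
from pathlib import Path


def plan_outputs(input_path, output=None, log_dir="logs", now=None):
    """Return (output_path, log_path) without touching the input file."""
    src = Path(input_path)
    now = now or datetime.now()
    # Default: a sibling of the input; the "_cleaned" suffix is an assumption.
    out = Path(output) if output else src.with_name(f"{src.stem}_cleaned{src.suffix}")
    # Refuse to write over the original, matching "never modified" above.
    if out.resolve() == src.resolve():
        raise ValueError("output would overwrite the input file")
    stamp = now.strftime("%Y%m%d_%H%M%S")
    log = Path(log_dir) / f"{src.stem}_{stamp}.log"
    return out, log
```

For example, `plan_outputs("samples/messy_sales.csv")` would plan `samples/messy_sales_cleaned.csv` as the output, and a log file under `logs/` named after the input and the current time.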

---

## 5. Troubleshooting

**The GUI won't launch / browser doesn't open**:
1. Wait 10-15 seconds after double-clicking. The local server takes a moment to start the first time.
2. If the browser doesn't open automatically, manually visit `http://localhost:8501` in your browser.
3. If you see a "port in use" error, another program is using port 8501. Close other instances of the bundle and try again.

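If you have Python available separately, a quick way to check whether anything is listening on port 8501 is a short socket probe like the one below. This is a generic diagnostic sketch, not a tool shipped with the bundle; the function name `port_in_use` is hypothetical.

```python
import socket


def port_in_use(port, host="127.0.0.1", timeout=0.5):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


if port_in_use(8501):
    print("Port 8501 is in use: the tool (or another program) is listening.")
else:
    print("Nothing is listening on port 8501.")
```

If the port is in use but the tool's page does not load at `http://localhost:8501`, another program probably holds the port.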

**"Why is my browser opening?" / "Why does this need internet?"**:
This tool runs as a local web app, and it does not need internet. The browser is just the display; nothing is uploaded, and nothing leaves your computer. No internet connection is used after install. This is the same approach used by many modern data tools (Jupyter notebooks, RStudio Server, etc.).

**Windows: "Windows protected your PC" SmartScreen warning**:
Click "More info", then "Run anyway." This is a standard warning for software without an extended-validation Windows code signing certificate.

**macOS: "App is damaged and cannot be opened"**:
This usually indicates the download was corrupted. Re-download from the link in your purchase email.

**Linux: AppImage will not run**:
Make sure it is executable: `chmod +x BundleName-1.0.AppImage`. If it still fails, your distribution may be missing FUSE; install it with `sudo apt install libfuse2` (Debian/Ubuntu) or use the `.tar.gz` fallback.

**Script throws an error about a file**:
Check the log file in the `logs/` folder. The log explains exactly what went wrong and which row of input data triggered it.

**The GUI feels slow on a large file**:
Files over ~100,000 rows take longer to process. The GUI shows a progress bar. If you have very large files (millions of rows), consider using the CLI directly, which is faster for batch jobs.

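For batch jobs, the CLI can be driven from a short script. The sketch below builds one `deduplicator` command per CSV file in a folder, using only the `--output` flag shown in Section 3.2. The helper names `build_commands` and `run_batch` and the folder layout are assumptions; check `deduplicator --help` for the options your version actually supports.

```python
import subprocess
from pathlib import Path


def build_commands(input_dir, output_dir, tool="deduplicator"):
    """Build one command per CSV file, writing results into output_dir."""
    out_dir = Path(output_dir)
    return [
        [tool, str(src), "--output", str(out_dir / src.name)]
        for src in sorted(Path(input_dir).glob("*.csv"))
    ]


def run_batch(input_dir, output_dir):
    """Run the tool over every CSV file in input_dir, one file at a time."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    for cmd in build_commands(input_dir, output_dir):
        # check=True stops the batch at the first failing file.
        subprocess.run(cmd, check=True)
```

Writing results to a separate output folder keeps the originals untouched, and stopping at the first failure makes it easy to find the matching entry in `logs/`.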

**Need help**: Email the address on your purchase receipt.

---

## 6. License

Single-user license. Do not redistribute. See `LICENSE.txt` in the install folder.