datatools-dev/docs/USER-GUIDE.md
Michael fd9606c67b build: drop the local Python release method, return to CI-only installer builds
Removes the single-command Python packaging method (build/make_release.py
+ build/build_portable_zip.py + build/macos/build_zip.sh) and the portable
.zip artifacts it produced. Release builds go back to the original GitHub
Actions process: the CI matrix builds one installer per platform (.dmg /
.exe / .AppImage) on tag push and attaches them to a GitHub Release.

Tesseract OCR bundling is preserved: the fetch helpers the workflow depends
on (fetch_tessdata, fetch_tesseract_for_platform) are extracted into a
standalone build/tesseract.py, which build.yml now imports.

Docs (README, build/README, DEVELOPER, TECHNICAL, USER-GUIDE, vendor README,
es translations) updated to drop the portable-zip flavor and point at the
new module.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-22 17:47:36 +00:00

> 🌐 **Language:** English · [Español](USER-GUIDE.es.md)
# User Guide
**Version**: 1.6 · **Updated**: 2026-05-01
## 0. First launch — activation
DataTools must be activated before any tools unlock. On first launch you'll see the **Activate** screen.
Enter your full name + email, paste the license blob from your purchase email (starts with `DTLIC1:`), and click **Activate**. Renewal works the same way — paste the renewal blob, click **Apply renewal**.
**Tiers**:
| Tier | Tools |
|---|---|
| **Lite** | Find Duplicates · Clean Text · Standardize Formats |
| **Core** | All 9 tools |
A Lite user opening a Core-only tool sees an "Upgrade your license" prompt. The home page also shows a 🔒 Locked badge on tool cards your tier doesn't unlock. To upgrade, paste a Core blob on the Activate page.
Every license lasts 1 year. The sidebar shows your tier and days remaining at all times; a renewal warning appears 30 days before expiry. The license file lives at `~/.datatools/license.json` (Windows: `C:\Users\<you>\.datatools\license.json`).
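The license location and the 30-day renewal window above can be sketched in a few lines of Python. This is only an illustration: `license_path`, `days_remaining`, and `needs_renewal_warning` are hypothetical helper names, and because this guide does not describe the contents of `license.json`, the expiry date is passed in rather than read from the file.

```python
from datetime import date
from pathlib import Path

RENEWAL_WARNING_DAYS = 30  # warning window stated in this guide


def license_path(home=None):
    """Location given in this guide: <home>/.datatools/license.json.

    Path.home() resolves to C:\\Users\\<you> on Windows and ~ elsewhere.
    """
    base = Path(home) if home is not None else Path.home()
    return base / ".datatools" / "license.json"


def days_remaining(expires_on, today):
    """Whole days left before expiry; negative once the license has lapsed."""
    return (expires_on - today).days


def needs_renewal_warning(expires_on, today):
    """True inside the 30-day window before expiry (hypothetical helper)."""
    return 0 <= days_remaining(expires_on, today) <= RENEWAL_WARNING_DAYS
```

For example, a license expiring on 2027-01-31 has 30 days remaining on 2027-01-01, which is the first day the warning would show under this sketch.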
To use the same license on a different machine: deactivate this one (Activate page → **Deactivate this device**) and re-paste your blob on the new machine.
## 1. Install
You don't need Python and you don't need admin rights — the bundle ships its own interpreter and every dependency. Each OS gets a single installer that wires up the Desktop shortcut + Start Menu / Launchpad entry automatically.
### 1.1 Windows
**Installer (`DataTools-<ver>-win-setup.exe`)**
1. Download `DataTools-<ver>-win-setup.exe` from your release email or GitHub Releases.
2. Double-click the installer. On the first run Windows SmartScreen will say **"Windows protected your PC"** — click **More info** → **Run anyway**. (This warning only appears once per build until we have an EV code-signing cert.)
3. Accept the per-user install location (`%LOCALAPPDATA%\Programs\DataTools` by default — no admin prompt). Check **Create a desktop shortcut** if you want one (on by default).
4. Click **Install**, then **Finish**. The installer offers to launch DataTools immediately.
5. From now on launch from: **Start Menu → DataTools**, the **Desktop shortcut**, or just type `DataTools` into Windows Run (Win+R) / cmd.
To pin to the taskbar, launch the app once, right-click its icon in the taskbar, then **Pin to taskbar**. Windows requires this manual step — no installer is allowed to pin programmatically.
**Uninstall**: Settings → Apps → DataTools → Uninstall.
### 1.2 macOS
**Installer DMG (`DataTools-<ver>-mac.dmg`)**
1. Download `DataTools-<ver>-mac.dmg`.
2. Double-click the .dmg. A Finder window opens showing the **DataTools** icon and an **Applications** alias.
3. Drag **DataTools** onto **Applications**. Wait for the copy to finish, then eject the DMG.
4. On unsigned builds the first launch shows **"DataTools" cannot be opened because the developer cannot be verified**. Fix: right-click DataTools in /Applications → **Open** → confirm **Open** in the dialog. macOS remembers this choice — subsequent launches are clean.
5. Launch from **Launchpad**, **Spotlight** (`⌘ Space` → type "DataTools"), or **Applications** in Finder.
To keep DataTools in the Dock: launch the app, right-click its Dock icon → **Options → Keep in Dock**. macOS doesn't allow installers to pin to the Dock automatically.
**Uninstall**: drag `DataTools.app` to the Trash. Your data files stay where you put them — nothing else is installed.
### 1.3 Linux
`DataTools-<ver>-linux-x86_64.AppImage` is already portable — no separate zip needed.
1. Download the .AppImage.
2. `chmod +x DataTools-*.AppImage`.
3. Double-click, or run it from a terminal.
If your distro doesn't ship FUSE 2: `sudo apt install libfuse2` (Debian/Ubuntu) or equivalent.
### 1.4 What happens on first launch
The launcher (called `DataTools.exe` / `DataTools.app` / `DataTools.AppImage`) does three things, in order:
1. Picks a free TCP port on `127.0.0.1` — usually 8501, falls back through 8502, 8503, … if another app is using 8501.
2. Starts a local Streamlit server on that port. The server is **bound to localhost only**, never to your LAN.
3. Opens your default browser at `http://127.0.0.1:<port>/`. If the browser doesn't open within 5 seconds, paste that URL into your browser manually.
The launcher window stays open in the background. Closing it stops the server — the browser tab will say "this site can't be reached" the next time you click it.
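For readers curious how step 1 can work, here is a minimal sketch of a localhost port probe. It is not the launcher's actual code: `pick_free_port` is a hypothetical name, and the upper bound of 8550 is taken from the Troubleshooting section of this guide.

```python
import socket

HOST = "127.0.0.1"   # localhost only, never the LAN
FIRST_PORT = 8501
LAST_PORT = 8550     # upper bound taken from the Troubleshooting section


def pick_free_port(first=FIRST_PORT, last=LAST_PORT, host=HOST):
    """Return the first port in [first, last] that can be bound on host."""
    for port in range(first, last + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
            try:
                probe.bind((host, port))
            except OSError:
                continue  # port is taken, try the next one
            return port   # the probe socket is closed on the way out
    raise RuntimeError(f"no free port between {first} and {last}")


if __name__ == "__main__":
    port = pick_free_port()
    print(f"http://{HOST}:{port}/")
```

A probe like this releases the port before the server binds it, so another process could in principle grab it in between. A real launcher would need to handle that case, for example by retrying.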
### 1.5 How the GUI works
- Runs locally on your machine. **No internet, no upload.**
- The browser is just the display surface. Closing it does NOT stop the app — close the launcher window (or quit the macOS .app from the Dock) to fully exit.
- Prefer the terminal? Every tool ships with a CLI too (Section 3).
### 1.6 System requirements
- Windows 10/11 (64-bit), macOS 11+, modern Linux (2020+).
- Modern browser (Chrome, Edge, Firefox, Safari, last 3 years).
- ~500 MB free disk space (the bundle itself is ~300 MB; the rest is working scratch space for large CSVs).
**OCR for scanned PDFs is bundled** — Tesseract 5.5 + the English `eng.traineddata` model ship inside every installer / portable / AppImage. The PDF Extractor's scanned-statement path works out of the box; no separate install required. (Developers running from a `pip install -r requirements.txt` checkout still need system Tesseract on `PATH` — see [DEVELOPER.md §PDF Extractor — bundled Tesseract](DEVELOPER.md#pdf-extractor--bundled-tesseract).)
Full numbered support matrix: [REQUIREMENTS.md](REQUIREMENTS.md).
## 2. What's included
| # | Tool | Purpose | Status |
|---|------|---------|--------|
| 01 | Find Duplicates | Exact + fuzzy match, 5 normalizers, audit | Ready |
| 02 | Clean Text | Whitespace, smart chars, BOM, line endings, case ops | Ready |
| 03 | Standardize Formats | Dates / phones / emails / addresses / names / currencies / booleans | Ready |
| 04 | Fix Missing Values | Disguised nulls, imputation, drop-by-threshold | Coming Soon |
| 05 | Map Columns | Rename + enforce schema | Coming Soon |
| 06 | Find Unusual Values | z-score, IQR, multivariate | Coming Soon |
| 07 | Combine Files | Combine multiple files | Coming Soon |
| 08 | Quality Check | Rules + PDF/Excel report | Coming Soon |
| 09 | Automated Workflows | One-click multi-tool launcher | Coming Soon |
**Sample data** (`samples/`): `messy_sales.csv`, `bank_export.xlsx`.
## 3. Usage
### 3.1 GUI (recommended)
1. Launch the bundle.
2. Pick a tool from the sidebar.
3. Drop your file (or select a sample).
4. Defaults are pre-filled — click **Run** to preview.
5. Click **Save Output** to write the cleaned file.
Advanced options are tucked in expander panes. The original file is never modified.
**In-tool Help**: every tool page has a **Help** button right of the title. Click it to open a popover with a compact how-to (When to use · Steps · Examples · Tip). Use it as a refresher mid-task — the popover closes when you click outside, and your inputs are untouched.
**Sidebar nav**: the sidebar groups tools into sections (Analysis, Data Cleaners, Transformations, Automations). Each section header shows `+` when collapsed and `−` when expanded — click the header to toggle.
### 3.2 CLI
```bash
deduplicator customers.csv [--apply]
text-cleaner messy.csv [--apply]
format-standardize feed.csv [--apply]
```
Get help: `deduplicator --help`. Full reference: [CLI-REFERENCE.md](CLI-REFERENCE.md).
### 3.3 Run order (when running tools manually)
If you skip Automated Workflows, follow this order:
1. **02 Clean Text** first — normalizes whitespace + special chars.
2. **03 Standardize Formats** — dates, phones, etc. need cleaned text.
3. **04 Fix Missing Values** — sentinel codes hide as numbers.
4. **05 Map Columns** — schema before outlier stats.
5. **06 Find Unusual Values** — needs clean numerics. Stats on data with `NaN` or `-999` are mathematically poisoned.
6. **07 Combine Files**, **08 Quality Check** as needed.
7. **01 Find Duplicates** is order-flexible (normalizes internally for matching).
Automated Workflows enforces this automatically.
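To see why step 5 calls stats on uncleaned data "poisoned", consider a small worked example. The readings below are made up for illustration, and `-999` stands in for a disguised null of the kind Fix Missing Values is meant to catch.

```python
from statistics import mean, pstdev

SENTINEL = -999.0  # disguised null, not a real measurement

# Hypothetical column: four real readings plus one sentinel.
readings = [20.0, 21.0, 19.0, 22.0, SENTINEL]


def z_scores(values):
    """Population z-score for every value in the list."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


poisoned_mean = mean(readings)                   # -183.4
clean = [v for v in readings if v != SENTINEL]
clean_mean = mean(clean)                         # 20.5

# With the sentinel present, the four real readings get nearly identical
# z-scores, so a genuine outlier among them would be hard to see.
poisoned = z_scores(readings)
cleaned = z_scores(clean)
```

One sentinel drags the mean from 20.5 to -183.4 and inflates the standard deviation, which is why missing values should be resolved before outlier detection runs.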
### 3.4 Language
The sidebar has a **Language / Idioma** picker. Two packs ship today:
- **English** (default)
- **Español**
Pick a language once — the choice persists for the session and the picker is visible from every page. Switch any time; the page re-renders in place with no data loss.
**Coverage** (v1.6): home page, tool cards, the upload + analysis panel, the findings list, the Review & Normalize gate prompt, the sidebar picker, and the shutdown screen. Per-tool page bodies (advanced-option labels, column-mapper prompts, dedup review labels) are tracked for future packs — they currently render in English in both modes. If a string you'd expect to switch doesn't, that's a missing pack key, not a bug in the picker; email support with a screenshot.
## 4. Review & Normalize gate
Every uploaded file is scanned before any tool sees it.
**Confidence tiers**:
- **High** — round-trip safe. One-click "Auto-fix high-confidence" applies them all.
- **Medium** — usually right, occasional false positives. Preview first.
- **Low** — heuristic. Off by default; opt in per finding.
- **Error** — blocks the gate (empty file, U+FFFD, unrepairable rows).
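The tier rules above can be pictured as simple filters over a list of findings. The `Finding` class and both function names below are hypothetical; this sketch shows the logic the tiers describe, not DataTools' internal data model.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    """Hypothetical shape of one scan result; not the app's real data model."""
    description: str
    tier: str  # "high", "medium", "low", or "error"


def auto_fix_candidates(findings):
    """Findings a one-click auto-fix would apply: high confidence only."""
    return [f for f in findings if f.tier == "high"]


def gate_is_blocked(findings):
    """Any error-tier finding blocks the gate."""
    return any(f.tier == "error" for f in findings)


findings = [
    Finding("trailing whitespace in 'name'", "high"),
    Finding("mixed date formats in 'joined'", "medium"),
    Finding("possible header row repeated", "low"),
]
```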
**Encoding override**: when the picker reports `encoding_uncertain` or you spot mojibake (`é`) or `<60>` chars, choose the right codepage at the top of the page (cp1252 for Western Excel, KOI8-R for older Russian, Big5 for traditional Chinese, …) → **Re-analyze**.
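The snippet below reproduces both symptoms with Python's built-in codecs, which may help when deciding which codepage to try. It is a standalone illustration and does not use any DataTools code.

```python
original = "café"

# Symptom 1: UTF-8 bytes decoded as cp1252 produce classic mojibake.
garbled = original.encode("utf-8").decode("cp1252")
assert garbled == "cafÃ©"

# Reversing the wrong step recovers the text.
repaired = garbled.encode("cp1252").decode("utf-8")
assert repaired == original

# Symptom 2: cp1252 bytes decoded as UTF-8 yield U+FFFD replacement chars.
legacy_bytes = original.encode("cp1252")               # b"caf\xe9"
as_utf8 = legacy_bytes.decode("utf-8", errors="replace")
assert as_utf8 == "caf\ufffd"

# Choosing the right codepage reads the same bytes correctly.
as_cp1252 = legacy_bytes.decode("cp1252")
assert as_cp1252 == original
```

In both cases the bytes on disk are fine and only the assumed encoding is wrong, which is why re-analyzing with a different codepage can fix the display without altering the file.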
**Advanced output**: an `⚙️` expander on the download lets you tune encoding, delimiter, and line terminator. The download filename auto-adjusts (`.tsv` for tab, `.csv` otherwise).
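As a rough picture of what these three settings control, here is a sketch built on Python's standard `csv` module. The function names `output_name` and `render` are hypothetical, and the app's own writer may differ.

```python
import csv
import io
from pathlib import Path


def output_name(input_name, delimiter):
    """Swap the extension: .tsv for tab-delimited output, .csv otherwise."""
    suffix = ".tsv" if delimiter == "\t" else ".csv"
    return Path(input_name).with_suffix(suffix).name


def render(rows, delimiter=",", line_terminator="\r\n", encoding="utf-8"):
    """Serialize rows with the chosen delimiter, terminator, and encoding."""
    buffer = io.StringIO(newline="")  # no newline translation on write
    writer = csv.writer(buffer, delimiter=delimiter,
                        lineterminator=line_terminator)
    writer.writerows(rows)
    return buffer.getvalue().encode(encoding)


rows = [["name", "city"], ["Ana", "Málaga"]]
tsv_bytes = render(rows, delimiter="\t", line_terminator="\n")
```

Here `output_name("sales.csv", "\t")` gives `sales.tsv`, matching the filename rule described above.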
## 5. Output
Every run writes:
- **Cleaned file** next to the input (or wherever you specify).
- **Audit file** (per-cell changes for text/format tools, match groups for dedup).
- **Timestamped log** in `logs/`.
Original input is never modified.
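One way to picture this layout is a small path-planning function. The `_cleaned` and `_audit` suffixes, the log filename pattern, and the `plan_outputs` name are all assumptions made for illustration; the real filenames may differ.

```python
from datetime import datetime
from pathlib import Path


def plan_outputs(input_path, now, log_dir=Path("logs")):
    """Derive output paths without touching the input file.

    Suffixes and the timestamp format are hypothetical.
    """
    src = Path(input_path)
    stamp = now.strftime("%Y%m%d-%H%M%S")
    return {
        "cleaned": src.with_name(f"{src.stem}_cleaned{src.suffix}"),
        "audit": src.with_name(f"{src.stem}_audit.csv"),
        "log": log_dir / f"{src.stem}_{stamp}.log",
    }


outputs = plan_outputs("data/sales.csv", datetime(2026, 5, 1, 9, 30, 0))
```

Because every output path differs from the input path, writing the results cannot overwrite the original.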
## 6. Troubleshooting
- **GUI won't launch / browser doesn't open** — wait 10–15 s, then manually visit `http://127.0.0.1:8501` (or whichever port the launcher window prints). Port-in-use error → close other instances. The launcher walks ports 8501–8550 looking for a free one, so a stale instance can shift the URL.
- **Why does my browser open?** — local web app pattern (same as Jupyter, RStudio). Nothing leaves your machine.
- **Windows SmartScreen** — click "More info" → "Run anyway". One-time per build until we have an EV-signed cert.
- **macOS "App is damaged" / "developer cannot be verified"** — right-click the app → **Open** → confirm. If the message persists, the file was likely corrupted in transit — re-download. As a last resort: `xattr -cr /Applications/DataTools.app` clears the quarantine attribute.
- **macOS portable .zip — extracted but won't open** — Safari unzips on download by default; if you see a `__MACOSX/` folder or `._DataTools.app` file you used a different unarchiver. Re-extract with the built-in Archive Utility (right-click the .zip → **Open With → Archive Utility**) so the .app's metadata is preserved.
- **Windows portable .zip — antivirus quarantines DataTools.exe** — your AV doesn't recognize the bundle. Allowlist the extracted folder. The installer .exe trips fewer AV products because it's a known Inno Setup wrapper.
- **Linux AppImage won't run** — `chmod +x file.AppImage`. Missing FUSE → `sudo apt install libfuse2`.
- **Slow on large file** — over ~100k rows takes longer; progress bar shows. Multi-million rows → use the CLI directly.
- **Where does the app store my license / settings?** — `~/.datatools/` on macOS + Linux, `C:\Users\<you>\.datatools\` on Windows. Your input/output files stay where you put them; the app never copies them anywhere else.
- **Need help** — email the address on your purchase receipt.
## 7. License
Single-user. See `LICENSE.txt`.