Sweep follow-up to 93e43fc. Display labels now consistent across docs,
landing pages, CLI output, code comments, docstrings, and test prose.
Five parallel surfaces touched:
- docs (EN + ES): README, USER-GUIDE, CLI-REFERENCE, and 11 internal
design/planning docs
- landing pages: index + bookkeeper/revops/shopify-pet
- src: CLI module docstrings, _TOOL_DISPLAY dicts in cli_analyze.py
and gui/components/_legacy.py, core module headers, every tool
page's module docstring
- tests: class/method/module docstrings and section-header comments
- test-cases READMEs
Page slugs (1_Deduplicator etc.), tool_id strings (01_deduplicator
etc.), Python class names (TestDeduplicatorWorkflow, FeatureFlag.*),
URL paths, anchor IDs, CSS classes, and asset filenames were left
intact since they're code identifiers / structural references.
All 2033 tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
🌐 Language: English · Español
User Guide
Version: 1.6 · Updated: 2026-05-01
0. First launch — activation
DataTools must be activated before any tools unlock. On first launch you'll see the Activate screen.
Enter your full name + email, paste the license blob from your purchase email (starts with DTLIC1:), and click Activate. Renewal works the same way — paste the renewal blob, click Apply renewal.
Tiers:
| Tier | Tools |
|---|---|
| Lite | Find Duplicates · Clean Text · Standardize Formats |
| Core | All 9 tools |
A Lite user opening a Core-only tool sees an "Upgrade your license" prompt. The home page also shows a 🔒 Locked badge on tool cards your tier doesn't unlock. To upgrade, paste a Core blob on the Activate page.
Every license lasts 1 year. The sidebar shows your tier and days remaining at all times; a renewal warning appears 30 days before expiry. The license file lives at ~/.datatools/license.json (Windows: C:\Users\<you>\.datatools\license.json).
To use the same license on a different machine: deactivate this one (Activate page → Deactivate this device) and re-paste your blob on the new machine.
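The expiry rules above (a 1-year term, a warning 30 days before expiry) can be sketched in a few lines of Python. This is only an illustration: `license_status` and its result keys are hypothetical names, the 365-day term is an assumption about how "1 year" is counted, and nothing here reflects the real layout of license.json.

```python
from datetime import date, timedelta

LICENSE_TERM_DAYS = 365     # assumption: "1 year" counted as 365 days
RENEWAL_WARNING_DAYS = 30   # warning window stated in this guide


def license_status(activated_on: date, today: date) -> dict:
    """Hypothetical helper: days remaining and whether to show the warning."""
    expires_on = activated_on + timedelta(days=LICENSE_TERM_DAYS)
    days_remaining = (expires_on - today).days
    return {
        "expires_on": expires_on,
        "days_remaining": max(days_remaining, 0),
        "expired": days_remaining < 0,
        "show_renewal_warning": 0 <= days_remaining <= RENEWAL_WARNING_DAYS,
    }


status = license_status(date(2026, 5, 1), date(2027, 4, 10))
print(status["days_remaining"], status["show_renewal_warning"])  # 21 True
```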
1. Install
You don't need Python — the bundle is self-contained.
| OS | File | How |
|---|---|---|
| Windows | BundleName-Setup-1.0.exe | Double-click installer → desktop shortcut. |
| macOS | BundleName-1.0.dmg | Mount, drag to Applications. Signed + notarized. |
| Linux | BundleName-1.0.AppImage | chmod +x, double-click. (.tar.gz fallback available.) |
Launching opens your default browser to a local page (http://localhost:8501).
How the GUI works
- Runs locally on your machine. No internet, no upload.
- Browser is just the display surface. Closing it stops the underlying program.
- Prefer the terminal? Every tool ships with a CLI too (Section 3).
System requirements
- Windows 10/11 (64-bit), macOS 11+, modern Linux (2020+).
- Modern browser (Chrome, Edge, Firefox, Safari, last 3 years).
- ~400-500 MB free disk space.
Full numbered support matrix: REQUIREMENTS.md.
2. What's included
| # | Tool | Purpose | Status |
|---|---|---|---|
| 01 | Find Duplicates | Exact + fuzzy match, 5 normalizers, audit | Ready |
| 02 | Clean Text | Whitespace, smart chars, BOM, line endings, case ops | Ready |
| 03 | Standardize Formats | Dates / phones / emails / addresses / names / currencies / booleans | Ready |
| 04 | Fix Missing Values | Disguised nulls, imputation, drop-by-threshold | Coming Soon |
| 05 | Map Columns | Rename + enforce schema | Coming Soon |
| 06 | Find Unusual Values | z-score, IQR, multivariate | Coming Soon |
| 07 | Combine Files | Combine multiple files | Coming Soon |
| 08 | Quality Check | Rules + PDF/Excel report | Coming Soon |
| 09 | Automated Workflows | One-click multi-tool launcher | Coming Soon |
Sample data (samples/): messy_sales.csv, bank_export.xlsx.
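To make the Clean Text row concrete, here is a minimal Python sketch of the kinds of fixes it lists (BOM, line endings, smart characters, whitespace). It is not the tool's implementation: `clean_text` and the `SMART_CHARS` mapping are hypothetical, and the real tool may cover more characters and options.

```python
import re

# Assumed mapping of "smart" characters to plain ASCII equivalents.
SMART_CHARS = {
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en and em dashes
    "\u00a0": " ",                  # non-breaking space
}
SMART_TABLE = str.maketrans(SMART_CHARS)


def clean_text(value: str) -> str:
    """Illustrative cleanup: BOM, line endings, smart chars, whitespace."""
    value = value.lstrip("\ufeff")                            # drop a leading BOM
    value = value.replace("\r\n", "\n").replace("\r", "\n")   # normalize line endings
    value = value.translate(SMART_TABLE)                      # replace smart characters
    value = re.sub(r"[ \t]+", " ", value)                     # collapse runs of spaces/tabs
    return value.strip()                                      # trim both ends


print(clean_text("\ufeff  \u201cAcme\u201d\u00a0 Corp \r\n"))  # "Acme" Corp
```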
3. Usage
3.1 GUI (recommended)
- Launch the bundle.
- Pick a tool from the sidebar.
- Drop your file (or select a sample).
- Defaults are pre-filled — click Run to preview.
- Click Save Output to write the cleaned file.
Advanced options are tucked in expander panes. The original file is never modified.
3.2 CLI
deduplicator customers.csv [--apply]
text-cleaner messy.csv [--apply]
format-standardize feed.csv [--apply]
Get help: deduplicator --help. Full reference: CLI-REFERENCE.md.
3.3 Run order (when running tools manually)
If you skip Automated Workflows, follow this order:
- 02 Clean Text first — normalizes whitespace + special chars.
- 03 Standardize Formats — dates, phones, etc. need cleaned text.
- 04 Fix Missing Values — sentinel codes hide as numbers.
- 05 Map Columns — schema before outlier stats.
- 06 Find Unusual Values — needs clean numerics. Stats on data with NaN or -999 are mathematically poisoned.
- 07 Combine Files, 08 Quality Check as needed.
- 01 Find Duplicates is order-flexible (normalizes internally for matching).
Automated Workflows enforces this automatically.
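Why missing values must be fixed before outlier statistics can be seen with a small Python experiment. The column below is made up for illustration, with -999 standing in as a disguised null; the sketch only uses the standard library and does not represent the tools' internals.

```python
import math
import statistics

# Hypothetical numeric column where -999 is a disguised "missing" marker.
amounts = [10.0, 12.0, 11.0, 13.0, -999.0, 12.0]


def mean_and_stdev(values):
    """Sample mean and sample standard deviation."""
    return statistics.mean(values), statistics.stdev(values)


raw_mean, raw_sd = mean_and_stdev(amounts)

# Treat the sentinel (and any NaN) as missing and exclude it first.
cleaned = [v for v in amounts if v != -999.0 and not math.isnan(v)]
clean_mean, clean_sd = mean_and_stdev(cleaned)

print(f"with sentinel:    mean={raw_mean:.1f}  stdev={raw_sd:.1f}")
print(f"without sentinel: mean={clean_mean:.1f}  stdev={clean_sd:.1f}")
```

With the sentinel left in, the mean turns negative and the standard deviation grows by more than two orders of magnitude, so z-scores computed from it say nothing useful about the real values.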
3.4 Language
The sidebar has a Language / Idioma picker. Two packs ship today:
- English (default)
- Español
Pick a language once — the choice persists for the session and the picker is visible from every page. Switch any time; the page re-renders in place with no data loss.
Coverage (v1.6): home page, tool cards, the upload + analysis panel, the findings list, the Review & Normalize gate prompt, the sidebar picker, and the shutdown screen. Per-tool page bodies (advanced-option labels, column-mapper prompts, dedup review labels) are tracked for future packs — they currently render in English in both modes. If a string you'd expect to switch doesn't, that's a missing pack key, not a bug in the picker; email support with a screenshot.
4. Review & Normalize gate
Every uploaded file is scanned before any tool sees it.
Confidence tiers:
- High — round-trip safe. One-click "Auto-fix high-confidence" applies them all.
- Medium — usually right, occasional false positives. Preview first.
- Low — heuristic. Off by default; opt in per finding.
- Error — blocks the gate (empty file, U+FFFD, unrepairable rows).
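The tier rules above can be read as a small decision procedure. The sketch below is one possible reading in Python, not the gate's real code: `Finding`, `Confidence`, and `plan_gate` are hypothetical names, and routing opted-in Low findings through a preview is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Confidence(Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"
    ERROR = "error"


@dataclass
class Finding:
    """Hypothetical finding record; not the tool's real data model."""
    description: str
    confidence: Confidence
    opted_in: bool = False  # set when the user ticks a finding explicitly


def plan_gate(findings):
    """Split findings into (blocked, auto_fix, needs_preview, skipped)."""
    if any(f.confidence is Confidence.ERROR for f in findings):
        return True, [], [], []          # an Error finding blocks the gate
    auto_fix, needs_preview, skipped = [], [], []
    for f in findings:
        if f.confidence is Confidence.HIGH:
            auto_fix.append(f)           # safe to apply in one click
        elif f.confidence is Confidence.MEDIUM:
            needs_preview.append(f)      # preview before applying
        elif f.opted_in:
            needs_preview.append(f)      # Low, but the user opted in
        else:
            skipped.append(f)            # Low findings are off by default
    return False, auto_fix, needs_preview, skipped


findings = [
    Finding("trailing whitespace in 'name'", Confidence.HIGH),
    Finding("mixed date formats in 'signup'", Confidence.MEDIUM),
    Finding("possible header row in data", Confidence.LOW),
]
blocked, auto_fix, needs_preview, skipped = plan_gate(findings)
print(blocked, len(auto_fix), len(needs_preview), len(skipped))  # False 1 1 1
```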
Encoding override: when the picker reports encoding_uncertain or you spot mojibake (é) or <EFBFBD> chars, choose the right codepage at the top of the page (cp1252 for Western Excel, KOI8-R for older Russian, Big5 for traditional Chinese, …) → Re-analyze.
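The two symptoms have different causes, which a short Python session makes visible. This is a sketch using only Python's built-in codecs; `decode_with_override` is a hypothetical stand-in for what choosing a codepage and re-analyzing does, not the tool's actual function.

```python
text = "café"

# Case 1: UTF-8 bytes read as cp1252 produce mojibake.
mojibake = text.encode("utf-8").decode("cp1252")
print(mojibake)  # cafÃ©

# Case 2: cp1252 bytes (for example a Western Excel export) read as UTF-8
# are invalid, so a lenient reader substitutes U+FFFD.
excel_bytes = text.encode("cp1252")
lossy = excel_bytes.decode("utf-8", errors="replace")
print("\ufffd" in lossy)  # True


def decode_with_override(raw: bytes, encoding: str) -> str:
    """Hypothetical override: decode the same bytes with the chosen codepage."""
    return raw.decode(encoding)


print(decode_with_override(excel_bytes, "cp1252"))  # café
```

The bytes on disk never change; only the codec used to interpret them does, which is why re-analyzing with the right codepage is lossless.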
Advanced output: an ⚙️ expander on the download lets you tune encoding, delimiter, and line terminator. The download filename auto-adjusts (.tsv for tab, .csv otherwise).
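As a rough illustration of these output options, the Python sketch below derives the download name from the delimiter and serializes rows with a chosen delimiter, line terminator, and encoding. `output_name` and `render` are hypothetical helpers, and reusing the input file's stem for the download name is an assumption.

```python
import csv
import io
from pathlib import Path


def output_name(input_name: str, delimiter: str) -> str:
    """.tsv for tab-delimited output, .csv otherwise (stem reuse is assumed)."""
    suffix = ".tsv" if delimiter == "\t" else ".csv"
    return Path(input_name).with_suffix(suffix).name


def render(rows, delimiter=",", lineterminator="\r\n", encoding="utf-8") -> bytes:
    """Serialize rows using the chosen delimiter, terminator, and encoding."""
    buffer = io.StringIO(newline="")  # keep the terminator exactly as written
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator=lineterminator)
    writer.writerows(rows)
    return buffer.getvalue().encode(encoding)


rows = [["name", "city"], ["Ada", "London"]]
print(output_name("customers.xlsx", "\t"))             # customers.tsv
print(render(rows, delimiter="\t", lineterminator="\n"))
```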
5. Output
Every run writes:
- Cleaned file next to the input (or wherever you specify).
- Audit file (per-cell changes for text/format tools, match groups for dedup).
- Timestamped log in logs/.
Original input is never modified.
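A per-cell audit of the kind described for the text and format tools can be pictured as a diff between the original and cleaned rows. The Python sketch below is an assumption about the idea only: `cell_changes` and the record keys are hypothetical, and it presumes both inputs have the same rows in the same order (which would not hold for dedup output).

```python
def cell_changes(original_rows, cleaned_rows):
    """Return one audit record per cell whose value changed."""
    changes = []
    pairs = zip(original_rows, cleaned_rows)
    for row_number, (before, after) in enumerate(pairs, start=1):
        for column, old_value in before.items():
            new_value = after.get(column)
            if new_value != old_value:
                changes.append({
                    "row": row_number,
                    "column": column,
                    "before": old_value,
                    "after": new_value,
                })
    return changes


original = [
    {"name": " Ada ", "phone": "555 0100"},
    {"name": "Bob", "phone": "555-0101"},
]
cleaned = [
    {"name": "Ada", "phone": "555-0100"},
    {"name": "Bob", "phone": "555-0101"},
]
for change in cell_changes(original, cleaned):
    print(change)
```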
6. Troubleshooting
- GUI won't launch / browser doesn't open — wait 10-15 s; manually visit http://localhost:8501. Port-in-use error → close other instances.
- Why does my browser open? — local web app pattern (same as Jupyter, RStudio). Nothing leaves your machine.
- Windows SmartScreen — click "More info" → "Run anyway". Standard for non-EV-signed software.
- macOS "App is damaged" — re-download (file likely corrupted in transit).
- Linux AppImage won't run — chmod +x file.AppImage. Missing FUSE → sudo apt install libfuse2 or use .tar.gz.
- Slow on large file — over ~100k rows takes longer; progress bar shows. Multi-million rows → use the CLI directly.
- Need help — email the address on your purchase receipt.
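If you have Python available and want to check whether something is already listening on the GUI's port before launching again, a small standard-library probe is enough. This is an optional diagnostic sketch, not part of the bundle; `port_in_use` is a hypothetical helper, and 8501 is the port named in this guide.

```python
import socket


def port_in_use(port: int = 8501, host: str = "127.0.0.1") -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


if port_in_use():
    print("Port 8501 is busy: another instance may still be running.")
else:
    print("Port 8501 is free.")
```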
7. License
Single-user. See LICENSE.txt.