🌐 Language: English · Español

User Guide

Version: 1.6 · Updated: 2026-05-01

0. First launch — activation

DataTools must be activated before any tools unlock. On first launch you'll see the Activate screen.

Enter your full name + email, paste the license blob from your purchase email (starts with DTLIC1:), and click Activate. Renewal works the same way — paste the renewal blob, click Apply renewal.

Tiers:

| Tier | Tools |
|------|-------|
| Lite | Find Duplicates · Clean Text · Standardize Formats |
| Core | All 9 tools |

A Lite user opening a Core-only tool sees an "Upgrade your license" prompt. The home page also shows a 🔒 Locked badge on tool cards your tier doesn't unlock. To upgrade, paste a Core blob on the Activate page.
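
As an illustration of how the tier gate behaves, the sketch below maps each tier to the tools listed in this guide. The names `TIER_TOOLS`, `is_unlocked`, and `locked_tools` are hypothetical, chosen for this example; they are not DataTools' actual internals.

```python
# Hypothetical tier table built from the Tiers table and the tool list in this guide.
ALL_TOOLS = [
    "Find Duplicates", "Clean Text", "Standardize Formats",
    "Fix Missing Values", "Map Columns", "Find Unusual Values",
    "Combine Files", "Quality Check", "Automated Workflows",
]

TIER_TOOLS = {
    "Lite": set(ALL_TOOLS[:3]),   # first three tools only
    "Core": set(ALL_TOOLS),       # all 9 tools
}


def is_unlocked(tier: str, tool: str) -> bool:
    """Return True if the given tier includes the tool."""
    return tool in TIER_TOOLS.get(tier, set())


def locked_tools(tier: str) -> list[str]:
    """Tools that would show a Locked badge for this tier."""
    return [t for t in ALL_TOOLS if not is_unlocked(tier, t)]


if __name__ == "__main__":
    print(locked_tools("Lite"))
```

Under these assumptions a Lite license leaves six tools locked, and a Core license leaves none.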

Every license lasts 1 year. The sidebar shows your tier and days remaining at all times; a renewal warning appears 30 days before expiry. The license file lives at ~/.datatools/license.json (Windows: C:\Users\<you>\.datatools\license.json).
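
For readers who want to locate the license file from a script, here is a minimal sketch. It assumes only the path given above. The schema of license.json is not documented in this guide, so the expiry date is passed in as an argument rather than read from the file, and the function names are hypothetical.

```python
from datetime import date
from pathlib import Path

WARNING_WINDOW_DAYS = 30  # the guide says the warning appears 30 days before expiry


def license_path() -> Path:
    """Build ~/.datatools/license.json.

    Path.home() typically resolves to C:\\Users\\<you> on Windows
    and to the user's home directory on macOS and Linux.
    """
    return Path.home() / ".datatools" / "license.json"


def days_remaining(expires_on: date, today: date) -> int:
    """Whole days between today and the expiry date (negative once expired)."""
    return (expires_on - today).days


def needs_renewal_warning(expires_on: date, today: date) -> bool:
    """True while the license is still valid but inside the warning window."""
    return 0 <= days_remaining(expires_on, today) <= WARNING_WINDOW_DAYS


if __name__ == "__main__":
    print(license_path())
    print(days_remaining(date(2027, 1, 1), date(2026, 12, 2)))
```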

To use the same license on a different machine: deactivate this one (Activate page → Deactivate this device) and re-paste your blob on the new machine.

1. Install

You don't need Python and you don't need admin rights — the bundle ships its own interpreter and every dependency. Each OS gets a single installer that wires up the Desktop shortcut + Start Menu / Launchpad entry automatically.

1.1 Windows

Installer (DataTools-<ver>-win-setup.exe)

  1. Download DataTools-<ver>-win-setup.exe from your release email or GitHub Releases.
  2. Double-click the installer. On the first run Windows SmartScreen will say "Windows protected your PC" — click More info → Run anyway. (This warning only appears once per build until we have an EV code-signing cert.)
  3. Accept the per-user install location (%LOCALAPPDATA%\Programs\DataTools by default — no admin prompt). Check Create a desktop shortcut if you want one (on by default).
  4. Click Install, then Finish. The installer offers to launch DataTools immediately.
  5. From now on launch from: Start Menu → DataTools, the Desktop shortcut, or just type DataTools into Windows Run (Win+R) / cmd.

To pin to the taskbar, launch the app once, right-click its icon in the taskbar, then Pin to taskbar. Windows requires this manual step — no installer is allowed to pin programmatically.

Uninstall: Settings → Apps → DataTools → Uninstall.

1.2 macOS

Installer DMG (DataTools-<ver>-mac.dmg)

  1. Download DataTools-<ver>-mac.dmg.
  2. Double-click the .dmg. A Finder window opens showing the DataTools icon and an Applications alias.
  3. Drag DataTools onto Applications. Wait for the copy to finish, then eject the DMG.
  4. On unsigned builds the first launch shows "DataTools" cannot be opened because the developer cannot be verified. Fix: right-click DataTools in /Applications → Open → confirm Open in the dialog. macOS remembers this choice — subsequent launches are clean.
  5. Launch from Launchpad, Spotlight (⌘ Space → type "DataTools"), or Applications in Finder.

To keep DataTools in the Dock: launch the app, right-click its Dock icon → Options → Keep in Dock. macOS doesn't allow installers to pin to the Dock automatically.

Uninstall: drag DataTools.app to the Trash. Your data files stay where you put them — nothing else is installed.

1.3 Linux

DataTools-<ver>-linux-x86_64.AppImage is already portable — no separate zip needed.

  1. Download the .AppImage.
  2. chmod +x DataTools-*.AppImage.
  3. Double-click, or run it from a terminal.

If your distro doesn't ship FUSE 2: sudo apt install libfuse2 (Debian/Ubuntu) or equivalent.

1.4 What happens on first launch

The launcher (called DataTools.exe / DataTools.app / DataTools.AppImage) does three things, in order:

  1. Picks a free TCP port on 127.0.0.1 — usually 8501, falls back through 8502, 8503, … if another app is using 8501.
  2. Starts a local Streamlit server on that port. The server is bound to localhost only, never to your LAN.
  3. Opens your default browser at http://127.0.0.1:<port>/. If the browser doesn't open within 5 seconds, paste that URL into your browser manually.

The launcher window stays open in the background. Closing it stops the server — the browser tab will say "this site can't be reached" the next time you click it.
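
The port-selection step described above can be sketched as follows. This is an illustration of the general technique (try to bind, move on if the port is taken), not the launcher's actual code. The function names are hypothetical, the 8550 upper bound is taken from the Troubleshooting section, and starting the Streamlit server is left out.

```python
import socket


def find_free_port(start: int = 8501, end: int = 8550, host: str = "127.0.0.1") -> int:
    """Return the first port in [start, end] that can be bound on localhost."""
    for port in range(start, end + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
            try:
                probe.bind((host, port))
            except OSError:
                continue  # port is in use, try the next one
            return port
    raise RuntimeError(f"no free port between {start} and {end}")


def launch_url(port: int) -> str:
    """URL the browser would be pointed at for a given port."""
    return f"http://127.0.0.1:{port}/"


if __name__ == "__main__":
    port = find_free_port()
    print(launch_url(port))
```

Binding to 127.0.0.1 rather than 0.0.0.0 is what keeps a server reachable from the local machine only.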

1.5 How the GUI works

  • Runs locally on your machine. No internet, no upload.
  • The browser is just the display surface. Closing it does NOT stop the app — close the launcher window (or quit the macOS .app from the Dock) to fully exit.
  • Prefer the terminal? Every tool ships with a CLI too (Section 3).

1.6 System requirements

  • Windows 10/11 (64-bit), macOS 11+, modern Linux (2020+).
  • Modern browser (Chrome, Edge, Firefox, or Safari; any version released in the last 3 years).
  • ~500 MB free disk space (the bundle itself is ~300 MB; the rest is working scratch space for large CSVs).

OCR for scanned PDFs is bundled — Tesseract 5.5 + the English eng.traineddata model ship inside every installer / portable / AppImage. The PDF Extractor's scanned-statement path works out of the box; no separate install required. (Developers running from a pip install -r requirements.txt checkout still need system Tesseract on PATH — see DEVELOPER.md §PDF Extractor — bundled Tesseract.)

Full numbered support matrix: REQUIREMENTS.md.

2. What's included

| # | Tool | Purpose | Status |
|----|------|---------|--------|
| 01 | Find Duplicates | Exact + fuzzy match, 5 normalizers, audit | Ready |
| 02 | Clean Text | Whitespace, smart chars, BOM, line endings, case ops | Ready |
| 03 | Standardize Formats | Dates / phones / emails / addresses / names / currencies / booleans | Ready |
| 04 | Fix Missing Values | Disguised nulls, imputation, drop-by-threshold | Coming Soon |
| 05 | Map Columns | Rename + enforce schema | Coming Soon |
| 06 | Find Unusual Values | z-score, IQR, multivariate | Coming Soon |
| 07 | Combine Files | Combine multiple files | Coming Soon |
| 08 | Quality Check | Rules + PDF/Excel report | Coming Soon |
| 09 | Automated Workflows | One-click multi-tool launcher | Coming Soon |

Sample data (samples/): messy_sales.csv, bank_export.xlsx.

3. Usage

  1. Launch the bundle.
  2. Pick a tool from the sidebar.
  3. Drop your file (or select a sample).
  4. Defaults are pre-filled — click Run to preview.
  5. Click Save Output to write the cleaned file.

Advanced options are tucked in expander panes. The original file is never modified.

In-tool Help: every tool page has a Help button right of the title. Click it to open a popover with a compact how-to (When to use · Steps · Examples · Tip). Use it as a refresher mid-task — the popover closes when you click outside, your inputs are untouched.

Sidebar nav: the sidebar groups tools into sections (Analysis, Data Cleaners, Transformations, Automations). Each section header shows + when collapsed and − when expanded; click the header to toggle.

3.2 CLI

deduplicator       customers.csv [--apply]
text-cleaner       messy.csv     [--apply]
format-standardize feed.csv      [--apply]

Get help: deduplicator --help. Full reference: CLI-REFERENCE.md.

3.3 Run order (when running tools manually)

If you skip Automated Workflows, follow this order:

  1. 02 Clean Text first — normalizes whitespace + special chars.
  2. 03 Standardize Formats — dates, phones, etc. need cleaned text.
  3. 04 Fix Missing Values — sentinel codes hide as numbers.
  4. 05 Map Columns — schema before outlier stats.
  5. 06 Find Unusual Values — needs clean numerics. Stats on data with NaN or -999 are mathematically poisoned.
  6. 07 Combine Files, 08 Quality Check as needed.
  7. 01 Find Duplicates is order-flexible (normalizes internally for matching).

Automated Workflows enforces this automatically.
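
If you script your own pipeline, the recommended order can be captured in a small helper. This is a sketch under stated assumptions: `RUN_ORDER`, `ORDER_FLEXIBLE`, and `plan` are hypothetical names, and placing Find Duplicates last is an arbitrary choice for the example, since the guide says its position is flexible.

```python
# Recommended manual order from this section (tools 02 through 08).
RUN_ORDER = [
    "Clean Text",
    "Standardize Formats",
    "Fix Missing Values",
    "Map Columns",
    "Find Unusual Values",
    "Combine Files",
    "Quality Check",
]

# Order-flexible tools; this sketch simply appends them at the end.
ORDER_FLEXIBLE = {"Find Duplicates"}


def plan(selected: list[str]) -> list[str]:
    """Sort the selected tools into the recommended run order."""
    unknown = [t for t in selected if t not in RUN_ORDER and t not in ORDER_FLEXIBLE]
    if unknown:
        raise ValueError(f"unknown tools: {unknown}")
    ordered = [t for t in RUN_ORDER if t in selected]
    flexible = [t for t in selected if t in ORDER_FLEXIBLE]
    return ordered + flexible


if __name__ == "__main__":
    print(plan(["Find Unusual Values", "Find Duplicates", "Clean Text"]))
```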

3.4 Language

The sidebar has a Language / Idioma picker. Two packs ship today:

  • English (default)
  • Español

Pick a language once — the choice persists for the session and the picker is visible from every page. Switch any time; the page re-renders in place with no data loss.

Coverage (v1.6): home page, tool cards, the upload + analysis panel, the findings list, the Review & Normalize gate prompt, the sidebar picker, and the shutdown screen. Per-tool page bodies (advanced-option labels, column-mapper prompts, dedup review labels) are tracked for future packs — they currently render in English in both modes. If a string you'd expect to switch doesn't, that's a missing pack key, not a bug in the picker; email support with a screenshot.

4. Review & Normalize gate

Every uploaded file is scanned before any tool sees it.

Confidence tiers:

  • High — round-trip safe. One-click "Auto-fix high-confidence" applies them all.
  • Medium — usually right, occasional false positives. Preview first.
  • Low — heuristic. Off by default; opt in per finding.
  • Error — blocks the gate (empty file, U+FFFD, unrepairable rows).
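
One way to picture the tiers is as a filter over a list of findings. The sketch below is illustrative only: the `Finding` class and both functions are hypothetical, and the guide does not describe how the gate represents findings internally.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    name: str
    confidence: str  # one of "high", "medium", "low", "error"


def auto_fix_candidates(findings: list[Finding]) -> list[Finding]:
    """Findings a one-click 'Auto-fix high-confidence' would apply."""
    return [f for f in findings if f.confidence == "high"]


def is_blocked(findings: list[Finding]) -> bool:
    """Any Error-tier finding blocks the gate."""
    return any(f.confidence == "error" for f in findings)


if __name__ == "__main__":
    # Example findings are made up for the demonstration.
    findings = [
        Finding("trailing whitespace", "high"),
        Finding("mixed date formats", "medium"),
        Finding("possible header row in data", "low"),
    ]
    print([f.name for f in auto_fix_candidates(findings)])
    print(is_blocked(findings))
```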

Encoding override: when the picker reports encoding_uncertain or you spot mojibake (such as Ã© where é belongs) or <EFBFBD> chars, choose the right codepage at the top of the page (cp1252 for Western Excel, KOI8-R for older Russian, Big5 for traditional Chinese, …) → Re-analyze.
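
To see why the codepage choice matters, the following sketch decodes the same bytes with a wrong and a right codec using only the Python standard library. The sample text "café" and the function name are made up for the illustration; this shows the general mechanism, not the gate's own detection logic.

```python
def decode_with(raw: bytes, codec: str) -> str:
    """Decode bytes, substituting U+FFFD for anything the codec cannot read."""
    return raw.decode(codec, errors="replace")


original = "café"

# A cp1252 export read as UTF-8: the lone 0xE9 byte is invalid UTF-8,
# so it becomes the U+FFFD replacement character.
cp1252_bytes = original.encode("cp1252")
wrong = decode_with(cp1252_bytes, "utf-8")
right = decode_with(cp1252_bytes, "cp1252")

# The reverse mistake: UTF-8 bytes read as cp1252 produce two-character mojibake.
mojibake = decode_with(original.encode("utf-8"), "cp1252")

if __name__ == "__main__":
    print(repr(wrong), repr(right), repr(mojibake))
```

U+FFFD itself encodes to the bytes EF BF BD in UTF-8, which is where the EFBFBD marker comes from.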

Advanced output: an ⚙️ expander on the download lets you tune encoding, delimiter, and line terminator. The download filename auto-adjusts (.tsv for tab, .csv otherwise).
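
The three output knobs correspond closely to options in Python's `csv` module, so the behavior can be sketched as below. The function names and the file stem are hypothetical; only the extension rule (.tsv for tab, .csv otherwise) comes from this guide.

```python
import csv
import io


def output_name(stem: str, delimiter: str) -> str:
    """Pick the extension from the delimiter: .tsv for tab, .csv otherwise."""
    return f"{stem}.tsv" if delimiter == "\t" else f"{stem}.csv"


def render(rows, delimiter=",", lineterminator="\r\n", encoding="utf-8") -> bytes:
    """Serialize rows with the chosen delimiter, line terminator, and encoding."""
    buf = io.StringIO(newline="")  # newline="" keeps the terminator untouched
    writer = csv.writer(buf, delimiter=delimiter, lineterminator=lineterminator)
    writer.writerows(rows)
    return buf.getvalue().encode(encoding)


if __name__ == "__main__":
    rows = [["name", "city"], ["Zoë", "Kraków"]]
    print(output_name("cleaned", "\t"))
    print(render(rows, delimiter="\t", lineterminator="\n"))
```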

5. Output

Every run writes:

  • Cleaned file next to the input (or wherever you specify).
  • Audit file (per-cell changes for text/format tools, match groups for dedup).
  • Timestamped log in logs/.

Original input is never modified.

6. Troubleshooting

  • GUI won't launch / browser doesn't open — wait 10-15 s, then manually visit http://127.0.0.1:8501 (or whichever port the launcher window prints). Port-in-use error → close other instances. The launcher walks ports 8501–8550 looking for a free one, so a stale instance can shift the URL.
  • Why does my browser open? — local web app pattern (same as Jupyter, RStudio). Nothing leaves your machine.
  • Windows SmartScreen — click "More info" → "Run anyway". One-time per build until we have an EV-signed cert.
  • macOS "App is damaged" / "developer cannot be verified" — right-click the app → Open → confirm. If the message persists, the file was likely corrupted in transit — re-download. As a last resort: xattr -cr /Applications/DataTools.app clears the quarantine attribute.
  • macOS portable .zip — extracted but won't open — Safari unzips on download by default; if you see a __MACOSX/ folder or ._DataTools.app file you used a different unarchiver. Re-extract with the built-in Archive Utility (right-click the .zip → Open With → Archive Utility) so the .app's metadata is preserved.
  • Windows portable .zip — antivirus quarantines DataTools.exe — your AV doesn't recognize the bundle. Allowlist the extracted folder. The installer .exe trips fewer AV products because it's a known Inno Setup wrapper.
  • Linux AppImage won't run: make it executable with chmod +x file.AppImage. Missing FUSE → sudo apt install libfuse2.
  • Slow on large file — over ~100k rows takes longer; progress bar shows. Multi-million rows → use the CLI directly.
  • Where does the app store my license / settings? ~/.datatools/ on macOS + Linux, C:\Users\<you>\.datatools\ on Windows. Your input/output files stay where you put them; the app never copies them anywhere else.
  • Need help — email the address on your purchase receipt.
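
If the launcher window is hidden and you are unsure which port it chose, a short probe can list the localhost ports in the launcher's range that accept connections. This is a generic sketch, not a DataTools utility: the function name is hypothetical, and any local service listening in that range would show up, not only DataTools.

```python
import socket


def find_listening_ports(start=8501, end=8550, host="127.0.0.1", timeout=0.2):
    """Return the ports in [start, end] that accept a TCP connection on localhost."""
    listening = []
    for port in range(start, end + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
            probe.settimeout(timeout)
            # connect_ex returns 0 on success instead of raising on refusal
            if probe.connect_ex((host, port)) == 0:
                listening.append(port)
    return listening


if __name__ == "__main__":
    for port in find_listening_ports():
        print(f"http://127.0.0.1:{port}/")
```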

7. License

Single-user. See LICENSE.txt.