datatools-dev/docs/USER-GUIDE.md
Michael b703911df3 docs: reflect bundled Tesseract on every install surface
- NEW LICENSE_TESSERACT.txt at the repo root: header noting it covers
  the bundled Tesseract OCR binary (Apache 2.0, upstream
  tesseract-ocr/tesseract, copyright Google + contributors) and the
  eng.traineddata from tessdata_best (also Apache 2.0). Clarifies
  DataTools itself remains proprietary. Full canonical Apache 2.0
  license text included.
- README.md + README.es.md (Download section): bumped size estimate
  ~200 MB → ~300 MB, added a short paragraph stating Tesseract OCR
  is bundled (no separate install required), with a link to the new
  license file.
- docs/USER-GUIDE.md + docs/USER-GUIDE.es.md (§1.6 System
  requirements): bumped disk estimate, added a paragraph stating
  Tesseract 5.5 + eng.traineddata ship inside every installer /
  portable / AppImage, with a source-install fallback hint pointing
  developers to DEVELOPER.md.
- docs/DEVELOPER.md: new "PDF Extractor — bundled Tesseract" section
  documenting the runtime layout (sys._MEIPASS / tesseract / …),
  discovery order, source of bytes (build/vendor/tessdata + per-
  platform fetch in make_release.py), version pin, update recipe.
- docs/TECHNICAL.md: new §3.10 "Bundled Tesseract (PDF Extractor
  OCR)" — short version of the discovery order for the build
  pipeline section.
- build/README.md: distribution-outputs paragraph now lists
  Tesseract among bundled deps with the ~250-300 MB estimate; new
  "Tesseract bundling" section: layout diagram, resolver order,
  source of bytes + 5.5.0 pin, update steps, license-file ref.

Out-of-scope gaps noted by the docs sweep:
- docs/FUTURE-TOOLS.md §D still describes Tesseract bundling as a
  high-risk packaging headache; now superseded. Worth a one-line
  "(resolved — bundled as of v1.x)" callout in a future pass.
- USER-GUIDE §2 "What's included" table doesn't list PDF Extractor
  at all (it shipped in b8aff86…967d3f6). Separate gap to close.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-02 18:20:50 +00:00


🌐 Language: English · Español

User Guide

Version: 1.6 · Updated: 2026-05-01

0. First launch — activation

DataTools must be activated before any tools unlock. On first launch you'll see the Activate screen.

Enter your full name + email, paste the license blob from your purchase email (starts with DTLIC1:), and click Activate. Renewal works the same way — paste the renewal blob, click Apply renewal.

Tiers:

| Tier | Tools |
|------|-------|
| Lite | Find Duplicates · Clean Text · Standardize Formats |
| Core | All 9 tools |

A Lite user opening a Core-only tool sees an "Upgrade your license" prompt. The home page also shows a 🔒 Locked badge on tool cards your tier doesn't unlock. To upgrade, paste a Core blob on the Activate page.

Every license lasts 1 year. The sidebar shows your tier and days remaining at all times; a renewal warning appears 30 days before expiry. The license file lives at ~/.datatools/license.json (Windows: C:\Users\<you>\.datatools\license.json).
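If you need to confirm where the license file sits on a given machine, the short Python sketch below builds the documented path and reports whether the file exists. `license_path` is a hypothetical helper name, not part of DataTools, and the sketch does not read the file because its internal format is not described in this guide.

```python
from pathlib import Path


def license_path(home=None):
    """Return the documented license location: <home>/.datatools/license.json.

    `home` is an optional override for illustration; by default the current
    user's home directory is used.
    """
    base = Path(home) if home is not None else Path.home()
    return base / ".datatools" / "license.json"


if __name__ == "__main__":
    path = license_path()
    status = "found" if path.is_file() else "not found"
    print(f"{path}: {status}")
```

On Windows, `Path.home()` resolves to `C:\Users\<you>`, so the same code covers all three platforms.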

To use the same license on a different machine: deactivate this one (Activate page → Deactivate this device) and re-paste your blob on the new machine.

1. Install

You don't need Python and you don't need admin rights — the bundle ships its own interpreter and every dependency. Two flavors per OS, pick whichever your IT policy allows:

  • Installer — wires up Desktop shortcut + Start Menu / Launchpad entry automatically. Recommended for most users.
  • Portable .zip — unzip and double-click. No registry writes, runs from anywhere (Desktop, USB stick, network share). Use this if you can't run installers, want a single-folder install you can copy between machines, or are evaluating before committing to install.

Both flavors are byte-identical inside: same Python, same dependencies, same launch behavior.

1.1 Windows

Option A — Installer (DataTools-<ver>-win-setup.exe)

  1. Download DataTools-<ver>-win-setup.exe from your release email or GitHub Releases.
  2. Double-click the installer. On the first run Windows SmartScreen will say "Windows protected your PC" — click More info → Run anyway. (This warning only appears once per build until we have an EV code-signing cert.)
  3. Accept the per-user install location (%LOCALAPPDATA%\Programs\DataTools by default — no admin prompt). Check Create a desktop shortcut if you want one (on by default).
  4. Click Install, then Finish. The installer offers to launch DataTools immediately.
  5. From now on launch from: Start Menu → DataTools, the Desktop shortcut, or just type DataTools into Windows Run (Win+R) / cmd.

To pin to the taskbar, launch the app once, right-click its icon in the taskbar, then Pin to taskbar. Windows requires this manual step — no installer is allowed to pin programmatically.

Option B — Portable (DataTools-<ver>-win-portable.zip)

  1. Download DataTools-<ver>-win-portable.zip.
  2. Right-click the .zip → Extract All… → pick a folder (e.g. C:\Tools\DataTools).
  3. Open the extracted DataTools\ folder, double-click DataTools.exe. SmartScreen warning fires the first time only.
  4. To create your own desktop shortcut later: right-click DataTools.exe → Send to → Desktop (create shortcut).

Uninstall (installer only): Settings → Apps → DataTools → Uninstall. Portable: delete the folder.

1.2 macOS

Option A — Installer DMG (DataTools-<ver>-mac.dmg)

  1. Download DataTools-<ver>-mac.dmg.
  2. Double-click the .dmg. A Finder window opens showing the DataTools icon and an Applications alias.
  3. Drag DataTools onto Applications. Wait for the copy to finish, then eject the DMG.
  4. On unsigned builds the first launch shows "DataTools" cannot be opened because the developer cannot be verified. Fix: right-click DataTools in /Applications → Open → confirm Open in the dialog. macOS remembers this choice — subsequent launches are clean.
  5. Launch from Launchpad, Spotlight (⌘ Space → type "DataTools"), or Applications in Finder.

To keep DataTools in the Dock: launch the app, right-click its Dock icon → Options → Keep in Dock. macOS doesn't allow installers to pin to the Dock automatically.

Option B — Portable (DataTools-<ver>-mac-portable.zip)

  1. Download DataTools-<ver>-mac-portable.zip. Safari auto-unzips on download; in Finder you'll see DataTools.app directly.
  2. Move DataTools.app to Applications if you want it discoverable via Launchpad — or keep it on your Desktop, a USB stick, or a network share. The portable .app runs from anywhere.
  3. Double-click DataTools.app. Right-click → Open the first time (same unsigned-build dance as the DMG).

Uninstall: drag DataTools.app to the Trash. Your data files stay where you put them — nothing else is installed.

1.3 Linux

DataTools-<ver>-linux-x86_64.AppImage is already portable — no separate zip needed.

  1. Download the .AppImage.
  2. chmod +x DataTools-*.AppImage.
  3. Double-click, or run it from a terminal.

If your distro doesn't ship FUSE 2: sudo apt install libfuse2 (Debian/Ubuntu) or equivalent.

1.4 What happens on first launch

The launcher (called DataTools.exe / DataTools.app / DataTools.AppImage) does three things, in order:

  1. Picks a free TCP port on 127.0.0.1 — usually 8501, falls back through 8502, 8503, … if another app is using 8501.
  2. Starts a local Streamlit server on that port. The server is bound to localhost only, never to your LAN.
  3. Opens your default browser at http://127.0.0.1:<port>/. If the browser doesn't open within 5 seconds, paste that URL into your browser manually.

The launcher window stays open in the background. Closing it stops the server — the browser tab will say "this site can't be reached" the next time you click it.
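The port fallback in step 1 can be pictured with the minimal Python sketch below. It is an illustration under assumptions, not the launcher's actual code: `pick_free_port` is a hypothetical name, and the upper bound of 8550 is taken from the Troubleshooting section.

```python
import socket


def pick_free_port(first=8501, last=8550, host="127.0.0.1"):
    """Return the first port in [first, last] that can be bound on `host`."""
    for port in range(first, last + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
            try:
                probe.bind((host, port))
            except OSError:
                continue  # port already in use; try the next one
            return port  # the probe socket is closed on leaving the `with`
    raise RuntimeError(f"no free port between {first} and {last}")


if __name__ == "__main__":
    port = pick_free_port()
    print(f"http://127.0.0.1:{port}/")
```

Binding to 127.0.0.1 rather than 0.0.0.0 is what keeps the server reachable from your own machine only. Note that a probe like this releases the port before the server starts, so another program could in principle take it in between.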

1.5 How the GUI works

  • Runs locally on your machine. No internet, no upload.
  • The browser is just the display surface. Closing it does NOT stop the app — close the launcher window (or quit the macOS .app from the Dock) to fully exit.
  • Prefer the terminal? Every tool ships with a CLI too (Section 3).

1.6 System requirements

  • Windows 10/11 (64-bit), macOS 11+, modern Linux (2020+).
  • Modern browser (Chrome, Edge, Firefox, Safari, last 3 years).
  • ~500 MB free disk space (the bundle itself is ~300 MB; the rest is working scratch space for large CSVs).
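To check the disk requirement before installing, a small Python sketch like the one below is enough. `has_enough_space` is a hypothetical helper, and the 500 MB threshold is simply the figure from the list above.

```python
import shutil

REQUIRED_FREE_BYTES = 500 * 1024 * 1024  # ~500 MB, per the requirement above


def has_enough_space(path=".", required=REQUIRED_FREE_BYTES):
    """True if the filesystem holding `path` has at least `required` bytes free."""
    return shutil.disk_usage(path).free >= required


if __name__ == "__main__":
    print("enough free space" if has_enough_space() else "free up some disk space")
```

Pass the folder you plan to install into (or extract the portable .zip into) as `path`, since free space is measured per drive or volume.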

OCR for scanned PDFs is bundled — Tesseract 5.5 + the English eng.traineddata model ship inside every installer / portable / AppImage. The PDF Extractor's scanned-statement path works out of the box; no separate install required. (Developers running from a pip install -r requirements.txt checkout still need system Tesseract on PATH — see DEVELOPER.md §PDF Extractor — bundled Tesseract.)

Full numbered support matrix: REQUIREMENTS.md.

2. What's included

| #  | Tool                | Purpose                                                              | Status      |
|----|---------------------|----------------------------------------------------------------------|-------------|
| 01 | Find Duplicates     | Exact + fuzzy match, 5 normalizers, audit                            | Ready       |
| 02 | Clean Text          | Whitespace, smart chars, BOM, line endings, case ops                 | Ready       |
| 03 | Standardize Formats | Dates / phones / emails / addresses / names / currencies / booleans | Ready       |
| 04 | Fix Missing Values  | Disguised nulls, imputation, drop-by-threshold                       | Coming Soon |
| 05 | Map Columns         | Rename + enforce schema                                              | Coming Soon |
| 06 | Find Unusual Values | z-score, IQR, multivariate                                           | Coming Soon |
| 07 | Combine Files       | Combine multiple files                                               | Coming Soon |
| 08 | Quality Check       | Rules + PDF/Excel report                                             | Coming Soon |
| 09 | Automated Workflows | One-click multi-tool launcher                                        | Coming Soon |

Sample data (samples/): messy_sales.csv, bank_export.xlsx.

3. Usage

  1. Launch the bundle.
  2. Pick a tool from the sidebar.
  3. Drop your file (or select a sample).
  4. Defaults are pre-filled — click Run to preview.
  5. Click Save Output to write the cleaned file.

Advanced options are tucked in expander panes. The original file is never modified.

In-tool Help: every tool page has a Help button right of the title. Click it to open a popover with a compact how-to (When to use · Steps · Examples · Tip). Use it as a refresher mid-task — the popover closes when you click outside, your inputs are untouched.

Sidebar nav: the sidebar groups tools into sections (Analysis, Data Cleaners, Transformations, Automations). Each section header shows + when collapsed and − when expanded. Click the header to toggle.

3.2 CLI

deduplicator       customers.csv [--apply]
text-cleaner       messy.csv     [--apply]
format-standardize feed.csv      [--apply]

Get help: deduplicator --help. Full reference: CLI-REFERENCE.md.
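If you want to drive the CLI from a script, the Python sketch below shows one way to assemble the command lines above. `build_command` and `run_tool` are hypothetical helper names. The sketch only passes `--apply` through; what the flag does for each tool is covered in CLI-REFERENCE.md.

```python
import shutil
import subprocess


def build_command(tool, input_file, apply=False):
    """Build the argv list for one of the CLI tools listed above."""
    argv = [tool, input_file]
    if apply:
        argv.append("--apply")
    return argv


def run_tool(tool, input_file, apply=False):
    """Run the tool if it is on PATH; return its exit code, or None if missing."""
    if shutil.which(tool) is None:
        return None
    return subprocess.run(build_command(tool, input_file, apply)).returncode


if __name__ == "__main__":
    print(build_command("text-cleaner", "messy.csv"))
    print(build_command("deduplicator", "customers.csv", apply=True))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.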

3.3 Run order (when running tools manually)

If you skip Automated Workflows, follow this order:

  1. 02 Clean Text first — normalizes whitespace + special chars.
  2. 03 Standardize Formats — dates, phones, etc. need cleaned text.
  3. 04 Fix Missing Values — sentinel codes hide as numbers.
  4. 05 Map Columns — schema before outlier stats.
  5. 06 Find Unusual Values — needs clean numerics. Stats on data with NaN or -999 are mathematically poisoned.
  6. 07 Combine Files, 08 Quality Check as needed.
  7. 01 Find Duplicates is order-flexible (normalizes internally for matching).

Automated Workflows enforces this automatically.
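To see why step 5 insists on clean numerics, the Python sketch below compares statistics on a small column with and without a -999 sentinel. The numbers are invented for illustration, and the plain z-score shown here is a textbook formula, not necessarily what Find Unusual Values computes internally.

```python
import math
import statistics

clean = [10.0, 12.0, 11.0, 13.0]
with_sentinel = clean + [-999.0]      # -999 stands in for "missing"
with_nan = clean + [float("nan")]     # NaN left in the column

mean_clean = statistics.mean(clean)             # 11.5
mean_poisoned = statistics.mean(with_sentinel)  # dragged far below zero
nan_mean = sum(with_nan) / len(with_nan)        # NaN propagates through the sum


def z_score(value, data):
    """Distance of `value` from the mean of `data`, in sample standard deviations."""
    return (value - statistics.mean(data)) / statistics.stdev(data)


if __name__ == "__main__":
    print("mean, clean:         ", mean_clean)
    print("mean, with sentinel: ", mean_poisoned)
    print("mean, with NaN:      ", nan_mean, "(is NaN:", math.isnan(nan_mean), ")")
    print("z-score of 25, clean:        ", round(z_score(25.0, clean), 2))
    print("z-score of 25, with sentinel:", round(z_score(25.0, with_sentinel), 2))
```

A value of 25 stands out at roughly 10 standard deviations on the clean column, but scores below 1 once the sentinel inflates the spread, so a genuine outlier goes unnoticed.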

3.4 Language

The sidebar has a Language / Idioma picker. Two packs ship today:

  • English (default)
  • Español

Pick a language once — the choice persists for the session and the picker is visible from every page. Switch any time; the page re-renders in place with no data loss.

Coverage (v1.6): home page, tool cards, the upload + analysis panel, the findings list, the Review & Normalize gate prompt, the sidebar picker, and the shutdown screen. Per-tool page bodies (advanced-option labels, column-mapper prompts, dedup review labels) are tracked for future packs — they currently render in English in both modes. If a string you'd expect to switch doesn't, that's a missing pack key, not a bug in the picker; email support with a screenshot.

4. Review & Normalize gate

Every uploaded file is scanned before any tool sees it.

Confidence tiers:

  • High — round-trip safe. One-click "Auto-fix high-confidence" applies them all.
  • Medium — usually right, occasional false positives. Preview first.
  • Low — heuristic. Off by default; opt in per finding.
  • Error — blocks the gate (empty file, U+FFFD, unrepairable rows).
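The tier rules above can be read as a small decision procedure. The Python sketch below is one possible model of it, under assumptions: `Tier`, `Finding`, `gate_is_blocked`, and `fixes_to_apply` are hypothetical names, and the real gate may track more state than this.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"
    ERROR = "error"


@dataclass
class Finding:
    description: str
    tier: Tier
    selected: bool = False  # explicit user opt-in for medium/low findings


def gate_is_blocked(findings):
    """Any Error-tier finding blocks the gate."""
    return any(f.tier is Tier.ERROR for f in findings)


def fixes_to_apply(findings):
    """High-tier fixes are applied together; others only when opted in."""
    if gate_is_blocked(findings):
        return []
    return [f for f in findings if f.tier is Tier.HIGH or f.selected]


if __name__ == "__main__":
    findings = [
        Finding("strip BOM", Tier.HIGH),
        Finding("guess date order", Tier.LOW),
        Finding("trim trailing spaces", Tier.MEDIUM, selected=True),
    ]
    for fix in fixes_to_apply(findings):
        print(fix.tier.value, "-", fix.description)
```

In this model a Low finding is skipped unless you select it, which mirrors the "off by default" rule.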

Encoding override: when the picker reports encoding_uncertain, or you spot mojibake (such as Ã© where é belongs) or U+FFFD replacement characters (<EFBFBD>), choose the right codepage at the top of the page (cp1252 for Western Excel, KOI8-R for older Russian, Big5 for traditional Chinese, …) → Re-analyze.
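Both symptoms come from decoding bytes with the wrong codepage. The Python sketch below reproduces them with standard codecs only. It illustrates the general mechanism and is not the gate's own detection code.

```python
original = "café"

# UTF-8 bytes read as cp1252: each accented letter turns into two characters.
garbled = original.encode("utf-8").decode("cp1252")

# Reversing the wrong decode recovers the text.
repaired = garbled.encode("cp1252").decode("utf-8")

# cp1252 bytes read as UTF-8 are invalid; errors="replace" substitutes U+FFFD.
replaced = original.encode("cp1252").decode("utf-8", errors="replace")

if __name__ == "__main__":
    print(garbled)         # cafÃ©
    print(repaired)        # café
    print(repr(replaced))  # 'caf\ufffd'
```

The first case is recoverable by choosing the right codepage and re-analyzing. The second has already lost the original byte, which is why U+FFFD is treated as an error.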

Advanced output: an ⚙️ expander on the download lets you tune encoding, delimiter, and line terminator. The download filename auto-adjusts (.tsv for tab, .csv otherwise).
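The three knobs map directly onto options of Python's standard `csv` module, as the sketch below shows. `output_filename` and `render` are hypothetical helpers written for this example; the app's own export code may differ.

```python
import csv
import io


def output_filename(stem, delimiter):
    """Use .tsv for tab-delimited output and .csv for everything else."""
    return f"{stem}.tsv" if delimiter == "\t" else f"{stem}.csv"


def render(rows, delimiter=",", lineterminator="\r\n", encoding="utf-8"):
    """Serialize rows with the chosen delimiter, line ending, and encoding."""
    buffer = io.StringIO(newline="")  # no newline translation inside the buffer
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator=lineterminator)
    writer.writerows(rows)
    return buffer.getvalue().encode(encoding)


if __name__ == "__main__":
    rows = [["name", "city"], ["José", "Málaga"]]
    print(output_filename("cleaned", "\t"))
    print(render(rows, delimiter="\t", lineterminator="\n"))
```

Choosing cp1252 as the encoding writes é as the single byte 0xE9, which is what older Excel versions on Western-locale Windows expect.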

5. Output

Every run writes:

  • Cleaned file next to the input (or wherever you specify).
  • Audit file (per-cell changes for text/format tools, match groups for dedup).
  • Timestamped log in logs/.

Original input is never modified.

6. Troubleshooting

  • GUI won't launch / browser doesn't open — wait 10-15 s; manually visit http://127.0.0.1:8501 (or whichever port the launcher window prints). Port-in-use error → close other instances. The launcher walks ports 8501-8550 looking for a free one, so a stale instance can shift the URL.
  • Why does my browser open? — local web app pattern (same as Jupyter, RStudio). Nothing leaves your machine.
  • Windows SmartScreen — click "More info" → "Run anyway". One-time per build until we have an EV-signed cert.
  • macOS "App is damaged" / "developer cannot be verified" — right-click the app → Open → confirm. If the message persists, the file was likely corrupted in transit — re-download. As a last resort: xattr -cr /Applications/DataTools.app clears the quarantine attribute.
  • macOS portable .zip — extracted but won't open — Safari unzips on download by default; if you see a __MACOSX/ folder or ._DataTools.app file you used a different unarchiver. Re-extract with the built-in Archive Utility (right-click the .zip → Open With → Archive Utility) so the .app's metadata is preserved.
  • Windows portable .zip — antivirus quarantines DataTools.exe — your AV doesn't recognize the bundle. Allowlist the extracted folder. The installer .exe trips fewer AV products because it's a known Inno Setup wrapper.
  • Linux AppImage won't run → chmod +x file.AppImage. Missing FUSE → sudo apt install libfuse2.
  • Slow on large file — files over ~100k rows take longer, and a progress bar shows status. Multi-million rows → use the CLI directly.
  • Where does the app store my license / settings? ~/.datatools/ on macOS + Linux, C:\Users\<you>\.datatools\ on Windows. Your input/output files stay where you put them; the app never copies them anywhere else.
  • Need help — email the address on your purchase receipt.
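If the browser did not open and you can't see which port the launcher printed, you can scan the documented range for a local server that answers. The Python sketch below does this with plain TCP connection attempts. `find_listening_ports` is a hypothetical helper, and a hit only means something is listening on that port, not that it is DataTools.

```python
import socket


def find_listening_ports(first=8501, last=8550, host="127.0.0.1", timeout=0.2):
    """Return the ports in [first, last] that accept a TCP connection on `host`."""
    open_ports = []
    for port in range(first, last + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
            probe.settimeout(timeout)
            # connect_ex returns 0 on success and an error code otherwise
            if probe.connect_ex((host, port)) == 0:
                open_ports.append(port)
    return open_ports


if __name__ == "__main__":
    for port in find_listening_ports():
        print(f"http://127.0.0.1:{port}/")
```

Try each printed URL in your browser. If two URLs respond, one of them is probably a stale instance that should be closed.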

7. License

Single-user. See LICENSE.txt.