🌐 Language: English · Español
User Guide
Version: 1.6 · Updated: 2026-05-01
0. First launch — activation
DataTools must be activated before any tools unlock. On first launch you'll see the Activate screen.
Enter your full name + email, paste the license blob from your purchase email (starts with DTLIC1:), and click Activate. Renewal works the same way — paste the renewal blob, click Apply renewal.
Tiers:
| Tier | Tools |
|---|---|
| Lite | Find Duplicates · Clean Text · Standardize Formats |
| Core | All 9 tools |
A Lite user opening a Core-only tool sees an "Upgrade your license" prompt. The home page also shows a 🔒 Locked badge on tool cards your tier doesn't unlock. To upgrade, paste a Core blob on the Activate page.
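The tier rules above amount to a set-membership check. The following is a minimal sketch only: `ALL_TOOLS`, `TIER_TOOLS`, `can_open`, and `locked_tools` are hypothetical names chosen for illustration, not DataTools internals. The tool names are the nine listed in Section 2.

```python
# Hypothetical sketch of the tier gate; names are illustrative, not DataTools internals.
ALL_TOOLS = [
    "Find Duplicates", "Clean Text", "Standardize Formats",
    "Fix Missing Values", "Map Columns", "Find Unusual Values",
    "Combine Files", "Quality Check", "Automated Workflows",
]

# Lite unlocks the first three tools; Core unlocks all nine.
TIER_TOOLS = {
    "Lite": set(ALL_TOOLS[:3]),
    "Core": set(ALL_TOOLS),
}


def can_open(tier: str, tool: str) -> bool:
    """Return True if the given tier unlocks the given tool."""
    return tool in TIER_TOOLS.get(tier, set())


def locked_tools(tier: str) -> list[str]:
    """Tools that would show the Locked badge for this tier."""
    return [tool for tool in ALL_TOOLS if not can_open(tier, tool)]


if __name__ == "__main__":
    print(can_open("Lite", "Clean Text"))
    print(locked_tools("Lite"))
```

An unknown tier unlocks nothing in this sketch, which is the conservative default for a gate like this.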
Every license lasts 1 year. The sidebar shows your tier and days remaining at all times; a renewal warning appears 30 days before expiry. The license file lives at ~/.datatools/license.json (Windows: C:\Users\<you>\.datatools\license.json).
To use the same license on a different machine: deactivate this one (Activate page → Deactivate this device) and re-paste your blob on the new machine.
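The expiry arithmetic and the license location described above can be sketched as follows. The function names are hypothetical, and the expiry date is passed in directly because the layout of license.json is not documented in this guide; the sketch deliberately does not read the file.

```python
from datetime import date
from pathlib import Path

# From this guide: the renewal warning appears 30 days before expiry.
RENEWAL_WARNING_DAYS = 30


def license_path() -> Path:
    """Location of the license file under the user's home directory.

    Path.home() resolves to the user profile folder on Windows and to ~ on
    macOS and Linux, which matches the two paths given in this guide.
    """
    return Path.home() / ".datatools" / "license.json"


def days_remaining(expires: date, today: date) -> int:
    """Whole days left before expiry, never negative."""
    return max((expires - today).days, 0)


def needs_renewal_warning(expires: date, today: date) -> bool:
    """True once the license is inside the 30-day warning window."""
    return days_remaining(expires, today) <= RENEWAL_WARNING_DAYS


if __name__ == "__main__":
    print(license_path())
    print(days_remaining(date(2027, 5, 1), date(2026, 5, 1)))
```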
1. Install
You don't need Python and you don't need admin rights — the bundle ships its own interpreter and every dependency. Two flavors per OS, pick whichever your IT policy allows:
- Installer — wires up Desktop shortcut + Start Menu / Launchpad entry automatically. Recommended for most users.
- Portable .zip — unzip and double-click. No registry writes, runs from anywhere (Desktop, USB stick, network share). Use this if you can't run installers, want a single-folder install you can copy between machines, or are evaluating before committing to install.
Both flavors are byte-identical inside: same Python, same dependencies, same launch behavior.
1.1 Windows
Option A — Installer (DataTools-<ver>-win-setup.exe)
- Download DataTools-<ver>-win-setup.exe from your release email or GitHub Releases.
- Double-click the installer. On the first run Windows SmartScreen will say "Windows protected your PC"; click More info → Run anyway. (This warning only appears once per build until we have an EV code-signing cert.)
- Accept the per-user install location (%LOCALAPPDATA%\Programs\DataTools by default, no admin prompt). Check Create a desktop shortcut if you want one (on by default).
- Click Install, then Finish. The installer offers to launch DataTools immediately.
- From now on launch from: Start Menu → DataTools, the Desktop shortcut, or just type DataTools into Windows Run (Win+R) / cmd.
To pin to the taskbar, launch the app once, right-click its icon in the taskbar, then Pin to taskbar. Windows requires this manual step — no installer is allowed to pin programmatically.
Option B — Portable (DataTools-<ver>-win-portable.zip)
- Download DataTools-<ver>-win-portable.zip.
- Right-click the .zip → Extract All… → pick a folder (e.g. C:\Tools\DataTools).
- Open the extracted DataTools\ folder, double-click DataTools.exe. SmartScreen warning fires the first time only.
- To create your own desktop shortcut later: right-click DataTools.exe → Send to → Desktop (create shortcut).
Uninstall (installer only): Settings → Apps → DataTools → Uninstall. Portable: delete the folder.
1.2 macOS
Option A — Installer DMG (DataTools-<ver>-mac.dmg)
- Download DataTools-<ver>-mac.dmg.
- Double-click the .dmg. A Finder window opens showing the DataTools icon and an Applications alias.
- Drag DataTools onto Applications. Wait for the copy to finish, then eject the DMG.
- On unsigned builds the first launch shows "DataTools" cannot be opened because the developer cannot be verified. Fix: right-click DataTools in /Applications → Open → confirm Open in the dialog. macOS remembers this choice, so subsequent launches are clean.
- Launch from Launchpad, Spotlight (⌘ Space → type "DataTools"), or Applications in Finder.
To keep DataTools in the Dock: launch the app, right-click its Dock icon → Options → Keep in Dock. macOS doesn't allow installers to pin to the Dock automatically.
Option B — Portable (DataTools-<ver>-mac-portable.zip)
- Download DataTools-<ver>-mac-portable.zip. Safari auto-unzips on download; in Finder you'll see DataTools.app directly.
- Move DataTools.app to Applications if you want it discoverable via Launchpad, or keep it on your Desktop, a USB stick, or a network share. The portable .app runs from anywhere.
- Double-click DataTools.app. Right-click → Open the first time (same unsigned-build dance as the DMG).
Uninstall: drag DataTools.app to the Trash. Your data files stay where you put them — nothing else is installed.
1.3 Linux
DataTools-<ver>-linux-x86_64.AppImage is already portable — no separate zip needed.
- Download the .AppImage.
- chmod +x DataTools-*.AppImage.
- Double-click, or run it from a terminal.
If your distro doesn't ship FUSE 2: sudo apt install libfuse2 (Debian/Ubuntu) or equivalent.
1.4 What happens on first launch
The launcher (called DataTools.exe / DataTools.app / DataTools.AppImage) does three things, in order:
- Picks a free TCP port on 127.0.0.1: usually 8501, falling back through 8502, 8503, … if another app is using 8501.
- Starts a local Streamlit server on that port. The server is bound to localhost only, never to your LAN.
- Opens your default browser at http://127.0.0.1:<port>/. If the browser doesn't open within 5 seconds, paste that URL into your browser manually.
The launcher window stays open in the background. Closing it stops the server — the browser tab will say "this site can't be reached" the next time you click it.
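The port-picking step can be illustrated with a short sketch. This is not the launcher's actual code: `find_free_port` is a hypothetical name, and the probe-by-binding approach is an assumption. The 8501 start and the 8550 upper bound are the values quoted in this guide.

```python
import socket


def find_free_port(host: str = "127.0.0.1", first: int = 8501, last: int = 8550) -> int:
    """Return the first port in [first, last] that can be bound on host.

    Each candidate is probed by binding a throwaway socket to it. A port that
    is already in use raises OSError, and the loop moves on to the next one.
    """
    for port in range(first, last + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
            try:
                probe.bind((host, port))
            except OSError:
                continue  # in use (or not permitted): try the next port
        # The probe socket is closed here, so the caller can bind the port.
        return port
    raise RuntimeError(f"no free port between {first} and {last}")


if __name__ == "__main__":
    port = find_free_port()
    print(f"http://127.0.0.1:{port}/")
```

Binding to 127.0.0.1 rather than 0.0.0.0 is what keeps the server reachable from the local machine only. Note that a probe like this has a small race window: another process could take the port between the probe and the real server start.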
1.5 How the GUI works
- Runs locally on your machine. No internet, no upload.
- The browser is just the display surface. Closing it does NOT stop the app — close the launcher window (or quit the macOS .app from the Dock) to fully exit.
- Prefer the terminal? Every tool ships with a CLI too (Section 3).
1.6 System requirements
- Windows 10/11 (64-bit), macOS 11+, modern Linux (2020+).
- Modern browser (Chrome, Edge, Firefox, Safari, last 3 years).
- ~500 MB free disk space (the bundle itself is ~300 MB; the rest is working scratch space for large CSVs).
OCR for scanned PDFs is bundled — Tesseract 5.5 + the English eng.traineddata model ship inside every installer / portable / AppImage. The PDF Extractor's scanned-statement path works out of the box; no separate install required. (Developers running from a pip install -r requirements.txt checkout still need system Tesseract on PATH — see DEVELOPER.md §PDF Extractor — bundled Tesseract.)
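For source checkouts, a quick way to confirm that a system Tesseract is visible on PATH is sketched below. It checks PATH only; the bundled builds use their own discovery order (documented in DEVELOPER.md), which this sketch does not reproduce. The function name `system_tesseract` is hypothetical.

```python
import shutil
from typing import Optional


def system_tesseract() -> Optional[str]:
    """Return the full path of the tesseract executable on PATH, or None."""
    return shutil.which("tesseract")


if __name__ == "__main__":
    found = system_tesseract()
    if found is None:
        print("tesseract not found on PATH; install it before running OCR from source")
    else:
        print(f"tesseract found at {found}")
```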
Full numbered support matrix: REQUIREMENTS.md.
2. What's included
| # | Tool | Purpose | Status |
|---|---|---|---|
| 01 | Find Duplicates | Exact + fuzzy match, 5 normalizers, audit | Ready |
| 02 | Clean Text | Whitespace, smart chars, BOM, line endings, case ops | Ready |
| 03 | Standardize Formats | Dates / phones / emails / addresses / names / currencies / booleans | Ready |
| 04 | Fix Missing Values | Disguised nulls, imputation, drop-by-threshold | Coming Soon |
| 05 | Map Columns | Rename + enforce schema | Coming Soon |
| 06 | Find Unusual Values | z-score, IQR, multivariate | Coming Soon |
| 07 | Combine Files | Combine multiple files | Coming Soon |
| 08 | Quality Check | Rules + PDF/Excel report | Coming Soon |
| 09 | Automated Workflows | One-click multi-tool launcher | Coming Soon |
Sample data (samples/): messy_sales.csv, bank_export.xlsx.
3. Usage
3.1 GUI (recommended)
- Launch the bundle.
- Pick a tool from the sidebar.
- Drop your file (or select a sample).
- Defaults are pre-filled — click Run to preview.
- Click Save Output to write the cleaned file.
Advanced options are tucked in expander panes. The original file is never modified.
In-tool Help: every tool page has a Help button right of the title. Click it to open a popover with a compact how-to (When to use · Steps · Examples · Tip). Use it as a refresher mid-task — the popover closes when you click outside, your inputs are untouched.
Sidebar nav: the sidebar groups tools into sections (Analysis, Data Cleaners, Transformations, Automations). Each section header shows + when collapsed and − when expanded — click the header to toggle.
3.2 CLI
deduplicator customers.csv [--apply]
text-cleaner messy.csv [--apply]
format-standardize feed.csv [--apply]
Get help: deduplicator --help. Full reference: CLI-REFERENCE.md.
3.3 Run order (when running tools manually)
If you skip Automated Workflows, follow this order:
- 02 Clean Text first — normalizes whitespace + special chars.
- 03 Standardize Formats — dates, phones, etc. need cleaned text.
- 04 Fix Missing Values — sentinel codes hide as numbers.
- 05 Map Columns — schema before outlier stats.
- 06 Find Unusual Values needs clean numerics. Stats on data with NaN or -999 are mathematically poisoned.
- 07 Combine Files, 08 Quality Check as needed.
- 01 Find Duplicates is order-flexible (normalizes internally for matching).
Automated Workflows enforces this automatically.
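The manual run order can be expressed as a simple ranking. This is a sketch under stated assumptions, not how Automated Workflows is implemented: `RUN_ORDER` and `order_tools` are hypothetical names, and placing 01 Find Duplicates last is an arbitrary choice, since this guide describes it as order-flexible.

```python
# Recommended order from this guide. 01 Find Duplicates is order-flexible;
# putting it last is an assumption made for this sketch.
RUN_ORDER = [
    "02 Clean Text",
    "03 Standardize Formats",
    "04 Fix Missing Values",
    "05 Map Columns",
    "06 Find Unusual Values",
    "07 Combine Files",
    "08 Quality Check",
    "01 Find Duplicates",
]


def order_tools(selected: list[str]) -> list[str]:
    """Sort the selected tools into the recommended run order."""
    rank = {name: position for position, name in enumerate(RUN_ORDER)}
    unknown = [name for name in selected if name not in rank]
    if unknown:
        raise ValueError(f"unknown tools: {unknown}")
    # set() drops duplicates; sorting by rank restores the recommended order.
    return sorted(set(selected), key=rank.__getitem__)


if __name__ == "__main__":
    print(order_tools(["06 Find Unusual Values", "01 Find Duplicates", "02 Clean Text"]))
```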
3.4 Language
The sidebar has a Language / Idioma picker. Two packs ship today:
- English (default)
- Español
Pick a language once — the choice persists for the session and the picker is visible from every page. Switch any time; the page re-renders in place with no data loss.
Coverage (v1.6): home page, tool cards, the upload + analysis panel, the findings list, the Review & Normalize gate prompt, the sidebar picker, and the shutdown screen. Per-tool page bodies (advanced-option labels, column-mapper prompts, dedup review labels) are tracked for future packs — they currently render in English in both modes. If a string you'd expect to switch doesn't, that's a missing pack key, not a bug in the picker; email support with a screenshot.
4. Review & Normalize gate
Every uploaded file is scanned before any tool sees it.
Confidence tiers:
- High — round-trip safe. One-click "Auto-fix high-confidence" applies them all.
- Medium — usually right, occasional false positives. Preview first.
- Low — heuristic. Off by default; opt in per finding.
- Error — blocks the gate (empty file, U+FFFD, unrepairable rows).
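How the tiers drive the gate can be sketched in a few lines. The `Finding` class and both function names are hypothetical and only illustrate the rules listed above: an Error finding blocks the gate, and the one-click auto-fix applies High findings only.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    """Hypothetical shape of one gate finding (illustrative only)."""
    description: str
    confidence: str  # "high", "medium", "low", or "error"


def gate_blocked(findings: list[Finding]) -> bool:
    """The gate is blocked as soon as any finding has Error confidence."""
    return any(f.confidence == "error" for f in findings)


def auto_fix_candidates(findings: list[Finding]) -> list[Finding]:
    """Findings the one-click auto-fix would apply: High confidence only."""
    return [f for f in findings if f.confidence == "high"]


if __name__ == "__main__":
    findings = [
        Finding("trailing whitespace", "high"),
        Finding("possible date column", "medium"),
        Finding("maybe a currency code", "low"),
    ]
    print(gate_blocked(findings))
    print([f.description for f in auto_fix_candidates(findings)])
```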
Encoding override: when the picker reports encoding_uncertain or you spot mojibake (é) or <EFBFBD> chars, choose the right codepage at the top of the page (cp1252 for Western Excel, KOI8-R for older Russian, Big5 for traditional Chinese, …) → Re-analyze.
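The two symptoms named above come from decoding bytes with the wrong codec. The sketch below reproduces both with Python's built-in codecs; it illustrates the failure mode and the fix, and is not the gate's own detection logic.

```python
original = "café"

# Symptom 1: a cp1252 file decoded as UTF-8. The byte 0xE9 is not valid
# UTF-8 on its own, so it becomes the U+FFFD replacement character.
cp1252_bytes = original.encode("cp1252")
wrong_as_utf8 = cp1252_bytes.decode("utf-8", errors="replace")

# Symptom 2: a UTF-8 file decoded as cp1252. The two bytes of "é"
# (0xC3 0xA9) are read as two separate characters: mojibake.
utf8_bytes = original.encode("utf-8")
mojibake = utf8_bytes.decode("cp1252")

# The fix in both cases is to decode with the codec the file was written in,
# which is what choosing the right codepage and re-analyzing does.
fixed = cp1252_bytes.decode("cp1252")

if __name__ == "__main__":
    print(repr(wrong_as_utf8))
    print(repr(mojibake))
    print(repr(fixed))
```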
Advanced output: an ⚙️ expander on the download lets you tune encoding, delimiter, and line terminator. The download filename auto-adjusts (.tsv for tab, .csv otherwise).
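The three output options map directly onto parameters of Python's csv module. The sketch below shows that mapping and the filename rule; `output_filename` and `write_rows` are hypothetical helpers, not the app's download code.

```python
import csv
from pathlib import Path


def output_filename(stem: str, delimiter: str) -> str:
    """Filename rule from this guide: .tsv for tab, .csv otherwise."""
    return f"{stem}.tsv" if delimiter == "\t" else f"{stem}.csv"


def write_rows(rows, directory, stem, *, encoding="utf-8",
               delimiter=",", lineterminator="\r\n") -> Path:
    """Write rows using the chosen encoding, delimiter, and line terminator."""
    path = Path(directory) / output_filename(stem, delimiter)
    # newline="" stops Python from translating the line terminator itself,
    # so the csv writer's lineterminator is written exactly as given.
    with open(path, "w", newline="", encoding=encoding) as handle:
        writer = csv.writer(handle, delimiter=delimiter, lineterminator=lineterminator)
        writer.writerows(rows)
    return path


if __name__ == "__main__":
    out = write_rows([["name", "city"], ["Ana", "Lima"]], ".", "cleaned",
                     delimiter="\t", lineterminator="\n")
    print(out.name)
```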
5. Output
Every run writes:
- Cleaned file next to the input (or wherever you specify).
- Audit file (per-cell changes for text/format tools, match groups for dedup).
- Timestamped log in logs/.
Original input is never modified.
6. Troubleshooting
- GUI won't launch / browser doesn't open: wait 10-15 s, then manually visit http://127.0.0.1:8501 (or whichever port the launcher window prints). Port-in-use error → close other instances. The launcher walks ports 8501–8550 looking for a free one, so a stale instance can shift the URL.
- Why does my browser open? It's the local web app pattern (same as Jupyter, RStudio). Nothing leaves your machine.
- Windows SmartScreen: click "More info" → "Run anyway". One-time per build until we have an EV-signed cert.
- macOS "App is damaged" / "developer cannot be verified": right-click the app → Open → confirm. If the message persists, the file was likely corrupted in transit; re-download. As a last resort, xattr -cr /Applications/DataTools.app clears the quarantine attribute.
- macOS portable .zip, extracted but won't open: Safari unzips on download by default; if you see a __MACOSX/ folder or ._DataTools.app file you used a different unarchiver. Re-extract with the built-in Archive Utility (right-click the .zip → Open With → Archive Utility) so the .app's metadata is preserved.
- Windows portable .zip, antivirus quarantines DataTools.exe: your AV doesn't recognize the bundle. Allowlist the extracted folder. The installer .exe trips fewer AV products because it's a known Inno Setup wrapper.
- Linux AppImage won't run: chmod +x file.AppImage. Missing FUSE → sudo apt install libfuse2.
- Slow on large file: over ~100k rows takes longer; progress bar shows. Multi-million rows → use the CLI directly.
- Where does the app store my license / settings? ~/.datatools/ on macOS + Linux, C:\Users\<you>\.datatools\ on Windows. Your input/output files stay where you put them; the app never copies them anywhere else.
- Need help: email the address on your purchase receipt.
7. License
Single-user. See LICENSE.txt.