Files
datatools-dev/README.md
Michael 4706ed571e build: wire desktop-bundle pipeline (CI matrix + per-platform installers)
Stand up the seamless-download path for non-technical buyers:

* .github/workflows/build.yml — matrix CI (mac/win/linux) that builds
  PyInstaller bundles and packages them per platform on tag push,
  attaching the resulting installers to a GitHub Release.
* build/installer.iss — Inno Setup script for the Windows installer
  (per-user install, optional desktop shortcut, runs on finish).
* build/macos/build_dmg.sh — wraps DataTools.app into a .dmg with a
  drag-to-/Applications layout.
* build/appimage/{AppRun,datatools.desktop,build.sh} — AppImage recipe.
* src/__init__.py — single source of truth for __version__; the spec
  reads it (was hardcoded), CI passes it through to all packagers.

Buyer download path now lives in the top-level README. Per-build
README documents the Phase 2 step (signing/notarization) that needs
the owner's Apple Developer + Windows code-signing credentials —
those are intentionally not in CI yet because they require setup
outside this repo.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 13:58:43 +00:00

88 lines
3.8 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# DataTools
Local CSV / Excel cleaning. CLI + browser GUI, no cloud, no install ceremony.
## Tools
| # | Tool | Status |
|---|------|--------|
| 01 | **Deduplicator** — exact + fuzzy match, 5 normalizers, survivor rules, audit | Ready |
| 02 | **Text Cleaner** — whitespace, smart chars, BOM, line endings, case ops | Ready |
| 03 | **Format Standardizer** — dates, phones, emails, addresses, names, currencies, booleans | Ready |
| 04 | **Missing Value Handler** — disguised-null detection, profile, mean/median/mode/ffill/bfill/interpolate, drop strategies | Ready |
| 05 | **Column Mapper** — fuzzy auto-rename, target schema with type coercion, required fields with defaults, drop/reorder | Ready |
| 06 | Outlier Detector | Coming Soon |
| 07 | Multi-File Merger | Coming Soon |
| 08 | Validator & Reporter | Coming Soon |
| 09 | **Pipeline Runner** — chain tools with recommended (not forced) order, save/load JSON, automate weekly cleanups | Ready |
## Download (non-technical users)
Pre-built installers — no Python required:
| Platform | Download | First-launch note |
|---|---|---|
| **macOS** | `DataTools-X.Y.Z-mac.dmg` | Drag DataTools.app into /Applications, then double-click. |
| **Windows** | `DataTools-X.Y.Z-win-setup.exe` | Run the installer; launches from Start Menu. |
| **Linux** | `DataTools-X.Y.Z-linux-x86_64.AppImage` | `chmod +x` the file, then double-click. |
Latest release: see [GitHub Releases](https://git.invixiom.com/giteadmin/datatools-dev/releases) (or the Gumroad listing). The installers are ~150200 MB; the launcher boots a local server at http://127.0.0.1:8501 and opens your browser. Nothing is sent to the cloud.
## Install from source (developers)
```bash
pip install -r requirements.txt
```
Python 3.10+ required.
## Run
**GUI** (recommended):
```bash
streamlit run src/gui/app.py
```
**CLI** — seven entry points:
```bash
python -m src.cli customers.csv [--apply] # dedup
python -m src.cli_text_clean messy.csv [--apply] # text clean
python -m src.cli_format intl.csv [--apply] # format standardize (auto-streams >100 MB)
python -m src.cli_missing holes.csv [--apply] # missing values
python -m src.cli_column_map vendor.csv [--apply] # column mapper
python -m src.cli_pipeline any_file.csv [--apply] # chain tools end-to-end
python -m src.cli_analyze any_file.csv [--json] # scan only
```
Every CLI runs preview-only by default; add `--apply` to write output.
## Review & Normalize gate
Every uploaded file passes through a CSV-normalization gate before any tool sees it. The analyzer flags ~15 issue types (whitespace, NBSP / zero-width chars, BOM, encoding, smart punct, dirty headers, null sentinels, mojibake, …) tagged by **confidence** (high / medium / low) and **fix action**. The GUI shows each finding with Auto-fix / Skip / Customize, a live before/after preview, and an encoding-override picker. Tool pages refuse to load until the gate passes.
## Output
Every run writes:
- `{input}_<tool>.csv` — the cleaned data
- `{input}_changes.csv` (text cleaner) or `{input}_match_groups.csv` (dedup) — audit trail
- `logs/<tool>_YYYYMMDD_HHMMSS.log` — debug-level run log
Original input file is never modified.
## Docs
- [User Guide](docs/USER-GUIDE.md) — install, GUI workflow, gate
- [CLI Reference](docs/CLI-REFERENCE.md) — every flag with recipes
- [Requirements](docs/REQUIREMENTS.md) — file sizes, encodings, detectors, perf targets
- [Technical](docs/TECHNICAL.md) — architecture, gate internals, fix registry
- [Developer Guide](docs/DEVELOPER.md) — adding fixes / detectors / standardizers
## Dependencies
`pandas`, `openpyxl`, `rapidfuzz`, `phonenumbers`, `typer`, `loguru`, `charset-normalizer`, `streamlit`. Optional: `ftfy` for mojibake repair.
## License
Proprietary.