DataTools
Local CSV / Excel cleaning. CLI + browser GUI, no cloud, no install ceremony.
Tools
| # | Tool | Status |
|---|---|---|
| 01 | Deduplicator — exact + fuzzy match, 5 normalizers, survivor rules, audit | Ready |
| 02 | Text Cleaner — whitespace, smart chars, BOM, line endings, case ops | Ready |
| 03 | Format Standardizer — dates, phones, emails, addresses, names, currencies, booleans | Ready |
| 04 | Missing Value Handler — disguised-null detection, profile, mean/median/mode/ffill/bfill/interpolate, drop strategies | Ready |
| 05 | Column Mapper — fuzzy auto-rename, target schema with type coercion, required fields with defaults, drop/reorder | Ready |
| 06 | Outlier Detector | Coming Soon |
| 07 | Multi-File Merger | Coming Soon |
| 08 | Validator & Reporter | Coming Soon |
| 09 | Pipeline Runner — chain tools with recommended (not forced) order, save/load JSON, automate weekly cleanups | Ready |
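To make the "exact + fuzzy match" idea behind the Deduplicator concrete, here is a minimal sketch. It is not the project's implementation: the project lists rapidfuzz as a dependency, while this sketch swaps in the standard library's `difflib.SequenceMatcher` to stay dependency-free. The `normalize` function, the `find_matches` helper, and the 0.85 threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def normalize(value: str) -> str:
    """Hypothetical normalizer: lowercase and collapse whitespace runs."""
    return " ".join(value.lower().split())


def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1], computed on the normalized values."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def find_matches(names, threshold=0.85):
    """Return (i, j, kind) for every pair that matches exactly or fuzzily.

    The threshold is an assumed value, not one taken from the project.
    """
    matches = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if normalize(names[i]) == normalize(names[j]):
                matches.append((i, j, "exact"))
            elif similarity(names[i], names[j]) >= threshold:
                matches.append((i, j, "fuzzy"))
    return matches


names = ["Acme Corp", "  acme   corp ", "Acme Corp.", "Globex"]
matches = find_matches(names)
for i, j, kind in matches:
    print(f"{kind}: {names[i]!r} ~ {names[j]!r}")
```

The pairwise loop is quadratic, which is fine for a sketch but would likely need blocking or indexing on larger files.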
Install
pip install -r requirements.txt
Python 3.10+ required.
Run
GUI (recommended):
streamlit run src/gui/app.py
CLI — seven entry points:
python -m src.cli customers.csv [--apply] # dedup
python -m src.cli_text_clean messy.csv [--apply] # text clean
python -m src.cli_format intl.csv [--apply] # format standardize (auto-streams >100 MB)
python -m src.cli_missing holes.csv [--apply] # missing values
python -m src.cli_column_map vendor.csv [--apply] # column mapper
python -m src.cli_pipeline any_file.csv [--apply] # chain tools end-to-end
python -m src.cli_analyze any_file.csv [--json] # scan only
Every cleaning CLI runs preview-only by default; add --apply to write output. cli_analyze is scan-only and takes --json instead.
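The preview-by-default pattern can be sketched as follows. This is an illustration only: the project lists typer as a dependency, but the sketch uses the standard library's `argparse` so it runs without extra installs. The `clean_rows` step, the `_text_clean` output suffix, and the demo file are assumptions, not the project's actual code.

```python
import argparse
import csv
import tempfile
from pathlib import Path


def clean_rows(rows):
    """Stand-in cleaning step (assumed): strip whitespace from every cell."""
    return [[cell.strip() for cell in row] for row in rows]


def main(argv):
    parser = argparse.ArgumentParser(description="preview/apply sketch")
    parser.add_argument("input", type=Path)
    parser.add_argument("--apply", action="store_true",
                        help="write output; without it, only preview")
    args = parser.parse_args(argv)

    with args.input.open(newline="", encoding="utf-8") as fh:
        rows = list(csv.reader(fh))
    cleaned = clean_rows(rows)
    changed = sum(1 for before, after in zip(rows, cleaned) if before != after)

    if not args.apply:
        # Preview mode: report what would change and write nothing.
        print(f"preview: {changed} row(s) would change; nothing written")
        return None

    # Apply mode: write to a new file next to the input, never over it.
    out_path = args.input.with_name(f"{args.input.stem}_text_clean.csv")
    with out_path.open("w", newline="", encoding="utf-8") as fh:
        csv.writer(fh).writerows(cleaned)
    print(f"applied: {changed} row(s) changed; wrote {out_path.name}")
    return out_path


# Demo on a throwaway file so the sketch is runnable end to end.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "messy.csv"
    original = "name,city\n  Ann ,Oslo\n"
    src.write_text(original, encoding="utf-8")

    preview_result = main([str(src)])
    written_after_preview = (Path(tmp) / "messy_text_clean.csv").exists()

    apply_result = main([str(src), "--apply"])
    written_after_apply = apply_result.exists()
    original_untouched = src.read_text(encoding="utf-8") == original
```

Making the destructive path opt-in means a mistyped command costs nothing: the worst case is an unwanted preview.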
Review & Normalize gate
Every uploaded file passes through a CSV-normalization gate before any tool sees it. The analyzer flags ~15 issue types (whitespace, NBSP / zero-width chars, BOM, encoding, smart punctuation, dirty headers, null sentinels, mojibake, …), each tagged with a confidence level (high / medium / low) and a fix action. The GUI shows each finding with Auto-fix / Skip / Customize, a live before/after preview, and an encoding-override picker. Tool pages refuse to load until the gate passes.
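As a rough picture of what such an analyzer might do, the sketch below checks raw text for a handful of the issue types named above and tags each finding with a confidence and a fix action. Everything here is hypothetical: the `Finding` class, the issue and fix names, the confidence assigned to each check, the sentinel list, and the rule that the gate passes once every finding has a decision are assumptions, not the project's internals.

```python
from dataclasses import dataclass

# Assumed sentinel list; the real analyzer's list is not documented here.
NULL_SENTINELS = {"n/a", "null", "none"}
ZERO_WIDTH = ("\u200b", "\u200c", "\u200d")
SMART_PUNCT = "\u2018\u2019\u201c\u201d"


@dataclass
class Finding:
    issue: str
    confidence: str  # "high", "medium", or "low"
    fix: str         # name of the suggested fix action (hypothetical)


def analyze(text: str) -> list[Finding]:
    """Scan raw CSV text and return one finding per detected issue type."""
    findings = []
    if text.startswith("\ufeff"):
        findings.append(Finding("bom", "high", "strip_bom"))
    if "\u00a0" in text:
        findings.append(Finding("nbsp", "high", "replace_with_space"))
    if any(ch in text for ch in ZERO_WIDTH):
        findings.append(Finding("zero_width", "high", "remove"))
    if any(ch in text for ch in SMART_PUNCT):
        findings.append(Finding("smart_punct", "medium", "ascii_quotes"))
    # Naive comma split, good enough for a sketch (no quoted-field handling).
    cells = [c.strip().lower() for line in text.splitlines() for c in line.split(",")]
    if any(c in NULL_SENTINELS for c in cells):
        findings.append(Finding("null_sentinel", "low", "convert_to_empty"))
    return findings


def gate_passes(findings, decisions):
    """Assumed rule: the gate passes once every finding has a decision."""
    return all(f.issue in decisions for f in findings)


sample = "\ufeffname,city\nAnn\u00a0Lee,N/A\n"
findings = analyze(sample)
for f in findings:
    print(f"{f.issue}: confidence={f.confidence}, fix={f.fix}")
```

In this sketch a decision of "skip" still counts as resolving a finding, which mirrors the Auto-fix / Skip / Customize choices described above.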
Output
Every run writes:
- {input}_<tool>.csv: the cleaned data
- {input}_changes.csv (text cleaner) or {input}_match_groups.csv (dedup): audit trail
- logs/<tool>_YYYYMMDD_HHMMSS.log: debug-level run log
The original input file is never modified.
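The naming scheme can be sketched with `pathlib`. This is a hedged illustration: it assumes `{input}` means the input file's name without its extension, that outputs land next to the input, and that `dedup` is the tool token. The `output_paths` helper is hypothetical; none of this is confirmed by the project.

```python
from datetime import datetime
from pathlib import Path


def output_paths(input_path, tool, audit_suffix, now=None):
    """Derive cleaned, audit, and log paths without touching the input file."""
    src = Path(input_path)
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return {
        "cleaned": src.with_name(f"{src.stem}_{tool}.csv"),
        "audit": src.with_name(f"{src.stem}_{audit_suffix}.csv"),
        "log": Path("logs") / f"{tool}_{stamp}.log",
    }


paths = output_paths(
    "data/customers.csv",
    tool="dedup",                 # assumed tool token
    audit_suffix="match_groups",  # audit-trail suffix for dedup
    now=datetime(2024, 1, 2, 3, 4, 5),
)
for kind, path in paths.items():
    print(f"{kind}: {path}")
```

Because every output name is derived by adding a suffix, no output path can collide with the input path.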
Docs
- User Guide — install, GUI workflow, gate
- CLI Reference — every flag with recipes
- Requirements — file sizes, encodings, detectors, perf targets
- Technical — architecture, gate internals, fix registry
- Developer Guide — adding fixes / detectors / standardizers
Dependencies
pandas, openpyxl, rapidfuzz, phonenumbers, typer, loguru, charset-normalizer, streamlit. Optional: ftfy for mojibake repair.
License
Proprietary.