Refactors all 10 docs (README, USER-GUIDE, CLI-REFERENCE, REQUIREMENTS, TECHNICAL, DEVELOPER, BUSINESS, DECISIONS, RECOVERY, docs/README) from prose-heavy to bullet-heavy + table-heavy. Same information density, significantly less reading load.

Net: 2600 → 1652 lines (~37% reduction) WHILE adding the new content that landed since v1.6:

- Format Standardizer (3rd Ready tool)
- 199-row buyer corpus
- src/core/errors.py structured hierarchy + ensure_dataframe / ensure_choice / wrap_file_read|write / format_for_user helpers
- src/core/_constants.py shared USPS/state lookup tables
- Cross-tool audit fixes (NaN matching, removed_df schema, validation, enum-bounds checks, forward-compat config)
- Per-domain error_policy across format standardizers
- Inconsistent-date-format detector
- Excel header-row auto-detection + write_file delimiter param

Per-doc changes:

- README.md (175 → 71): 9-tool table at top, status column, 3 CLI entry points listed, dropped repeated marketing prose.
- docs/README.md (38 → 27): pure index, buyer-facing vs creator-only split + version footer.
- USER-GUIDE.md (208 → 118): tool table replaces script descriptions, troubleshooting compressed to bullets, gate explanation tightened.
- CLI-REFERENCE.md (451 → 235): collapsed flag tables, removed redundant intro text, kept full recipes section.
- REQUIREMENTS.md (146 → 129): 18 numbered sections (was 17), added §18 Error Handling, formatting tightened to single-line entries.
- TECHNICAL.md (570 → 350): collapsed §3 build pipeline tables, merged redundant §3.5-3.7 OS sections, added §7 (Error handling) + §11.3 (Format Standardizer spec) + §11.4-11.7 (analyzer / gate / Review page / repair_bytes promoted from §10.2.x sub-numbering).
- DEVELOPER.md (285 → 161): module map table replaces per-file prose, extension recipes condensed, new §Errors covers when to use each hierarchy class.
- BUSINESS.md (278 → 225): collapsed prose to tables (use cases, competitive landscape, costs, risks); honest-status updated.
- DECISIONS.md (269 → 189): scoring rubric + GUI matrix preserved, decision log compressed to single-line entries, added v1.6 entries (Format Standardizer Ready, errors module).
- RECOVERY.md (180 → 147): rebuild steps as numbered + tabular, external dependencies as one table, recovery priorities tightened.

No information removed; redundancy compressed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
72 lines
2.5 KiB
Markdown
# DataTools

Local CSV / Excel cleaning. CLI + browser GUI, no cloud, no install ceremony.

## Tools

| # | Tool | Status |
|---|------|--------|
| 01 | **Deduplicator**: exact + fuzzy match, 5 normalizers, survivor rules, audit | Ready |
| 02 | **Text Cleaner**: whitespace, smart chars, BOM, line endings, case ops | Ready |
| 03 | **Format Standardizer**: dates, phones, emails, addresses, names, currencies, booleans | Ready |
| 04 | Missing Value Handler | Coming Soon |
| 05 | Column Mapper | Coming Soon |
| 06 | Outlier Detector | Coming Soon |
| 07 | Multi-File Merger | Coming Soon |
| 08 | Validator & Reporter | Coming Soon |
| 09 | Pipeline Runner | Coming Soon |

## Install

```bash
pip install -r requirements.txt
```

Python 3.10+ required.

## Run

**GUI** (recommended):
```bash
streamlit run src/gui/app.py
```

**CLI**, three entry points:
```bash
python -m src.cli customers.csv [--apply]          # dedup
python -m src.cli_text_clean messy.csv [--apply]   # text clean
python -m src.cli_analyze any_file.csv [--json]    # scan only
```

The dedup and text-clean CLIs run preview-only by default; add `--apply` to write output. `src.cli_analyze` is scan-only; its optional flag is `--json`.
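
The preview-only default can be illustrated with a minimal sketch. This is not the DataTools implementation: `clean_csv`, `strip_cells`, and the `_cleaned.csv` suffix are hypothetical names chosen for the example, and the only "fix" applied is whitespace trimming.

```python
import csv
from pathlib import Path


def strip_cells(rows):
    """Return rows with leading/trailing whitespace removed from every cell."""
    return [[cell.strip() for cell in row] for row in rows]


def clean_csv(input_path, apply=False):
    """Count the cells a clean would change; write output only if apply=True.

    Hypothetical helper for illustration, not part of the DataTools API.
    """
    src = Path(input_path)
    with src.open(newline="", encoding="utf-8") as fh:
        rows = list(csv.reader(fh))
    cleaned = strip_cells(rows)

    # Preview step: count the cells that would change.
    changed = sum(
        1
        for old_row, new_row in zip(rows, cleaned)
        for old, new in zip(old_row, new_row)
        if old != new
    )

    out_path = None
    if apply:
        # Write to a sibling file (assumed suffix); the input is never touched.
        out_path = src.with_name(f"{src.stem}_cleaned.csv")
        with out_path.open("w", newline="", encoding="utf-8") as fh:
            csv.writer(fh).writerows(cleaned)
    return changed, out_path
```

Calling `clean_csv("messy.csv")` only reports how many cells would change; `clean_csv("messy.csv", apply=True)` also writes the cleaned copy next to the input.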

## Review & Normalize gate

Every uploaded file passes through a CSV-normalization gate before any tool sees it. The analyzer flags ~15 issue types (whitespace, NBSP / zero-width chars, BOM, encoding, smart punctuation, dirty headers, null sentinels, mojibake, …), each tagged with a **confidence** (high / medium / low) and a **fix action**. The GUI shows each finding with Auto-fix / Skip / Customize options, a live before/after preview, and an encoding-override picker. Tool pages refuse to load until the gate passes.
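
As a rough sketch of how such findings might be represented, the example below detects three of the listed issue types and tags each with a confidence and a fix action. Everything here is an assumption for illustration: the `Finding` class, the `analyze` and `gate_passes` functions, the confidence levels assigned, and the rule that the gate passes once every finding has been fixed or skipped are not taken from the DataTools source.

```python
from dataclasses import dataclass

BOM = "\ufeff"
NBSP = "\u00a0"
ZERO_WIDTH = ("\u200b", "\u200c", "\u200d")


@dataclass(frozen=True)
class Finding:
    """One analyzer result (hypothetical shape)."""
    issue: str
    confidence: str  # "high", "medium", or "low"
    fix_action: str
    count: int


def analyze(text):
    """Return findings for BOM, NBSP, and zero-width characters in text."""
    findings = []
    if text.startswith(BOM):
        findings.append(Finding("bom", "high", "strip_bom", 1))
    nbsp_count = text.count(NBSP)
    if nbsp_count:
        findings.append(Finding("nbsp", "high", "replace_with_space", nbsp_count))
    zw_count = sum(text.count(ch) for ch in ZERO_WIDTH)
    if zw_count:
        # Assumed lower confidence: zero-width joiners can be intentional.
        findings.append(Finding("zero_width", "medium", "remove", zw_count))
    return findings


def gate_passes(findings, resolved):
    """Assumed rule: pass once every finding was auto-fixed or skipped."""
    return all(f.issue in resolved for f in findings)
```

A clean file produces no findings, so `gate_passes([], set())` is true; a file with findings stays blocked until each issue name appears in `resolved`.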

## Output

Every run writes:

- `{input}_<tool>.csv`: the cleaned data
- `{input}_changes.csv` (text cleaner) or `{input}_match_groups.csv` (dedup): audit trail
- `logs/<tool>_YYYYMMDD_HHMMSS.log`: debug-level run log

The original input file is never modified.
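
The naming scheme can be sketched as follows. This assumes `{input}` means the input file's name without its extension, that outputs are written next to the input, and that `<tool>` is a short slug such as `dedup`; `output_paths` and its parameters are hypothetical and not part of DataTools.

```python
from datetime import datetime
from pathlib import Path


def output_paths(input_path, tool, audit_suffix, now=None):
    """Build the three output paths for one run (illustrative only).

    tool         -- assumed slug used in file names, e.g. "dedup"
    audit_suffix -- "changes" or "match_groups", per the list above
    now          -- timestamp for the log name; defaults to the current time
    """
    src = Path(input_path)
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return {
        "cleaned": src.with_name(f"{src.stem}_{tool}.csv"),
        "audit": src.with_name(f"{src.stem}_{audit_suffix}.csv"),
        "log": Path("logs") / f"{tool}_{stamp}.log",
    }
```

Because every path is derived from the input name plus a suffix, none of them can collide with the input file itself.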

## Docs

- [User Guide](docs/USER-GUIDE.md): install, GUI workflow, gate
- [CLI Reference](docs/CLI-REFERENCE.md): every flag with recipes
- [Requirements](docs/REQUIREMENTS.md): file sizes, encodings, detectors, perf targets
- [Technical](docs/TECHNICAL.md): architecture, gate internals, fix registry
- [Developer Guide](docs/DEVELOPER.md): adding fixes / detectors / standardizers

## Dependencies

`pandas`, `openpyxl`, `rapidfuzz`, `phonenumbers`, `typer`, `loguru`, `charset-normalizer`, `streamlit`. Optional: `ftfy` for mojibake repair.

## License

Proprietary.