datatools-dev/README.md
Michael abb720997e docs: tight, scannable rewrite — every item earns its place
Refactors all 10 docs (README, USER-GUIDE, CLI-REFERENCE, REQUIREMENTS,
TECHNICAL, DEVELOPER, BUSINESS, DECISIONS, RECOVERY, docs/README) from
prose-heavy to bullet-heavy + table-heavy. Same information density,
significantly less reading load.

Net: 2600 → 1652 lines (~37% reduction) WHILE adding the new content
that landed since v1.6:

- Format Standardizer (3rd Ready tool)
- 199-row buyer corpus
- src/core/errors.py structured hierarchy + ensure_dataframe /
  ensure_choice / wrap_file_read|write / format_for_user helpers
- src/core/_constants.py shared USPS/state lookup tables
- Cross-tool audit fixes (NaN matching, removed_df schema, validation,
  enum-bounds checks, forward-compat config)
- Per-domain error_policy across format standardizers
- Inconsistent-date-format detector
- Excel header-row auto-detection + write_file delimiter param

Per-doc changes:

- README.md (175 → 71): 9-tool table at top, status column, 3 CLI
  entry points listed, dropped repeated marketing prose.
- docs/README.md (38 → 27): pure index — buyer-facing vs creator-only
  split + version footer.
- USER-GUIDE.md (208 → 118): tool table replaces script descriptions,
  troubleshooting compressed to bullets, gate explanation tightened.
- CLI-REFERENCE.md (451 → 235): collapsed flag tables, removed
  redundant intro text, kept full recipes section.
- REQUIREMENTS.md (146 → 129): 18 numbered sections (was 17), added
  §18 Error Handling, formatting tightened to single-line entries.
- TECHNICAL.md (570 → 350): collapsed §3 build pipeline tables, merged
  redundant §3.5-3.7 OS sections, added §7 (Error handling) +
  §11.3 (Format Standardizer spec) + §11.4-11.7 (analyzer / gate /
  Review page / repair_bytes promoted from §10.2.x sub-numbering).
- DEVELOPER.md (285 → 161): module map table replaces per-file prose,
  extension recipes condensed, new §Errors covers when to use each
  hierarchy class.
- BUSINESS.md (278 → 225): collapsed prose to tables (use cases,
  competitive landscape, costs, risks); honest-status updated.
- DECISIONS.md (269 → 189): scoring rubric + GUI matrix preserved,
  decision log compressed to single-line entries, added v1.6 entries
  (Format Standardizer Ready, errors module).
- RECOVERY.md (180 → 147): rebuild steps as numbered + tabular,
  external dependencies as one table, recovery priorities tightened.

No information removed; redundancy compressed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 02:49:29 +00:00

# DataTools
Local CSV / Excel cleaning. CLI + browser GUI, no cloud, no install ceremony.
## Tools
| # | Tool | Status |
|---|------|--------|
| 01 | **Deduplicator** — exact + fuzzy match, 5 normalizers, survivor rules, audit | Ready |
| 02 | **Text Cleaner** — whitespace, smart chars, BOM, line endings, case ops | Ready |
| 03 | **Format Standardizer** — dates, phones, emails, addresses, names, currencies, booleans | Ready |
| 04 | Missing Value Handler | Coming Soon |
| 05 | Column Mapper | Coming Soon |
| 06 | Outlier Detector | Coming Soon |
| 07 | Multi-File Merger | Coming Soon |
| 08 | Validator & Reporter | Coming Soon |
| 09 | Pipeline Runner | Coming Soon |
## Install
```bash
pip install -r requirements.txt
```
Python 3.10+ required.
## Run
**GUI** (recommended):
```bash
streamlit run src/gui/app.py
```
**CLI** — three entry points:
```bash
python -m src.cli customers.csv [--apply] # dedup
python -m src.cli_text_clean messy.csv [--apply] # text clean
python -m src.cli_analyze any_file.csv [--json] # scan only
```
Every CLI runs preview-only by default; add `--apply` to write output.
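
The preview-by-default convention can be sketched as follows. This is a minimal illustration and not the project's code: it uses the standard library's `argparse` in place of `typer`, and `build_parser` and `run` are hypothetical names.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser: one positional input file plus an opt-in --apply flag."""
    parser = argparse.ArgumentParser(prog="example-cli")
    parser.add_argument("input_file", help="CSV or Excel file to process")
    # store_true defaults to False, so a bare invocation is preview-only.
    parser.add_argument("--apply", action="store_true", help="write output files")
    return parser


def run(argv: list[str]) -> str:
    """Return the mode the arguments select; a real tool would act on it."""
    args = build_parser().parse_args(argv)
    return "apply" if args.apply else "preview"


if __name__ == "__main__":
    print(run(["customers.csv"]))             # preview
    print(run(["customers.csv", "--apply"]))  # apply
```

Making the write path opt-in means a mistyped command can at worst print a preview.
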
## Review & Normalize gate
Every uploaded file passes through a CSV-normalization gate before any tool sees it. The analyzer flags ~15 issue types (whitespace, NBSP / zero-width chars, BOM, encoding, smart punctuation, dirty headers, null sentinels, mojibake, …), each tagged with a **confidence** (high / medium / low) and a **fix action**. The GUI shows each finding with Auto-fix / Skip / Customize, a live before/after preview, and an encoding-override picker. Tool pages refuse to load until the gate passes.
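
As a rough sketch of how a finding might carry a confidence and a fix action, consider the snippet below. Everything in it is an assumption made for illustration: the `Finding` class, the `RULES` table, the confidence levels assigned to each character, and the rule that the gate passes once every finding has been resolved. The real analyzer covers more issue types and may be structured differently.

```python
from dataclasses import dataclass

# Hypothetical rule table: character -> (issue name, confidence, fix action).
RULES = {
    "\ufeff": ("bom", "high", "strip"),
    "\u00a0": ("nbsp", "high", "replace_with_space"),
    "\u200b": ("zero_width", "medium", "strip"),
}


@dataclass(frozen=True)
class Finding:
    issue: str
    confidence: str  # "high" / "medium" / "low"
    fix_action: str
    count: int


def analyze(text: str) -> list[Finding]:
    """Return one Finding per rule whose character occurs in the text."""
    findings = []
    for char, (issue, confidence, fix_action) in RULES.items():
        count = text.count(char)
        if count:
            findings.append(Finding(issue, confidence, fix_action, count))
    return findings


def gate_passes(findings: list[Finding], resolved: set[str]) -> bool:
    """Assumed rule: the gate passes once every finding is fixed or skipped."""
    return all(f.issue in resolved for f in findings)


sample = "\ufeffname,city\nAda\u00a0Lovelace,London\u200b\n"
findings = analyze(sample)
print([f.issue for f in findings])                                    # ['bom', 'nbsp', 'zero_width']
print(gate_passes(findings, resolved={"bom"}))                        # False
print(gate_passes(findings, resolved={"bom", "nbsp", "zero_width"}))  # True
```
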
## Output
Every run writes:
- `{input}_<tool>.csv` — the cleaned data
- `{input}_changes.csv` (text cleaner) or `{input}_match_groups.csv` (dedup) — audit trail
- `logs/<tool>_YYYYMMDD_HHMMSS.log` — debug-level run log

The original input file is never modified.
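
The naming scheme above could be derived along these lines. This is a sketch under stated assumptions: `output_paths` is a hypothetical helper, `{input}` is taken to mean the input file's stem, outputs are assumed to land next to the input, and the tool slug `dedup` is only an example.

```python
from datetime import datetime
from pathlib import Path


def output_paths(input_file: str, tool: str, audit_suffix: str, now: datetime) -> dict[str, Path]:
    """Build the cleaned-data, audit-trail, and log paths for one run."""
    src = Path(input_file)
    stem = src.stem  # assumption: {input} is the file name without its extension
    return {
        # New files are derived from the input name, so the input is never overwritten.
        "cleaned": src.with_name(f"{stem}_{tool}.csv"),
        "audit": src.with_name(f"{stem}_{audit_suffix}.csv"),
        "log": Path("logs") / f"{tool}_{now:%Y%m%d_%H%M%S}.log",
    }


paths = output_paths("data/customers.csv", "dedup", "match_groups", datetime(2026, 5, 1, 2, 49, 29))
for kind, path in paths.items():
    print(kind, path.as_posix())
# cleaned data/customers_dedup.csv
# audit data/customers_match_groups.csv
# log logs/dedup_20260501_024929.log
```
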
## Docs
- [User Guide](docs/USER-GUIDE.md) — install, GUI workflow, gate
- [CLI Reference](docs/CLI-REFERENCE.md) — every flag with recipes
- [Requirements](docs/REQUIREMENTS.md) — file sizes, encodings, detectors, perf targets
- [Technical](docs/TECHNICAL.md) — architecture, gate internals, fix registry
- [Developer Guide](docs/DEVELOPER.md) — adding fixes / detectors / standardizers
## Dependencies
`pandas`, `openpyxl`, `rapidfuzz`, `phonenumbers`, `typer`, `loguru`, `charset-normalizer`, `streamlit`. Optional: `ftfy` for mojibake repair.
## License
Proprietary.