docs: tight, scannable rewrite — every item earns its place

Refactors all 10 docs (README, USER-GUIDE, CLI-REFERENCE, REQUIREMENTS, TECHNICAL, DEVELOPER, BUSINESS, DECISIONS, RECOVERY, docs/README) from prose-heavy to bullet- and table-heavy. Same information, significantly less reading load.

Net: 2600 → 1652 lines (~36% reduction) while adding the new content that landed since v1.6:

- Format Standardizer (3rd Ready tool)
- 199-row buyer corpus
- src/core/errors.py structured hierarchy + ensure_dataframe / ensure_choice / wrap_file_read|write / format_for_user helpers
- src/core/_constants.py shared USPS/state lookup tables
- Cross-tool audit fixes (NaN matching, removed_df schema, validation, enum-bounds checks, forward-compat config)
- Per-domain error_policy across format standardizers
- Inconsistent-date-format detector
- Excel header-row auto-detection + write_file delimiter param

Per-doc changes:

- README.md (175 → 71): 9-tool table at top, status column, 3 CLI entry points listed, dropped repeated marketing prose.
- docs/README.md (38 → 27): pure index with a buyer-facing vs creator-only split + version footer.
- USER-GUIDE.md (208 → 118): tool table replaces script descriptions, troubleshooting compressed to bullets, gate explanation tightened.
- CLI-REFERENCE.md (451 → 235): collapsed flag tables, removed redundant intro text, kept full recipes section.
- REQUIREMENTS.md (146 → 129): 18 numbered sections (was 17), added §18 Error Handling, formatting tightened to single-line entries.
- TECHNICAL.md (570 → 350): collapsed §3 build pipeline tables, merged redundant §3.5-3.7 OS sections, added §7 (Error handling) + §11.3 (Format Standardizer spec) + §11.4-11.7 (analyzer / gate / Review page / repair_bytes promoted from §10.2.x sub-numbering).
- DEVELOPER.md (285 → 161): module map table replaces per-file prose, extension recipes condensed, new §Errors covers when to use each hierarchy class.
- BUSINESS.md (278 → 225): collapsed prose to tables (use cases, competitive landscape, costs, risks); honest-status updated.
- DECISIONS.md (269 → 189): scoring rubric + GUI matrix preserved, decision log compressed to single-line entries, added v1.6 entries (Format Standardizer Ready, errors module).
- RECOVERY.md (180 → 147): rebuild steps as numbered + tabular, external dependencies as one table, recovery priorities tightened.

No information removed; redundancy compressed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 02:49:29 +00:00
parent 26b9771625
commit abb720997e
10 changed files with 1105 additions and 2053 deletions
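
The per-doc line counts quoted in the commit message can be cross-checked against the diffstat above. The following is a minimal sketch that uses only the numbers stated in the message; `DOC_LINES` and `summarize` are illustrative names, not part of the repository.

```python
# Before/after line counts per doc, copied from the commit message.
DOC_LINES = {
    "README.md": (175, 71),
    "docs/README.md": (38, 27),
    "USER-GUIDE.md": (208, 118),
    "CLI-REFERENCE.md": (451, 235),
    "REQUIREMENTS.md": (146, 129),
    "TECHNICAL.md": (570, 350),
    "DEVELOPER.md": (285, 161),
    "BUSINESS.md": (278, 225),
    "DECISIONS.md": (269, 189),
    "RECOVERY.md": (180, 147),
}


def summarize(doc_lines):
    """Return (total before, total after, net lines dropped, percent reduction)."""
    before = sum(b for b, _ in doc_lines.values())
    after = sum(a for _, a in doc_lines.values())
    net = before - after
    return before, after, net, 100 * net / before


before, after, net, pct = summarize(DOC_LINES)
print(f"{before} -> {after} lines, net -{net} ({pct:.1f}% reduction)")
# prints: 2600 -> 1652 lines, net -948 (36.5% reduction)

# Diffstat cross-check: deletions minus additions should equal the net drop.
assert 2053 - 1105 == net
```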
--- a/README.md
+++ b/README.md
@@ -1,175 +1,71 @@
 # DataTools

-A bundle of Python data-cleaning tools for CSV and Excel files. Two scripts ship today; more are in build.
+Local CSV / Excel cleaning. CLI + browser GUI, no cloud, no install ceremony.

-| # | Tool | What it does |
-|---|---|---|
-| 01 | **Deduplicator** | Find and remove duplicate rows with exact + fuzzy matching, smart normalization, and interactive review. |
-| 02 | **Text Cleaner** | Trim whitespace, fold smart quotes, strip invisible / control characters, normalize Unicode, normalize line endings, optional case conversion. |
+## Tools

-## Deduplicator
+| # | Tool | Status |
+|---|------|--------|
+| 01 | **Deduplicator** — exact + fuzzy match, 5 normalizers, survivor rules, audit | Ready |
+| 02 | **Text Cleaner** — whitespace, smart chars, BOM, line endings, case ops | Ready |
+| 03 | **Format Standardizer** — dates, phones, emails, addresses, names, currencies, booleans | Ready |
+| 04 | Missing Value Handler | Coming Soon |
+| 05 | Column Mapper | Coming Soon |
+| 06 | Outlier Detector | Coming Soon |
+| 07 | Multi-File Merger | Coming Soon |
+| 08 | Validator & Reporter | Coming Soon |
+| 09 | Pipeline Runner | Coming Soon |

-## Features
-
-- **Zero-config start** — auto-detects encoding, delimiters, headers, and match columns
-- **Fuzzy matching** — Jaro-Winkler, Levenshtein, and token set ratio algorithms
-- **5 built-in normalizers** — email (Gmail dot/plus), phone (E.164), name (titles/suffixes), address (USPS), string (whitespace/case)
-- **Merge mode** — fill missing fields in the surviving row from removed duplicates
-- **4 survivor rules** — keep first, last, most complete, or most recent row per group
-- **Interactive review** — inspect match groups with inline checkboxes and column dropdowns, cherry-pick values, preview surviving rows live
-- **Config profiles** — save and reload your settings as JSON for repeatable runs
-- **Dual interface** — full CLI for automation, Streamlit GUI for visual review
-- **Dry-run by default** — preview what would change before writing anything
-- **Audit trail** — every run produces a match groups report and timestamped log
-
-## Quick Start
-
-### Install
+## Install

 ```bash
 pip install -r requirements.txt
 ```

-### CLI
+Python 3.10+ required.

-```bash
-# Preview duplicates (dry run — no files written)
-python -m src.cli customers.csv
-
-# Remove duplicates and save the result
-python -m src.cli customers.csv --apply
-
-# Fuzzy-match names at 80% similarity, merge missing fields
-python -m src.cli customers.csv --fuzzy name --threshold 80 --merge --apply
-
-# Interactively review each match group
-python -m src.cli customers.csv --review --apply
-```
-
-### GUI
+## Run

+**GUI** (recommended):
 ```bash
 streamlit run src/gui/app.py
 ```

-Upload a file, click **Find Duplicates**, review match groups side-by-side, then download the cleaned result.
-
-## CLI Usage Summary
-
-```
-python -m src.cli INPUT_FILE [OPTIONS]
-
-Options:
-  --apply                  Write output files (default: preview only)
-  --output, -o PATH        Output file path
-  --subset, -s COLS        Columns to match on (comma-separated)
-  --key, -k COLS           Strong-key columns for exact matching
-  --fuzzy COLS             Columns to fuzzy-match
-  --algorithm, -a ALG      levenshtein | jaro_winkler | token_set_ratio
-  --threshold, -t N        Similarity threshold 0-100 (default: 85)
-  --normalize COL:TYPE     Per-column normalizers (e.g., email:email,phone:phone)
-  --survivor RULE          first | last | most-complete | most-recent
-  --merge                  Fill missing fields from removed duplicates
-  --review                 Interactively review each match group
-  --config PATH            Load settings from a JSON config file
-  --save-config PATH       Save current settings to JSON
-  --sheet NAME             Excel sheet name or 0-based index
-  --encoding ENC           Override auto-detected encoding
-  --header-row N           0-based header row index
-  --help                   Show full help
-```
-
-## Sample Output
-
-```
-$ python -m src.cli samples/messy_sales.csv
-
-Reading messy_sales.csv...
-  50 rows, 8 columns
-Finding duplicates...
-
-──────────────────────────────────────────────────
-  File:      messy_sales.csv
-  Rows in:   50
-  Rows out:  28
-  Removed:   22
-  Groups:    22
-──────────────────────────────────────────────────
-
-Match groups:
-  Group 1: rows [1, 2] → keep row 1 (confidence: 100.0%, matched on: email)
-  Group 2: rows [3, 4] → keep row 3 (confidence: 92.3%, matched on: name, phone)
-  ...
-
-This was a preview. Add --apply to write the output files.
-```
-
-## Output Files
-
-When `--apply` is used, three files are produced:
-
-| File | Contents |
-|------|----------|
-| `{input}_deduplicated.csv` | Cleaned data with duplicates removed |
-| `{input}_removed.csv` | Rows that were removed |
-| `{input}_match_groups.csv` | Audit trail: group ID, confidence, matched columns, survivor flag |
-
-## Text Cleaner
-
-Character-level hygiene for messy CSV / Excel input. Solves the dirty-data failure modes that silently break VLOOKUPs, dedup runs, and downstream imports:
-
-- Trailing / leading whitespace and tabs in cells
-- Non-breaking spaces (`U+00A0`) hiding inside text where regular spaces should be
-- Smart quotes pasted from Word (`“` `”` `‘` `’` → `"` `"` `'` `'`)
-- Em / en dashes, ellipsis, other typographic Unicode
-- Zero-width and bidi-mark characters (`U+200B`, `U+200C`, `U+200D`, etc.)
-- BOMs from Excel "Save As CSV UTF-8"
-- Mixed line endings (`\r\n`, bare `\r`) inside multi-line cells
-- Control characters (`U+0000`-`U+001F` minus `\t \n \r`)
-- Optional Unicode NFC / NFKC normalization
-- Optional per-column case conversion (UPPER / lower / smart Title / Sentence)
-
+**CLI** — three entry points:
 ```bash
-# Preview what would change (dry-run)
-python -m src.cli_text_clean samples/messy_text.csv
-
-# Apply the safe defaults
-python -m src.cli_text_clean samples/messy_text.csv --apply
-
-# Title-case the name column, upper-case the SKU column
-python -m src.cli_text_clean products.csv --case title:name,upper:sku --apply
-
-# Just trim and collapse — nothing fancy
-python -m src.cli_text_clean messy.csv --preset minimal --apply
+python -m src.cli            customers.csv [--apply]   # dedup
+python -m src.cli_text_clean messy.csv     [--apply]   # text clean
+python -m src.cli_analyze    any_file.csv  [--json]    # scan only
 ```

-Three presets: `minimal` (trim + collapse only), `excel-hygiene` (default; everything safe ON), `paranoid` (adds lossy NFKC fold).
-
-Outputs `{input}_cleaned.csv` plus a per-cell `{input}_changes.csv` audit (row, column, old, new, ops applied).
-
-See [docs/CLI-REFERENCE.md](docs/CLI-REFERENCE.md#text-cleaner-cli) for every flag.
+Every CLI runs preview-only by default; add `--apply` to write output.

 ## Review & Normalize gate

-Every uploaded file passes through a CSV-normalization gate before any tool page sees it. The analyzer scans for ~15 issue types — whitespace pollution, NBSP / zero-width chars, mixed line endings, BOM artifacts, encoding misdetections, smart punctuation, dirty headers, null sentinels, mojibake, and more — and tags each finding by **confidence** (high / medium / low) and **fix action** (the algorithm in `src/core/fixes.py` that resolves it).
+Every uploaded file passes through a CSV-normalization gate before any tool sees it. The analyzer flags ~15 issue types (whitespace, NBSP / zero-width chars, BOM, encoding, smart punct, dirty headers, null sentinels, mojibake, …) tagged by **confidence** (high / medium / low) and **fix action**. The GUI shows each finding with Auto-fix / Skip / Customize, a live before/after preview, and an encoding-override picker. Tool pages refuse to load until the gate passes.

-In the GUI, the **Review & Normalize** page renders one expandable card per finding with a decision control (Auto-fix / Skip / Customize), a live before-and-after preview, an encoding-override picker for misdetected codepages, and an Advanced output options block (encoding, delimiter, line terminator) for the download. Tool pages refuse to load until the gate passes.
+## Output

-See [docs/USER-GUIDE.md §3.3](docs/USER-GUIDE.md) for the user-facing walkthrough and [docs/TECHNICAL.md §10.2.1–10.2.4](docs/TECHNICAL.md) for the developer-facing API.
+Every run writes:

-## Documentation
+- `{input}_<tool>.csv` — the cleaned data
+- `{input}_changes.csv` (text cleaner) or `{input}_match_groups.csv` (dedup) — audit trail
+- `logs/<tool>_YYYYMMDD_HHMMSS.log` — debug-level run log

-- [Requirements](docs/REQUIREMENTS.md) — short-form numbered list: file size, codepages, delimiters, detectors, performance targets
-- [User Guide](docs/USER-GUIDE.md) — installation, GUI workflow, the Review & Normalize gate
-- [CLI Reference](docs/CLI-REFERENCE.md) — every flag with examples and recipe sections
-- [Technical](docs/TECHNICAL.md) — architecture, gate internals, finding schema, fix registry
-- [Developer Guide](docs/DEVELOPER.md) — extending the bundle, adding fixes / detectors
+Original input file is never modified.

-## Requirements
+## Docs

-- Python 3.10+
-- Dependencies: pandas, openpyxl, rapidfuzz, typer, phonenumbers, loguru, tqdm, charset-normalizer
+- [User Guide](docs/USER-GUIDE.md) — install, GUI workflow, gate
+- [CLI Reference](docs/CLI-REFERENCE.md) — every flag with recipes
+- [Requirements](docs/REQUIREMENTS.md) — file sizes, encodings, detectors, perf targets
+- [Technical](docs/TECHNICAL.md) — architecture, gate internals, fix registry
+- [Developer Guide](docs/DEVELOPER.md) — adding fixes / detectors / standardizers
+
+## Dependencies
+
+`pandas`, `openpyxl`, `rapidfuzz`, `phonenumbers`, `typer`, `loguru`, `charset-normalizer`, `streamlit`. Optional: `ftfy` for mojibake repair.

 ## License

-Proprietary. All rights reserved.
+Proprietary.
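
The preview-by-default behaviour and the output naming described in the new README's Run and Output sections might be sketched as below. This is an illustration under stated assumptions, not the project's implementation: `plan_outputs` and its parameters are hypothetical names, the `"dedup"` tool label is assumed, and writing outputs next to the input file is a guess. The real logic lives in the `src.cli*` modules, which this diff does not show.

```python
from datetime import datetime
from pathlib import Path


def plan_outputs(input_path, tool, audit_suffix, apply=False, now=None):
    """Compute the paths a run would use, following the README's patterns.

    Hypothetical helper: names and directory layout are assumptions.
    """
    src = Path(input_path)
    # Timestamp in the README's YYYYMMDD_HHMMSS log-name format.
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return {
        # Preview-only unless the caller passes apply=True (the --apply flag).
        "write": apply,
        # {input}_<tool>.csv: the cleaned data.
        "cleaned": src.with_name(f"{src.stem}_{tool}.csv"),
        # {input}_changes.csv or {input}_match_groups.csv: the audit trail.
        "audit": src.with_name(f"{src.stem}_{audit_suffix}.csv"),
        # logs/<tool>_YYYYMMDD_HHMMSS.log: the run log.
        "log": Path("logs") / f"{tool}_{stamp}.log",
    }


plan = plan_outputs(
    "data/customers.csv",
    tool="dedup",
    audit_suffix="match_groups",
    now=datetime(2026, 5, 1, 2, 49, 29),
)
for name, value in plan.items():
    print(f"{name}: {value}")
```

Because the input path is only read to derive new names, the original file is never a write target, which matches the README's "Original input file is never modified" guarantee.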