New docs/REQUIREMENTS.md catalogs every shipped capability in 17 numbered categories — file handling, input/output encodings, delimiters, line endings, detectors, finding schema, confidence tiers, decisions, performance targets (1 GB), tools, gate behavior, interfaces, platforms, deps, test coverage, privacy. Linked from README and USER-GUIDE so a buyer / integrator can scan compliance in under a minute. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
176 lines
7.5 KiB
Markdown
# DataTools

A bundle of Python data-cleaning tools for CSV and Excel files. Two tools ship today; more are in development.

| # | Tool | What it does |
|---|---|---|
| 01 | **Deduplicator** | Find and remove duplicate rows with exact + fuzzy matching, smart normalization, and interactive review. |
| 02 | **Text Cleaner** | Trim whitespace, fold smart quotes, strip invisible / control characters, normalize Unicode, normalize line endings, optional case conversion. |

## Deduplicator

## Features

- **Zero-config start**: auto-detects encoding, delimiters, headers, and match columns
- **Fuzzy matching**: Jaro-Winkler, Levenshtein, and token set ratio algorithms
- **5 built-in normalizers**: email (Gmail dot/plus), phone (E.164), name (titles/suffixes), address (USPS), string (whitespace/case)
- **Merge mode**: fill missing fields in the surviving row from removed duplicates
- **4 survivor rules**: keep first, last, most complete, or most recent row per group
- **Interactive review**: inspect match groups with inline checkboxes and column dropdowns, cherry-pick values, preview surviving rows live
- **Config profiles**: save and reload your settings as JSON for repeatable runs
- **Dual interface**: full CLI for automation, Streamlit GUI for visual review
- **Dry-run by default**: preview what would change before writing anything
- **Audit trail**: every run produces a match groups report and timestamped log

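To make the normalizer idea concrete, here is a minimal sketch of what the email and string normalizers might do before rows are compared. The function names and exact rules are assumptions for illustration, not the bundle's actual API; the Gmail handling relies only on the fact that Gmail ignores dots and `+tag` suffixes in the local part.

```python
GMAIL_DOMAINS = {"gmail.com", "googlemail.com"}


def normalize_string(value: str) -> str:
    """Collapse runs of whitespace and casefold (hypothetical helper)."""
    return " ".join(value.split()).casefold()


def normalize_email(value: str) -> str:
    """Build a comparison key for an email address (hypothetical helper)."""
    cleaned = value.strip().lower()
    local, sep, domain = cleaned.rpartition("@")
    if not sep or not local or not domain:
        return cleaned  # not a well-formed address; compare as-is
    if domain in GMAIL_DOMAINS:
        # Gmail ignores dots and anything after "+" in the local part.
        local = local.split("+", 1)[0].replace(".", "")
        domain = "gmail.com"
    return f"{local}@{domain}"


print(normalize_email("J.Doe+news@Gmail.com"))
print(normalize_string("  ACME   Corp "))
```

Two rows whose normalized keys are equal can then be treated as exact matches even though the raw cells differ.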
## Quick Start

### Install

```bash
pip install -r requirements.txt
```

### CLI

```bash
# Preview duplicates (dry run, no files written)
python -m src.cli customers.csv

# Remove duplicates and save the result
python -m src.cli customers.csv --apply

# Fuzzy-match names at 80% similarity, merge missing fields
python -m src.cli customers.csv --fuzzy name --threshold 80 --merge --apply

# Interactively review each match group
python -m src.cli customers.csv --review --apply
```

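The `--threshold` value is a similarity score on a 0-100 scale. The sketch below illustrates the idea with the standard library's `difflib.SequenceMatcher` as a stand-in; it is not Jaro-Winkler, Levenshtein, or token set ratio, and the function names are hypothetical, so the scores the tool reports will differ.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Similarity on a 0-100 scale (stdlib stand-in, not the tool's algorithm)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_fuzzy_match(a: str, b: str, threshold: float = 85.0) -> bool:
    """Treat two values as duplicates when their score reaches the threshold."""
    return similarity(a, b) >= threshold


pairs = [("Jon Smith", "John Smith"), ("Alice", "Bob")]
for left, right in pairs:
    score = similarity(left, right)
    print(f"{left!r} vs {right!r}: {score:.1f} match={is_fuzzy_match(left, right, 80)}")
```

Lowering the threshold catches more typos at the cost of more false positives, which is where the interactive review step helps.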
### GUI

```bash
streamlit run src/gui/app.py
```

Upload a file, click **Find Duplicates**, review match groups side-by-side, then download the cleaned result.

## CLI Usage Summary

```
python -m src.cli INPUT_FILE [OPTIONS]

Options:
  --apply                 Write output files (default: preview only)
  --output, -o PATH       Output file path
  --subset, -s COLS       Columns to match on (comma-separated)
  --key, -k COLS          Strong-key columns for exact matching
  --fuzzy COLS            Columns to fuzzy-match
  --algorithm, -a ALG     levenshtein | jaro_winkler | token_set_ratio
  --threshold, -t N       Similarity threshold 0-100 (default: 85)
  --normalize COL:TYPE    Per-column normalizers (e.g., email:email,phone:phone)
  --survivor RULE         first | last | most-complete | most-recent
  --merge                 Fill missing fields from removed duplicates
  --review                Interactively review each match group
  --config PATH           Load settings from a JSON config file
  --save-config PATH      Save current settings to JSON
  --sheet NAME            Excel sheet name or 0-based index
  --encoding ENC          Override auto-detected encoding
  --header-row N          0-based header row index
  --help                  Show full help
```

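As a rough illustration of how `--survivor most-complete` and `--merge` could interact, the sketch below picks the row with the most filled-in fields and then fills its gaps from the other rows in the group. The helper names, the tie-breaking rule (earliest row wins), and the definition of "missing" (None or blank) are assumptions, not the tool's documented behavior.

```python
from typing import Optional

Row = dict[str, Optional[str]]


def is_missing(value: Optional[str]) -> bool:
    """A field counts as missing when it is None or blank (assumed rule)."""
    return value is None or value.strip() == ""


def pick_most_complete(group: list[Row]) -> int:
    """Index of the row with the most filled fields; ties go to the earliest row."""
    counts = [sum(not is_missing(v) for v in row.values()) for row in group]
    return counts.index(max(counts))


def merge_group(group: list[Row]) -> Row:
    """Return the survivor with missing fields filled from the removed rows."""
    survivor_idx = pick_most_complete(group)
    survivor = dict(group[survivor_idx])
    for i, row in enumerate(group):
        if i == survivor_idx:
            continue
        for column, value in row.items():
            if is_missing(survivor.get(column)) and not is_missing(value):
                survivor[column] = value
    return survivor


group = [
    {"name": "Ann Lee", "email": "", "phone": "555-0100"},
    {"name": "Ann Lee", "email": "ann@example.com", "phone": ""},
]
print(merge_group(group))
```

Only empty fields in the survivor are touched, so a merge never overwrites a value the surviving row already has.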
## Sample Output

```
$ python -m src.cli samples/messy_sales.csv

Reading messy_sales.csv...
  50 rows, 8 columns
Finding duplicates...

──────────────────────────────────────────────────
  File:     messy_sales.csv
  Rows in:  50
  Rows out: 28
  Removed:  22
  Groups:   22
──────────────────────────────────────────────────

Match groups:
  Group 1: rows [1, 2] → keep row 1 (confidence: 100.0%, matched on: email)
  Group 2: rows [3, 4] → keep row 3 (confidence: 92.3%, matched on: name, phone)
  ...

This was a preview. Add --apply to write the output files.
```

## Output Files

When `--apply` is used, three files are produced:

| File | Contents |
|------|----------|
| `{input}_deduplicated.csv` | Cleaned data with duplicates removed |
| `{input}_removed.csv` | Rows that were removed |
| `{input}_match_groups.csv` | Audit trail: group ID, confidence, matched columns, survivor flag |

## Text Cleaner

Character-level hygiene for messy CSV / Excel input. Solves the dirty-data failure modes that silently break VLOOKUPs, dedup runs, and downstream imports:

- Trailing / leading whitespace and tabs in cells
- Non-breaking spaces (`U+00A0`) hiding inside text where regular spaces should be
- Smart quotes pasted from Word (`“` `”` `‘` `’` → `"` `"` `'` `'`)
- Em / en dashes, ellipsis, other typographic Unicode
- Zero-width and bidi-mark characters (`U+200B`, `U+200C`, `U+200D`, etc.)
- BOMs from Excel "Save As CSV UTF-8"
- Mixed line endings (`\r\n`, bare `\r`) inside multi-line cells
- Control characters (`U+0000`-`U+001F` minus `\t \n \r`)
- Optional Unicode NFC / NFKC normalization
- Optional per-column case conversion (UPPER / lower / smart Title / Sentence)

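Several of the fixes listed above can be sketched with the standard library alone. The function below is an illustrative approximation, not the code the Text Cleaner ships: `clean_cell` is a hypothetical name, and the choice to delete invisible characters (rather than replace them) and to apply the steps in this order is an assumption.

```python
import re
import unicodedata

SMART_PUNCTUATION = str.maketrans({
    "\u201c": '"', "\u201d": '"',  # curly double quotes
    "\u2018": "'", "\u2019": "'",  # curly single quotes
    "\u00a0": " ",                 # non-breaking space
})
# Zero-width characters (U+200B-U+200D), bidi marks (U+200E, U+200F), and the BOM.
INVISIBLE = re.compile(r"[\u200b-\u200f\ufeff]")
# C0 control characters except tab, newline, and carriage return.
CONTROL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def clean_cell(text: str) -> str:
    """Apply a handful of lossless character-level fixes to one cell."""
    text = unicodedata.normalize("NFC", text)
    text = text.translate(SMART_PUNCTUATION)
    text = INVISIBLE.sub("", text)
    text = CONTROL.sub("", text)
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # unify line endings
    return text.strip(" \t")


print(repr(clean_cell("\ufeff  \u201cAcme\u00a0Corp\u201d\u200b \t")))
```

Trimming runs last so that whitespace exposed by removing a BOM or zero-width character is also stripped.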
```bash
# Preview what would change (dry-run)
python -m src.cli_text_clean samples/messy_text.csv

# Apply the safe defaults
python -m src.cli_text_clean samples/messy_text.csv --apply

# Title-case the name column, upper-case the SKU column
python -m src.cli_text_clean products.csv --case title:name,upper:sku --apply

# Just trim and collapse, nothing fancy
python -m src.cli_text_clean messy.csv --preset minimal --apply
```

Three presets: `minimal` (trim + collapse only), `excel-hygiene` (default; everything safe ON), `paranoid` (adds lossy NFKC fold).

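The reason NFKC is described as lossy can be seen directly with Python's `unicodedata` module: NFC only recomposes canonically equivalent sequences, while NFKC also replaces compatibility characters such as ligatures, superscripts, and fullwidth forms with plain equivalents that cannot be recovered afterwards. The sample strings below are illustrative and not taken from the tool's test data.

```python
import unicodedata

samples = {
    "ligature": "\ufb01le",       # "file" spelled with the U+FB01 fi ligature
    "superscript": "m\u00b2",     # "m" followed by superscript two
    "fullwidth": "\uff21\uff11",  # fullwidth "A" and "1"
    "decomposed": "e\u0301",      # "e" + combining acute accent
}

for label, text in samples.items():
    nfc = unicodedata.normalize("NFC", text)
    nfkc = unicodedata.normalize("NFKC", text)
    print(f"{label:12} NFC={nfc!r:10} NFKC={nfkc!r:10} differs={nfc != nfkc}")
```

Only the decomposed accent is handled identically by both forms; the other three keep their original characters under NFC and lose them under NFKC.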
Outputs `{input}_cleaned.csv` plus a per-cell `{input}_changes.csv` audit (row, column, old, new, ops applied).

See [docs/CLI-REFERENCE.md](docs/CLI-REFERENCE.md#text-cleaner-cli) for every flag.

## Review & Normalize gate

Every uploaded file passes through a CSV-normalization gate before any tool page sees it. The analyzer scans for ~15 issue types (whitespace pollution, NBSP / zero-width chars, mixed line endings, BOM artifacts, encoding misdetections, smart punctuation, dirty headers, null sentinels, mojibake, and more) and tags each finding by **confidence** (high / medium / low) and **fix action** (the algorithm in `src/core/fixes.py` that resolves it).

In the GUI, the **Review & Normalize** page renders one expandable card per finding with a decision control (Auto-fix / Skip / Customize), a live before-and-after preview, an encoding-override picker for misdetected codepages, and an Advanced output options block (encoding, delimiter, line terminator) for the download. Tool pages refuse to load until the gate passes.

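One way to picture the gate's data model is sketched below. Every class, field, and function name here is hypothetical, and the pass rule (every finding has a decision) is an assumption; the real finding schema is documented in docs/TECHNICAL.md.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Confidence(Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


class Decision(Enum):
    AUTO_FIX = "auto-fix"
    SKIP = "skip"
    CUSTOMIZE = "customize"


@dataclass
class Finding:
    issue: str                           # e.g. "nbsp" or "mixed_line_endings" (made-up labels)
    confidence: Confidence
    fix_action: str                      # name of the fix that would resolve the issue
    decision: Optional[Decision] = None  # None until the user decides


def gate_passes(findings: list[Finding]) -> bool:
    """Assumed rule: the gate passes once every finding has a decision."""
    return all(f.decision is not None for f in findings)


findings = [
    Finding("nbsp", Confidence.HIGH, "replace_nbsp"),
    Finding("mixed_line_endings", Confidence.MEDIUM, "normalize_newlines"),
]
print(gate_passes(findings))
```

Under this model, a tool page would check `gate_passes` before loading and send the user back to the review page while any finding is still undecided.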
See [docs/USER-GUIDE.md §3.3](docs/USER-GUIDE.md) for the user-facing walkthrough and [docs/TECHNICAL.md §10.2.1–10.2.4](docs/TECHNICAL.md) for the developer-facing API.

## Documentation

- [Requirements](docs/REQUIREMENTS.md): short-form numbered list covering file size, codepages, delimiters, detectors, performance targets
- [User Guide](docs/USER-GUIDE.md): installation, GUI workflow, the Review & Normalize gate
- [CLI Reference](docs/CLI-REFERENCE.md): every flag with examples and recipe sections
- [Technical](docs/TECHNICAL.md): architecture, gate internals, finding schema, fix registry
- [Developer Guide](docs/DEVELOPER.md): extending the bundle, adding fixes / detectors

## Requirements

- Python 3.10+
- Dependencies: pandas, openpyxl, rapidfuzz, typer, phonenumbers, loguru, tqdm, charset-normalizer

## License

Proprietary. All rights reserved.