feat: add documentation, Streamlit GUI, and full source tree
- Rewrite README.md with project overview, quick-start, and CLI summary
- Add docs/CLI-REFERENCE.md with full flag reference and 8 recipe sections
- Add docs/DEVELOPER.md with architecture, data flow, and extension guides
- Rewrite src/core/__init__.py with public API exports and module docstring
- Add Streamlit GUI (src/gui/) with file upload, advanced options, interactive match group review with side-by-side diff, and download buttons
- Add .gitignore, requirements.txt, all source code, tests, and sample data
- Add streamlit to requirements.txt

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
124
README.md
@@ -1,3 +1,123 @@
# datatools-dev
# DataTools Deduplicator

Data tools development
Find and remove duplicate rows in CSV and Excel files — with fuzzy matching, smart normalization, and interactive review.

## Features

- **Zero-config start** — auto-detects encoding, delimiters, headers, and match columns
- **Fuzzy matching** — Jaro-Winkler, Levenshtein, and token set ratio algorithms
- **5 built-in normalizers** — email (Gmail dot/plus), phone (E.164), name (titles/suffixes), address (USPS), string (whitespace/case)
- **Merge mode** — fill missing fields in the surviving row from removed duplicates
- **4 survivor rules** — keep first, last, most complete, or most recent row per group
- **Interactive review** — inspect each match group and decide: merge, keep both, or skip
- **Config profiles** — save and reload your settings as JSON for repeatable runs
- **Dual interface** — full CLI for automation, Streamlit GUI for visual review
- **Dry-run by default** — preview what would change before writing anything
- **Audit trail** — every run produces a match groups report and timestamped log
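
To illustrate the "Gmail dot/plus" normalizer listed above, here is a minimal sketch of what such a step might look like. The function name `normalize_email` and its exact rules are assumptions for illustration, not the project's actual code: it lowercases the address and, for `gmail.com` only, drops dots and any `+tag` suffix from the local part.

```python
def normalize_email(raw: str) -> str:
    """Hypothetical email normalizer (illustrative, not the project's API)."""
    address = raw.strip().lower()
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        # Not a well-formed address: return the cleaned string unchanged.
        return address
    if domain == "gmail.com":
        # Gmail ignores dots and anything after "+" in the local part.
        local = local.split("+", 1)[0].replace(".", "")
    return f"{local}@{domain}"


print(normalize_email(" Jane.Doe+news@Gmail.com "))
print(normalize_email("j.doe@example.com"))
```

With a rule like this, two rows that differ only by a Gmail dot or plus tag would compare equal on the email column, while addresses on other domains keep their dots.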

## Quick Start

### Install

```bash
pip install -r requirements.txt
```

### CLI

```bash
# Preview duplicates (dry run — no files written)
python -m src.cli customers.csv

# Remove duplicates and save the result
python -m src.cli customers.csv --apply

# Fuzzy-match names at 80% similarity, merge missing fields
python -m src.cli customers.csv --fuzzy name --threshold 80 --merge --apply

# Interactively review each match group
python -m src.cli customers.csv --review --apply
```
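
To make the `--threshold 80` idea concrete, the sketch below scores two strings on a 0-100 scale using a plain Levenshtein distance and compares the score to a threshold. This is an assumption-laden illustration: the project lists rapidfuzz as a dependency, and its scoring may be normalized differently. The helper names `levenshtein`, `similarity`, and `is_match` are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute), row-by-row DP."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical 0-100 score: 100 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


def is_match(a: str, b: str, threshold: float = 80.0) -> bool:
    """Case-insensitive comparison against the threshold."""
    return similarity(a.lower(), b.lower()) >= threshold


print(is_match("Jon Smith", "John Smith"))
print(is_match("Jon Smith", "Jane Doe"))
```

Lowering the threshold admits looser matches, which is why previewing without `--apply` first is a sensible habit.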

### GUI

```bash
streamlit run src/gui/app.py
```

Upload a file, click **Find Duplicates**, review match groups side-by-side, then download the cleaned result.

## CLI Usage Summary

```
python -m src.cli INPUT_FILE [OPTIONS]

Options:
  --apply                 Write output files (default: preview only)
  --output, -o PATH       Output file path
  --subset, -s COLS       Columns to match on (comma-separated)
  --key, -k COLS          Strong-key columns for exact matching
  --fuzzy COLS            Columns to fuzzy-match
  --algorithm, -a ALG     levenshtein | jaro_winkler | token_set_ratio
  --threshold, -t N       Similarity threshold 0-100 (default: 85)
  --normalize COL:TYPE    Per-column normalizers (e.g., email:email,phone:phone)
  --survivor RULE         first | last | most-complete | most-recent
  --merge                 Fill missing fields from removed duplicates
  --review                Interactively review each match group
  --config PATH           Load settings from a JSON config file
  --save-config PATH      Save current settings to JSON
  --sheet NAME            Excel sheet name or 0-based index
  --encoding ENC          Override auto-detected encoding
  --header-row N          0-based header row index
  --help                  Show full help
```
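
The interaction between `--survivor most-complete` and `--merge` can be sketched as follows. This is a hedged illustration on plain dictionaries, not the project's implementation: the helpers `is_missing`, `pick_most_complete`, and `merge_group` are hypothetical, and the tie-breaking rule (earliest row wins) is an assumption.

```python
from typing import Optional

Row = dict[str, Optional[str]]


def is_missing(value: Optional[str]) -> bool:
    """Treat None and blank strings as missing."""
    return value is None or value.strip() == ""


def pick_most_complete(group: list[Row]) -> int:
    """Index of the row with the most filled fields; ties go to the earliest."""
    counts = [sum(not is_missing(v) for v in row.values()) for row in group]
    return counts.index(max(counts))


def merge_group(group: list[Row]) -> Row:
    """Keep the most complete row and fill its gaps from the other rows."""
    keep = pick_most_complete(group)
    survivor = dict(group[keep])
    for i, row in enumerate(group):
        if i == keep:
            continue
        for column, value in row.items():
            if is_missing(survivor.get(column)) and not is_missing(value):
                survivor[column] = value
    return survivor


group = [
    {"name": "Ann Lee", "email": "", "phone": "555-0100"},
    {"name": "Ann Lee", "email": "ann@example.com", "phone": None},
]
print(merge_group(group))
```

Merging only fills fields that are empty in the survivor, so values already present in the kept row are never overwritten in this sketch.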

## Sample Output

```
$ python -m src.cli samples/messy_sales.csv

Reading messy_sales.csv...
50 rows, 8 columns
Finding duplicates...

──────────────────────────────────────────────────
File: messy_sales.csv
Rows in: 50
Rows out: 28
Removed: 22
Groups: 22
──────────────────────────────────────────────────

Match groups:
Group 1: rows [1, 2] → keep row 1 (confidence: 100.0%, matched on: email)
Group 2: rows [3, 4] → keep row 3 (confidence: 92.3%, matched on: name, phone)
...

This was a preview. Add --apply to write the output files.
```

## Output Files

When `--apply` is used, three files are produced:

| File | Contents |
|------|----------|
| `{input}_deduplicated.csv` | Cleaned data with duplicates removed |
| `{input}_removed.csv` | Rows that were removed |
| `{input}_match_groups.csv` | Audit trail: group ID, confidence, matched columns, survivor flag |
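
As a rough sketch of the naming pattern in the table, the snippet below derives the three output paths from an input path. It assumes that `{input}` means the input file name without its extension and that outputs are written next to the input; neither is confirmed here, and `output_paths` is a hypothetical helper.

```python
from pathlib import Path


def output_paths(input_file: str) -> dict[str, Path]:
    """Hypothetical mapping from an input file to its three output files."""
    source = Path(input_file)
    suffixes = ("deduplicated", "removed", "match_groups")
    # Assumption: outputs sit beside the input and always use the .csv extension.
    return {s: source.with_name(f"{source.stem}_{s}.csv") for s in suffixes}


for name, path in output_paths("data/customers.csv").items():
    print(name, path)
```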

## Documentation

- [CLI Reference](docs/CLI-REFERENCE.md) — every flag with examples and recipe sections
- [Developer Guide](docs/DEVELOPER.md) — architecture, data flow, how to extend
- [User Guide](docs/USER-GUIDE.md) — installation and usage for end users

## Requirements

- Python 3.10+
- Dependencies: pandas, openpyxl, rapidfuzz, typer, phonenumbers, loguru, tqdm, charset-normalizer, streamlit (for the GUI)

## License

Proprietary. All rights reserved.