feat: add documentation, Streamlit GUI, and full source tree

- Rewrite README.md with project overview, quick-start, and CLI summary
- Add docs/CLI-REFERENCE.md with full flag reference and 8 recipe sections
- Add docs/DEVELOPER.md with architecture, data flow, and extension guides
- Rewrite src/core/__init__.py with public API exports and module docstring
- Add Streamlit GUI (src/gui/) with file upload, advanced options, interactive match group review with side-by-side diff, and download buttons
- Add .gitignore, requirements.txt, all source code, tests, and sample data
- Add streamlit to requirements.txt

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-28 23:06:39 +00:00
parent 0613dc420c
commit b871ab24fc
47 changed files with 4413 additions and 2 deletions
--- a/README.md
+++ b/README.md
@@ -1,3 +1,123 @@
-# datatools-dev
+# DataTools Deduplicator

-Data tools development
+Find and remove duplicate rows in CSV and Excel files — with fuzzy matching, smart normalization, and interactive review.
+
+## Features
+
+- **Zero-config start** — auto-detects encoding, delimiters, headers, and match columns
+- **Fuzzy matching** — Jaro-Winkler, Levenshtein, and token set ratio algorithms
+- **5 built-in normalizers** — email (Gmail dot/plus), phone (E.164), name (titles/suffixes), address (USPS), string (whitespace/case)
+- **Merge mode** — fill missing fields in the surviving row from removed duplicates
+- **4 survivor rules** — keep first, last, most complete, or most recent row per group
+- **Interactive review** — inspect each match group and decide: merge, keep both, or skip
+- **Config profiles** — save and reload your settings as JSON for repeatable runs
+- **Dual interface** — full CLI for automation, Streamlit GUI for visual review
+- **Dry-run by default** — preview what would change before writing anything
+- **Audit trail** — every run produces a match groups report and timestamped log
+
+## Quick Start
+
+### Install
+
+```bash
+pip install -r requirements.txt
+```
+
+### CLI
+
+```bash
+# Preview duplicates (dry run — no files written)
+python -m src.cli customers.csv
+
+# Remove duplicates and save the result
+python -m src.cli customers.csv --apply
+
+# Fuzzy-match names at 80% similarity, merge missing fields
+python -m src.cli customers.csv --fuzzy name --threshold 80 --merge --apply
+
+# Interactively review each match group
+python -m src.cli customers.csv --review --apply
+```
+
+### GUI
+
+```bash
+streamlit run src/gui/app.py
+```
+
+Upload a file, click **Find Duplicates**, review match groups side-by-side, then download the cleaned result.
+
+## CLI Usage Summary
+
+```
+python -m src.cli INPUT_FILE [OPTIONS]
+
+Options:
+  --apply                  Write output files (default: preview only)
+  --output, -o PATH        Output file path
+  --subset, -s COLS        Columns to match on (comma-separated)
+  --key, -k COLS           Strong-key columns for exact matching
+  --fuzzy COLS             Columns to fuzzy-match
+  --algorithm, -a ALG      levenshtein | jaro_winkler | token_set_ratio
+  --threshold, -t N        Similarity threshold 0-100 (default: 85)
+  --normalize COL:TYPE     Per-column normalizers (e.g., email:email,phone:phone)
+  --survivor RULE          first | last | most-complete | most-recent
+  --merge                  Fill missing fields from removed duplicates
+  --review                 Interactively review each match group
+  --config PATH            Load settings from a JSON config file
+  --save-config PATH       Save current settings to JSON
+  --sheet NAME             Excel sheet name or 0-based index
+  --encoding ENC           Override auto-detected encoding
+  --header-row N           0-based header row index
+  --help                   Show full help
+```
+
+## Sample Output
+
+```
+$ python -m src.cli samples/messy_sales.csv
+
+Reading messy_sales.csv...
+  50 rows, 8 columns
+Finding duplicates...
+
+──────────────────────────────────────────────────
+  File:      messy_sales.csv
+  Rows in:   50
+  Rows out:  28
+  Removed:   22
+  Groups:    22
+──────────────────────────────────────────────────
+
+Match groups:
+  Group 1: rows [1, 2] → keep row 1 (confidence: 100.0%, matched on: email)
+  Group 2: rows [3, 4] → keep row 3 (confidence: 92.3%, matched on: name, phone)
+  ...
+
+This was a preview. Add --apply to write the output files.
+```
+
+## Output Files
+
+With `--apply`, each run writes three files:
+
+| File | Contents |
+|------|----------|
+| `{input}_deduplicated.csv` | Cleaned data with duplicates removed |
+| `{input}_removed.csv` | Rows that were removed |
+| `{input}_match_groups.csv` | Audit trail: group ID, confidence, matched columns, survivor flag |
+
+## Documentation
+
+- [CLI Reference](docs/CLI-REFERENCE.md) — every flag with examples and recipe sections
+- [Developer Guide](docs/DEVELOPER.md) — architecture, data flow, how to extend
+- [User Guide](docs/USER-GUIDE.md) — installation and usage for end users
+
+## Requirements
+
+- Python 3.10+
+- Dependencies: pandas, openpyxl, rapidfuzz, typer, phonenumbers, loguru, tqdm, charset-normalizer, streamlit
+
+## License
+
+Proprietary. All rights reserved.
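
Note on the README's feature list: the Gmail dot/plus normalization, the most-complete survivor rule, and merge mode can be illustrated with a minimal sketch. This is not the code under `src/core`, which this diff does not show. The names `normalize_email`, `is_missing`, and `dedupe` are hypothetical, rows are plain dicts rather than pandas DataFrames, and matching is exact on the normalized email only (no fuzzy matching).

```python
from collections import defaultdict


def normalize_email(raw: str) -> str:
    """Hypothetical normalizer: lowercase, trim, and collapse Gmail dot/plus variants."""
    email = raw.strip().lower()
    local, sep, domain = email.rpartition("@")
    if not sep or not local or not domain:
        return email  # not a well-formed address; leave it unchanged
    if domain in {"gmail.com", "googlemail.com"}:
        # Gmail ignores dots in the local part and anything after a "+".
        local = local.split("+", 1)[0].replace(".", "")
    return f"{local}@{domain}"


def is_missing(value) -> bool:
    """Treat None and blank strings as missing (an assumption for this sketch)."""
    return value is None or str(value).strip() == ""


def dedupe(rows: list[dict], key: str = "email") -> list[dict]:
    """Group rows by normalized email, keep the most complete row, merge the rest into it."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for row in rows:
        groups[normalize_email(row[key])].append(row)

    result = []
    for members in groups.values():
        # "Most complete" survivor: the row with the most non-missing fields.
        # max() returns the first maximal item, so ties fall back to the earliest row.
        survivor = dict(
            max(members, key=lambda r: sum(not is_missing(v) for v in r.values()))
        )
        # Merge mode: fill the survivor's missing fields from the other group members.
        for other in members:
            for col, val in other.items():
                if is_missing(survivor.get(col)) and not is_missing(val):
                    survivor[col] = val
        result.append(survivor)
    return result
```

Calling `dedupe` on a list of row dicts returns one merged row per normalized email, in first-seen order. The survivor keeps its own original email string; only the grouping key is normalized.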