
# DataTools Deduplicator

Find and remove duplicate rows in CSV, delimited text, and Excel files — with fuzzy matching, smart normalization, and interactive review.

## Features

- **Zero-config start** — auto-detects encoding, delimiters, headers, and match columns
- **Fuzzy matching** — Jaro-Winkler, Levenshtein, and token set ratio algorithms
- **5 built-in normalizers** — email (Gmail dot/plus), phone (E.164), name (titles/suffixes), address (USPS), string (whitespace/case)
- **Merge mode** — fill missing fields in the surviving row from removed duplicates
- **4 survivor rules** — keep first, last, most complete, or most recent row per group
- **Interactive review** — inspect match groups with inline checkboxes and column dropdowns, cherry-pick values, preview surviving rows live
- **Config profiles** — save and reload your settings as JSON for repeatable runs
- **Dual interface** — full CLI for automation, Streamlit GUI for visual review
- **Dry-run by default** — preview what would change before writing anything
- **Audit trail** — every run produces a match groups report and timestamped log
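
To illustrate what "Gmail dot/plus" normalization refers to, here is a minimal sketch. It is not the tool's actual code. The function name `normalize_email` is hypothetical, and the sketch assumes that dots and `+tag` suffixes are dropped only for `gmail.com` addresses, where Gmail itself ignores them.

```python
def normalize_email(value: str) -> str:
    """Hypothetical sketch: canonicalize an email address before matching."""
    value = value.strip().lower()
    local, sep, domain = value.rpartition("@")
    if not sep or not local or not domain:
        return value  # not a well-formed address; compare it as-is
    if domain == "gmail.com":
        local = local.split("+", 1)[0]  # drop any "+tag" suffix
        local = local.replace(".", "")  # Gmail ignores dots in the local part
    return f"{local}@{domain}"
```

Under these assumptions, `John.Doe+news@Gmail.com` and `johndoe@gmail.com` normalize to the same key, while addresses on other domains are only trimmed and lowercased.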

## Quick Start

### Install

```bash
pip install -r requirements.txt
```

### CLI

```bash
# Preview duplicates (dry run — no files written)
python -m src.cli customers.csv

# Remove duplicates and save the result
python -m src.cli customers.csv --apply

# Fuzzy-match names at 80% similarity, merge missing fields
python -m src.cli customers.csv --fuzzy name --threshold 80 --merge --apply

# Interactively review each match group
python -m src.cli customers.csv --review --apply
```
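
To give a feel for what a similarity threshold means, the sketch below computes a Levenshtein-based score on a 0-100 scale in pure Python. It is an illustration only: the tool lists rapidfuzz as a dependency, and its exact scoring may differ. The helper names (`levenshtein`, `similarity`, `is_match`) and the choice to normalize by the longer string's length are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance using the classic two-row dynamic programming table."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Score from 0 to 100, normalized by the longer string (an assumption)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


def is_match(a: str, b: str, threshold: float = 85) -> bool:
    """True when the pair scores at or above the threshold."""
    return similarity(a, b) >= threshold
```

For example, "Jon Smith" and "John Smith" differ by one edit over ten characters, so this definition scores them at 90 and they would match at `--threshold 80`.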

### GUI

```bash
streamlit run src/gui/app.py
```

Upload a file, click **Find Duplicates**, review match groups side by side, then download the cleaned result.

## CLI Usage Summary

```
python -m src.cli INPUT_FILE [OPTIONS]

Options:
  --apply                  Write output files (default: preview only)
  --output, -o PATH        Output file path
  --subset, -s COLS        Columns to match on (comma-separated)
  --key, -k COLS           Strong-key columns for exact matching
  --fuzzy COLS             Columns to fuzzy-match
  --algorithm, -a ALG      levenshtein | jaro_winkler | token_set_ratio
  --threshold, -t N        Similarity threshold 0-100 (default: 85)
  --normalize COL:TYPE     Per-column normalizers (e.g., email:email,phone:phone)
  --survivor RULE          first | last | most-complete | most-recent
  --merge                  Fill missing fields from removed duplicates
  --review                 Interactively review each match group
  --config PATH            Load settings from a JSON config file
  --save-config PATH       Save current settings to JSON
  --sheet NAME             Excel sheet name or 0-based index
  --encoding ENC           Override auto-detected encoding
  --header-row N           0-based header row index
  --help                   Show full help
```
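
The interaction between `--survivor most-complete` and `--merge` can be sketched as follows. This is a hedged illustration, not the tool's implementation: the names `Row`, `is_missing`, `pick_most_complete`, and `merge_group` are hypothetical, and the sketch assumes that empty or whitespace-only strings count as missing and that ties go to the earliest row.

```python
from typing import Optional

Row = dict[str, Optional[str]]


def is_missing(value: Optional[str]) -> bool:
    """Treat None and blank strings as missing (an assumption)."""
    return value is None or value.strip() == ""


def pick_most_complete(group: list[Row]) -> int:
    """Index of the row with the most filled fields; ties go to the earliest row."""
    counts = [sum(not is_missing(v) for v in row.values()) for row in group]
    return counts.index(max(counts))


def merge_group(group: list[Row]) -> Row:
    """Copy the survivor, then fill its gaps from the other rows in order."""
    survivor = dict(group[pick_most_complete(group)])
    for row in group:
        for column, value in row.items():
            if is_missing(survivor.get(column)) and not is_missing(value):
                survivor[column] = value
    return survivor
```

With two rows for the same customer, one missing an email and the other missing a phone number, `merge_group` returns a single row carrying both values.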

## Sample Output

```
$ python -m src.cli samples/messy_sales.csv

Reading messy_sales.csv...
  50 rows, 8 columns
Finding duplicates...

──────────────────────────────────────────────────
  File:      messy_sales.csv
  Rows in:   50
  Rows out:  28
  Removed:   22
  Groups:    22
──────────────────────────────────────────────────

Match groups:
  Group 1: rows [1, 2] → keep row 1 (confidence: 100.0%, matched on: email)
  Group 2: rows [3, 4] → keep row 3 (confidence: 92.3%, matched on: name, phone)
  ...

This was a preview. Add --apply to write the output files.
```

## Output Files

When `--apply` is used, three files are produced:

| File | Contents |
|------|----------|
| `{input}_deduplicated.csv` | Cleaned data with duplicates removed |
| `{input}_removed.csv` | Rows that were removed |
| `{input}_match_groups.csv` | Audit trail: group ID, confidence, matched columns, survivor flag |
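
For scripts that post-process a run, the naming pattern in the table can be sketched in Python. The helper `expected_outputs` is hypothetical, and the sketch assumes that `{input}` is the input file name without its extension and that the outputs are written next to the input file. Neither detail is confirmed here, so check the CLI Reference before relying on it.

```python
from pathlib import Path


def expected_outputs(input_file: str) -> dict[str, Path]:
    """Guess the three output paths for a given input file (assumed layout)."""
    source = Path(input_file)
    return {
        kind: source.with_name(f"{source.stem}_{kind}.csv")
        for kind in ("deduplicated", "removed", "match_groups")
    }
```

Under these assumptions, `expected_outputs("data/customers.csv")["removed"]` is `data/customers_removed.csv`.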

## Documentation

- [CLI Reference](docs/CLI-REFERENCE.md) — every flag with examples and recipe sections
- [Developer Guide](docs/DEVELOPER.md) — architecture, data flow, how to extend

## Requirements

- Python 3.10+
- Dependencies: pandas, openpyxl, rapidfuzz, typer, phonenumbers, loguru, tqdm, charset-normalizer, streamlit (for the GUI)

## License

Proprietary. All rights reserved.