feat: add documentation, Streamlit GUI, and full source tree

- Rewrite README.md with project overview, quick-start, and CLI summary
- Add docs/CLI-REFERENCE.md with full flag reference and 8 recipe sections
- Add docs/DEVELOPER.md with architecture, data flow, and extension guides
- Rewrite src/core/__init__.py with public API exports and module docstring
- Add Streamlit GUI (src/gui/) with file upload, advanced options, interactive match group review with side-by-side diff, and download buttons
- Add .gitignore, requirements.txt, all source code, tests, and sample data
- Add streamlit to requirements.txt

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-28 23:06:39 +00:00
parent 0613dc420c
commit b871ab24fc
47 changed files with 4413 additions and 2 deletions
--- a/docs/CLI-REFERENCE.md
+++ b/docs/CLI-REFERENCE.md
@@ -0,0 +1,284 @@
+# CLI Reference
+
+Complete command-line reference for the DataTools Deduplicator.
+
+```
+python -m src.cli INPUT_FILE [OPTIONS]
+```
+
+## Arguments
+
+| Argument | Required | Description |
+|----------|----------|-------------|
+| `INPUT_FILE` | Yes | Path to the CSV or Excel file to deduplicate |
+
+## Options
+
+### Core
+
+| Flag | Short | Default | Description |
+|------|-------|---------|-------------|
+| `--apply` | | `false` | Write output files. Without this flag, only a preview is shown. |
+| `--output` | `-o` | `{input}_deduplicated.csv` | Output file path. |
+
+### Column Selection
+
+| Flag | Short | Default | Description |
+|------|-------|---------|-------------|
+| `--subset` | `-s` | auto-detect | Comma-separated columns to match on. When omitted, columns are auto-detected by name pattern (email, phone, name, address). |
+| `--key` | `-k` | none | Comma-separated strong-key columns. Each becomes an independent exact-match strategy. Use for identifiers like `fb_id`, `ein`, `sku`. |
+
+### Fuzzy Matching
+
+| Flag | Short | Default | Description |
+|------|-------|---------|-------------|
+| `--fuzzy` | | none | Comma-separated columns to fuzzy-match. Other columns in the strategy use exact matching. |
+| `--algorithm` | `-a` | `jaro_winkler` | Fuzzy algorithm: `levenshtein`, `jaro_winkler`, or `token_set_ratio`. |
+| `--threshold` | `-t` | `85` | Similarity threshold 0-100. Lower values find more matches but increase false positives. |
+
+### Normalization
+
+| Flag | Short | Default | Description |
+|------|-------|---------|-------------|
+| `--normalize` | | auto-detect | Column normalizers as `col:type` pairs, comma-separated. Types: `email`, `phone`, `name`, `address`, `string`. |
+
+**Normalizer details:**
+
+| Type | What it does | Example |
+|------|-------------|---------|
+| `email` | Lowercase, strip Gmail dots, strip `+tag` suffixes | `John.Doe+tag@gmail.com` → `johndoe@gmail.com` |
+| `phone` | Parse to E.164 format; fallback: digits only | `(555) 123-4567` → `+15551234567` |
+| `name` | Strip titles (Dr., Mr.) and suffixes (Jr., PhD), case-fold | `Dr. John Smith Jr.` → `john smith` |
+| `address` | USPS abbreviations (Street→St, Avenue→Ave), case-fold | `123 Main Street, Suite 4` → `123 main st ste 4` |
+| `string` | Trim, collapse whitespace, case-fold | `  HELLO   WORLD  ` → `hello world` |
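To make the table concrete, here is a minimal sketch of the `email` and `string` rows. The function names are hypothetical, and applying dot-stripping only to `gmail.com` addresses is an assumption; the tool's own normalizers may differ.

```python
import re


def normalize_string(value: str) -> str:
    """Trim, collapse internal whitespace, and case-fold."""
    return re.sub(r"\s+", " ", value.strip()).casefold()


def normalize_email(value: str) -> str:
    """Lowercase, drop any +tag suffix, and drop dots for Gmail addresses."""
    local, sep, domain = value.strip().lower().partition("@")
    if not sep:
        return local  # no "@" present; return the cleaned text unchanged
    local = local.split("+", 1)[0]
    if domain == "gmail.com":  # assumption: only gmail.com gets dot-stripping
        local = local.replace(".", "")
    return f"{local}@{domain}"


print(normalize_email("John.Doe+tag@gmail.com"))  # johndoe@gmail.com
print(normalize_string("  HELLO   WORLD  "))      # hello world
```

Normalizing before comparison lets an exact-match strategy treat these variants as equal without resorting to fuzzy matching.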
+
+### Survivor Selection
+
+| Flag | Short | Default | Description |
+|------|-------|---------|-------------|
+| `--survivor` | | `first` | Which row to keep per duplicate group: `first`, `last`, `most-complete`, or `most-recent`. |
+| `--date-column` | | none | Date column for the `most-recent` rule. |
+| `--merge` | | `false` | Fill missing fields in the surviving row from removed duplicates. |
+
+**Survivor rules:**
+
+| Rule | Behavior |
+|------|----------|
+| `first` | Keep the first row encountered (lowest row number) |
+| `last` | Keep the last row encountered (highest row number) |
+| `most-complete` | Keep the row with the fewest blank/empty cells |
+| `most-recent` | Keep the row with the latest date (requires `--date-column`) |
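The four rules can be sketched as a single selection function. This is an illustration only: `pick_survivor` and `is_blank` are hypothetical names, rows are modeled as plain dicts in file order, and breaking ties in favor of the earliest row is an assumption.

```python
from datetime import date


def is_blank(value) -> bool:
    """Treat None and whitespace-only text as blank."""
    return value is None or str(value).strip() == ""


def pick_survivor(rows, rule="first", date_column=None):
    """Return the position (within rows) of the row to keep."""
    if rule == "first":
        return 0
    if rule == "last":
        return len(rows) - 1
    if rule == "most-complete":
        blanks = [sum(is_blank(v) for v in row.values()) for row in rows]
        return blanks.index(min(blanks))  # ties go to the earliest row (assumption)
    if rule == "most-recent":
        if date_column is None:
            raise ValueError("most-recent requires a date column")
        return max(range(len(rows)), key=lambda i: rows[i][date_column])
    raise ValueError(f"unknown survivor rule: {rule}")


group = [
    {"name": "John Smith", "email": "", "updated_at": date(2024, 1, 5)},
    {"name": "John Smith", "email": "john@example.com", "updated_at": date(2023, 6, 1)},
]
print(pick_survivor(group, "most-complete"))               # 1
print(pick_survivor(group, "most-recent", "updated_at"))   # 0
```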
+
+### Interactive Review
+
+| Flag | Short | Default | Description |
+|------|-------|---------|-------------|
+| `--review` | | `false` | Interactively review each match group. For each group, choose: merge (y), keep both (n), or skip remaining (s). |
+
+### Configuration
+
+| Flag | Short | Default | Description |
+|------|-------|---------|-------------|
+| `--config` | | none | Load all settings from a saved JSON config file. |
+| `--save-config` | | none | Save current settings to a JSON config file for reuse. |
+
+### File Handling
+
+| Flag | Short | Default | Description |
+|------|-------|---------|-------------|
+| `--sheet` | | first sheet | Excel sheet name or 0-based index. Ignored for CSV files. |
+| `--encoding` | | auto-detect | Override auto-detected file encoding (e.g., `utf-8`, `windows-1252`). |
+| `--header-row` | | auto-detect | 0-based row index for the header row. |
+
+---
+
+## Recipes
+
+### 1. Basic Dedup (Auto-Detect)
+
+Let the engine detect email, phone, name, and address columns automatically.
+
+```bash
+# Preview
+python -m src.cli customers.csv
+
+# Apply
+python -m src.cli customers.csv --apply
+```
+
+The engine scans column names for patterns like `email`, `phone`, `name`, `address` and builds strategies automatically. Strong keys (email, phone) become standalone strategies; weak keys (name, address) are paired with strong keys.
+
+### 2. Fuzzy Name Matching
+
+Match rows where names are similar but not identical — catches typos, nickname variations, and formatting differences.
+
+```bash
+# Fuzzy-match on the "name" column at 80% similarity
+python -m src.cli customers.csv --fuzzy name --threshold 80 --apply
+
+# Fuzzy-match on multiple columns
+python -m src.cli customers.csv --fuzzy name,address --threshold 85 --apply
+
+# Use Levenshtein distance instead of Jaro-Winkler
+python -m src.cli customers.csv --fuzzy name --algorithm levenshtein --threshold 80 --apply
+```
+
+**Algorithm comparison:**
+- `jaro_winkler` (default): best for short strings like names; weights early characters more heavily
+- `levenshtein`: edit-distance ratio; works well for typos (a swap of two adjacent characters counts as two edits)
+- `token_set_ratio`: best for addresses and long strings; ignores word order
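To show how a 0-100 threshold interacts with an edit-distance score, here is a small pure-Python sketch. It uses one common normalization (edits divided by the longer length); the scoring the tool actually uses for `levenshtein` is not specified here and may differ, and all function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Similarity on a 0-100 scale, normalized by the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (longest - levenshtein(a, b)) / longest


def is_match(a: str, b: str, threshold: float = 85.0) -> bool:
    return similarity(a, b) >= threshold


print(similarity("john smith", "jon smith"))  # 90.0
print(is_match("john smith", "jon smith"))    # True
```

Under this definition, `"jonh"` against `"john"` scores 50.0 because the swapped pair costs two edits in a four-character string, which is one reason short strings are sensitive to the chosen threshold.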
+
+### 3. Custom Strong Keys
+
+Use specific identifier columns to find exact duplicates.
+
+```bash
+# Deduplicate by Facebook ID
+python -m src.cli donors.csv --key fb_id --apply
+
+# Multiple strong keys (each is independent — matched with OR)
+python -m src.cli donors.csv --key fb_id,ein --apply
+```
+
+Strong keys are OR'd: a match on `fb_id` alone OR `ein` alone marks rows as duplicates.
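The OR behavior can be sketched with a union-find structure: rows that share a value in any key column end up in the same group. Two points here are assumptions rather than documented behavior: blank key values never match each other, and groups are transitive (row 1 and row 2 below are linked only through row 0). The function name is hypothetical.

```python
def group_by_any_key(rows, keys):
    """Group row indices that share a non-empty value in any key column."""
    parent = list(range(len(rows)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[max(root_i, root_j)] = min(root_i, root_j)

    for key in keys:
        first_seen = {}
        for index, row in enumerate(rows):
            value = row.get(key)
            if value in (None, ""):
                continue  # assumption: blank keys never match each other
            if value in first_seen:
                union(first_seen[value], index)
            else:
                first_seen[value] = index

    groups = {}
    for index in range(len(rows)):
        groups.setdefault(find(index), []).append(index)
    return [members for members in groups.values() if len(members) > 1]


donors = [
    {"fb_id": "100", "ein": "A"},
    {"fb_id": "100", "ein": ""},
    {"fb_id": "", "ein": "A"},
    {"fb_id": "200", "ein": "B"},
]
print(group_by_any_key(donors, ["fb_id", "ein"]))  # [[0, 1, 2]]
```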
+
+### 4. Merge Mode
+
+Keep the most complete row and fill any remaining blanks from the duplicates.
+
+```bash
+# Most complete row + merge missing fields
+python -m src.cli contacts.csv --survivor most-complete --merge --apply
+
+# Keep most recent row and merge
+python -m src.cli contacts.csv --survivor most-recent --date-column updated_at --merge --apply
+```
+
+**How merge works:** The survivor row keeps all its non-empty fields. For any blank/null fields, the engine fills from the removed rows (in row order). The result is a single row with maximum data retention.
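A minimal sketch of that fill step, with rows modeled as dicts and `merge_group` as a hypothetical name. Treating whitespace-only text as blank is an assumption.

```python
def is_blank(value) -> bool:
    """Treat None and whitespace-only text as blank."""
    return value is None or str(value).strip() == ""


def merge_group(survivor, removed_rows):
    """Fill the survivor's blank fields from removed rows, visited in row order."""
    merged = dict(survivor)
    for column in list(merged):
        if not is_blank(merged[column]):
            continue  # the survivor's own non-empty values always win
        for row in removed_rows:
            candidate = row.get(column)
            if not is_blank(candidate):
                merged[column] = candidate
                break
    return merged


survivor = {"name": "John Smith", "email": "", "phone": "555-123-4567"}
removed = [
    {"name": "Jon Smith", "email": None, "phone": ""},
    {"name": "J. Smith", "email": "john@example.com", "phone": "(555) 123-4567"},
]
print(merge_group(survivor, removed))
```

Only the blank `email` field is filled; `name` and `phone` keep the survivor's values even though the removed rows disagree.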
+
+### 5. Multi-Column Subset
+
+Match on a specific combination of columns rather than auto-detecting.
+
+```bash
+# Exact match on email + phone only
+python -m src.cli customers.csv --subset email,phone --apply
+
+# Mix exact and fuzzy within a subset
+python -m src.cli customers.csv --subset email,name --fuzzy name --threshold 85 --apply
+```
+
+When using `--subset`, all listed columns must match (AND logic) for a pair to be considered duplicates.
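For the exact-match case, AND logic amounts to grouping rows by the tuple of all subset columns. The sketch below assumes values have already been normalized and uses a hypothetical function name.

```python
from collections import defaultdict


def group_by_all_columns(rows, columns):
    """Group row indices whose values agree on every listed column."""
    buckets = defaultdict(list)
    for index, row in enumerate(rows):
        key = tuple(row[column] for column in columns)
        buckets[key].append(index)
    return [members for members in buckets.values() if len(members) > 1]


customers = [
    {"email": "a@x.com", "phone": "+15551234567"},
    {"email": "a@x.com", "phone": "+15551234567"},
    {"email": "a@x.com", "phone": "+15559990000"},
]
print(group_by_all_columns(customers, ["email", "phone"]))  # [[0, 1]]
```

Row 2 shares the email but not the phone, so it stays out of the group.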
+
+### 6. Save and Load Config Profiles
+
+Save your settings for repeatable runs on similar files.
+
+```bash
+# Save settings to a file
+python -m src.cli customers.csv --fuzzy name --threshold 80 --merge \
+    --survivor most-complete --save-config customer_dedup.json
+
+# Load saved settings
+python -m src.cli new_customers.csv --config customer_dedup.json --apply
+```
+
+Config files are JSON. Example:
+
+```json
+{
+  "strategies": [],
+  "survivor_rule": "most_complete",
+  "merge": true,
+  "default_algorithm": "jaro_winkler",
+  "default_threshold": 80.0,
+  "fuzzy_columns": ["name"]
+}
+```
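Because the profile is plain JSON, it can be inspected or generated with the standard library. The sketch below is an illustration, not the tool's loader: `parse_config` is a hypothetical name, the key names come from the example above, and falling back to the documented CLI defaults for missing keys is an assumption.

```python
import json

# Defaults taken from the option tables in this reference; using them as
# fallbacks for missing keys is an assumption of this sketch.
DEFAULTS = {
    "survivor_rule": "first",
    "merge": False,
    "default_algorithm": "jaro_winkler",
    "default_threshold": 85.0,
    "fuzzy_columns": [],
}


def parse_config(text):
    """Parse config JSON and fill any missing keys from DEFAULTS."""
    loaded = json.loads(text)
    if not isinstance(loaded, dict):
        raise ValueError("config must be a JSON object")
    return {**DEFAULTS, **loaded}


example = '{"survivor_rule": "most_complete", "merge": true, "default_threshold": 80.0, "fuzzy_columns": ["name"]}'
settings = parse_config(example)
print(settings["default_algorithm"], settings["default_threshold"])  # jaro_winkler 80.0
```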
+
+### 7. Interactive Review
+
+Step through each match group and decide whether to merge.
+
+```bash
+python -m src.cli customers.csv --review --apply
+```
+
+For each group, the CLI lists the fields of each row and prompts:
+
+```
+============================================================
+Match Group 1 — Confidence: 92.3%
+Matched on: name, phone
+============================================================
+
+  Row 1:
+    name: John Smith
+    email: john@example.com
+    phone: (555) 123-4567
+
+  Row 2:
+    name: Jon Smith
+    email:
+    phone: 555-123-4567
+
+  [y] Merge  [n] Keep both  [s] Skip remaining:
+```
+
+- **y** — accept the match; merge/remove duplicate
+- **n** — reject the match; keep both rows
+- **s** — skip all remaining groups (keep both for all)
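The prompt logic can be sketched as a loop that stops asking once `s` is entered. `review_groups` and the `ask` callback are hypothetical; treating the current group as one of the "remaining" groups, and treating unrecognized input like `n`, are assumptions.

```python
def review_groups(groups, ask):
    """Return one decision ('merge' or 'keep') per match group."""
    decisions = []
    skip_rest = False
    for group in groups:
        if skip_rest:
            decisions.append("keep")  # no further prompts after 's'
            continue
        answer = ask(group).strip().lower()
        if answer == "y":
            decisions.append("merge")
        elif answer == "s":
            skip_rest = True
            decisions.append("keep")
        else:
            decisions.append("keep")  # 'n', and any other input (assumption)
    return decisions


# Scripted answers stand in for interactive input in this example.
scripted = iter(["y", "n", "s"])
print(review_groups(range(5), lambda group: next(scripted)))
```

Passing the prompt as a callback keeps the decision logic testable; an interactive run could supply a function built on `input()` instead.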
+
+### 8. Excel Files and Multi-Sheet
+
+Work with Excel files directly — no CSV conversion needed.
+
+```bash
+# Deduplicate first sheet (default)
+python -m src.cli data.xlsx --apply
+
+# Specify sheet by name
+python -m src.cli data.xlsx --sheet "Sales Data" --apply
+
+# Specify sheet by index (0-based, so 1 selects the second sheet)
+python -m src.cli data.xlsx --sheet 1 --apply
+```
+
+Output is CSV by default. To write Excel output, pass an `.xlsx` path to `-o`:
+
+```bash
+python -m src.cli data.xlsx -o cleaned.xlsx --apply
+```
+
+---
+
+## Auto-Detection Details
+
+When neither `--subset` nor `--fuzzy` is provided, the engine scans column names and builds strategies:
+
+| Column pattern | Detection regex | Algorithm | Threshold | Normalizer | Key type |
+|---------------|----------------|-----------|-----------|------------|----------|
+| Email | `e[-_]?mail` | exact | 100% | email | strong |
+| Phone | `phone\|telephone\|mobile\|cell` | exact | 100% | phone | strong |
+| Name | `^(name\|full_name\|customer_name\|...)$` | jaro_winkler | 85% | name | weak |
+| Address | `address\|street\|addr` | token_set_ratio | 80% | address | weak |
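The detection step can be approximated with the regexes from the table. This sketch makes several assumptions: matching is case-insensitive, patterns are applied with `re.search`, and the first matching pattern wins. The `name` alternation is truncated with `...` in the table, so only the names actually listed are used here.

```python
import re

PATTERNS = {
    "email": re.compile(r"e[-_]?mail", re.IGNORECASE),
    "phone": re.compile(r"phone|telephone|mobile|cell", re.IGNORECASE),
    # The table elides part of this alternation; only the listed names appear here.
    "name": re.compile(r"^(name|full_name|customer_name)$", re.IGNORECASE),
    "address": re.compile(r"address|street|addr", re.IGNORECASE),
}


def detect_column_types(columns):
    """Map each recognized column name to the first pattern it matches."""
    detected = {}
    for column in columns:
        for kind, pattern in PATTERNS.items():
            if pattern.search(column):
                detected[column] = kind
                break
    return detected


headers = ["Email", "home_phone", "full_name", "street_address", "notes"]
print(detect_column_types(headers))
```

Note that only the `name` pattern is anchored; the other three match anywhere inside a column name.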
+
+**Strategy building rules:**
+- Strong keys → standalone OR strategies (email match alone is enough)
+- Weak keys → paired with each strong key via AND (name match requires email or phone match too)
+- No strong keys found → weak keys promoted to standalone
+- No patterns matched → exact match on all columns (equivalent to `drop_duplicates`)
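The four rules above can be sketched as a small function. Each strategy is modeled as a tuple of columns that must all match (AND), and the returned list is OR'd. The function name and the `"strong"`/`"weak"` input shape are hypothetical.

```python
def build_strategies(detected, all_columns):
    """Build OR'd strategies; each tuple lists columns that must all match.

    `detected` maps a column name to "strong" or "weak".
    """
    strong = [col for col, kind in detected.items() if kind == "strong"]
    weak = [col for col, kind in detected.items() if kind == "weak"]

    if not strong and not weak:
        return [tuple(all_columns)]  # no patterns matched: exact match on all columns
    if not strong:
        return [(col,) for col in weak]  # weak keys promoted to standalone

    strategies = [(col,) for col in strong]  # each strong key stands alone
    strategies += [(w, s) for w in weak for s in strong]  # weak AND strong pairs
    return strategies


detected = {"email": "strong", "phone": "strong", "name": "weak"}
print(build_strategies(detected, ["email", "phone", "name", "notes"]))
# [('email',), ('phone',), ('name', 'email'), ('name', 'phone')]
```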
+
+## Output Files
+
+When `--apply` is set, three files are written:
+
+| File | Description |
+|------|-------------|
+| `{stem}_deduplicated.csv` | Cleaned data with duplicates removed (or the path given to `--output`) |
+| `{stem}_removed.csv` | Rows that were removed |
+| `{stem}_match_groups.csv` | Audit trail with `_group_id`, `_is_survivor`, `_confidence`, `_matched_on`, `_original_row`, plus all original columns |
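For scripts that pick up these files afterwards, the default names can be derived with `pathlib`. This sketch assumes `{stem}` is the input file name without its extension and that the files are written next to the input; `output_paths` is a hypothetical helper.

```python
from pathlib import Path


def output_paths(input_file):
    """Derive the three default output paths from the input file name."""
    source = Path(input_file)
    return {
        suffix: source.with_name(f"{source.stem}_{suffix}.csv")
        for suffix in ("deduplicated", "removed", "match_groups")
    }


paths = output_paths("data/customers.xlsx")
print(paths["removed"].name)  # customers_removed.csv
```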
+
+## Logging
+
+Every run writes a timestamped log to `logs/dedup_YYYYMMDD_HHMMSS.log` with full debug-level details: strategies used, pair comparisons, survivor decisions, and merge actions.
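The file name pattern corresponds to the `strftime` format `%Y%m%d_%H%M%S`. A short sketch, with `log_path` as a hypothetical helper and local time assumed:

```python
from datetime import datetime
from pathlib import Path


def log_path(now=None, directory="logs"):
    """Build a log file path of the form logs/dedup_YYYYMMDD_HHMMSS.log."""
    now = now or datetime.now()
    return Path(directory) / f"dedup_{now:%Y%m%d_%H%M%S}.log"


print(log_path(datetime(2026, 4, 28, 23, 6, 39)).as_posix())
# logs/dedup_20260428_230639.log
```

Because the timestamp sorts lexicographically, the most recent log is the last entry when the directory listing is sorted by name.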