feat: add documentation, Streamlit GUI, and full source tree
- Rewrite README.md with project overview, quick-start, and CLI summary - Add docs/CLI-REFERENCE.md with full flag reference and 8 recipe sections - Add docs/DEVELOPER.md with architecture, data flow, and extension guides - Rewrite src/core/__init__.py with public API exports and module docstring - Add Streamlit GUI (src/gui/) with file upload, advanced options, interactive match group review with side-by-side diff, and download buttons - Add .gitignore, requirements.txt, all source code, tests, and sample data - Add streamlit to requirements.txt Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
9
.gitignore
vendored
Normal file
9
.gitignore
vendored
Normal file
@@ -0,0 +1,9 @@
|
|||||||
|
.venv/
|
||||||
|
__pycache__/
|
||||||
|
*.pyc
|
||||||
|
*.pyo
|
||||||
|
logs/
|
||||||
|
*.egg-info/
|
||||||
|
dist/
|
||||||
|
build/
|
||||||
|
.pytest_cache/
|
||||||
124
README.md
124
README.md
@@ -1,3 +1,123 @@
|
|||||||
# datatools-dev
|
# DataTools Deduplicator
|
||||||
|
|
||||||
Data tools development
|
Find and remove duplicate rows in CSV and Excel files — with fuzzy matching, smart normalization, and interactive review.
|
||||||
|
|
||||||
|
## Features
|
||||||
|
|
||||||
|
- **Zero-config start** — auto-detects encoding, delimiters, headers, and match columns
|
||||||
|
- **Fuzzy matching** — Jaro-Winkler, Levenshtein, and token set ratio algorithms
|
||||||
|
- **5 built-in normalizers** — email (Gmail dot/plus), phone (E.164), name (titles/suffixes), address (USPS), string (whitespace/case)
|
||||||
|
- **Merge mode** — fill missing fields in the surviving row from removed duplicates
|
||||||
|
- **4 survivor rules** — keep first, last, most complete, or most recent row per group
|
||||||
|
- **Interactive review** — inspect each match group and decide: merge, keep both, or skip
|
||||||
|
- **Config profiles** — save and reload your settings as JSON for repeatable runs
|
||||||
|
- **Dual interface** — full CLI for automation, Streamlit GUI for visual review
|
||||||
|
- **Dry-run by default** — preview what would change before writing anything
|
||||||
|
- **Audit trail** — every run produces a match groups report and timestamped log
|
||||||
|
|
||||||
|
## Quick Start
|
||||||
|
|
||||||
|
### Install
|
||||||
|
|
||||||
|
```bash
|
||||||
|
pip install -r requirements.txt
|
||||||
|
```
|
||||||
|
|
||||||
|
### CLI
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Preview duplicates (dry run — no files written)
|
||||||
|
python -m src.cli customers.csv
|
||||||
|
|
||||||
|
# Remove duplicates and save the result
|
||||||
|
python -m src.cli customers.csv --apply
|
||||||
|
|
||||||
|
# Fuzzy-match names at 80% similarity, merge missing fields
|
||||||
|
python -m src.cli customers.csv --fuzzy name --threshold 80 --merge --apply
|
||||||
|
|
||||||
|
# Interactively review each match group
|
||||||
|
python -m src.cli customers.csv --review --apply
|
||||||
|
```
|
||||||
|
|
||||||
|
### GUI
|
||||||
|
|
||||||
|
```bash
|
||||||
|
streamlit run src/gui/app.py
|
||||||
|
```
|
||||||
|
|
||||||
|
Upload a file, click **Find Duplicates**, review match groups side-by-side, then download the cleaned result.
|
||||||
|
|
||||||
|
## CLI Usage Summary
|
||||||
|
|
||||||
|
```
|
||||||
|
python -m src.cli INPUT_FILE [OPTIONS]
|
||||||
|
|
||||||
|
Options:
|
||||||
|
--apply Write output files (default: preview only)
|
||||||
|
--output, -o PATH Output file path
|
||||||
|
--subset, -s COLS Columns to match on (comma-separated)
|
||||||
|
--key, -k COLS Strong-key columns for exact matching
|
||||||
|
--fuzzy COLS Columns to fuzzy-match
|
||||||
|
--algorithm, -a ALG levenshtein | jaro_winkler | token_set_ratio
|
||||||
|
--threshold, -t N Similarity threshold 0-100 (default: 85)
|
||||||
|
--normalize COL:TYPE Per-column normalizers (e.g., email:email,phone:phone)
|
||||||
|
--survivor RULE first | last | most-complete | most-recent
|
||||||
|
--merge Fill missing fields from removed duplicates
|
||||||
|
--review Interactively review each match group
|
||||||
|
--config PATH Load settings from a JSON config file
|
||||||
|
--save-config PATH Save current settings to JSON
|
||||||
|
--sheet NAME Excel sheet name or 0-based index
|
||||||
|
--encoding ENC Override auto-detected encoding
|
||||||
|
--header-row N 0-based header row index
|
||||||
|
--help Show full help
|
||||||
|
```
|
||||||
|
|
||||||
|
## Sample Output
|
||||||
|
|
||||||
|
```
|
||||||
|
$ python -m src.cli samples/messy_sales.csv
|
||||||
|
|
||||||
|
Reading messy_sales.csv...
|
||||||
|
50 rows, 8 columns
|
||||||
|
Finding duplicates...
|
||||||
|
|
||||||
|
──────────────────────────────────────────────────
|
||||||
|
File: messy_sales.csv
|
||||||
|
Rows in: 50
|
||||||
|
Rows out: 28
|
||||||
|
Removed: 22
|
||||||
|
Groups: 22
|
||||||
|
──────────────────────────────────────────────────
|
||||||
|
|
||||||
|
Match groups:
|
||||||
|
Group 1: rows [1, 2] → keep row 1 (confidence: 100.0%, matched on: email)
|
||||||
|
Group 2: rows [3, 4] → keep row 3 (confidence: 92.3%, matched on: name, phone)
|
||||||
|
...
|
||||||
|
|
||||||
|
This was a preview. Add --apply to write the output files.
|
||||||
|
```
|
||||||
|
|
||||||
|
## Output Files
|
||||||
|
|
||||||
|
When `--apply` is used, three files are produced:
|
||||||
|
|
||||||
|
| File | Contents |
|
||||||
|
|------|----------|
|
||||||
|
| `{input}_deduplicated.csv` | Cleaned data with duplicates removed |
|
||||||
|
| `{input}_removed.csv` | Rows that were removed |
|
||||||
|
| `{input}_match_groups.csv` | Audit trail: group ID, confidence, matched columns, survivor flag |
|
||||||
|
|
||||||
|
## Documentation
|
||||||
|
|
||||||
|
- [CLI Reference](docs/CLI-REFERENCE.md) — every flag with examples and recipe sections
|
||||||
|
- [Developer Guide](docs/DEVELOPER.md) — architecture, data flow, how to extend
|
||||||
|
- [User Guide](docs/USER-GUIDE.md) — installation and usage for end users
|
||||||
|
|
||||||
|
## Requirements
|
||||||
|
|
||||||
|
- Python 3.10+
|
||||||
|
- Dependencies: pandas, openpyxl, rapidfuzz, typer, phonenumbers, loguru, tqdm, charset-normalizer
|
||||||
|
|
||||||
|
## License
|
||||||
|
|
||||||
|
Proprietary. All rights reserved.
|
||||||
|
|||||||
284
docs/CLI-REFERENCE.md
Normal file
284
docs/CLI-REFERENCE.md
Normal file
@@ -0,0 +1,284 @@
|
|||||||
|
# CLI Reference
|
||||||
|
|
||||||
|
Complete command-line reference for the DataTools Deduplicator.
|
||||||
|
|
||||||
|
```
|
||||||
|
python -m src.cli INPUT_FILE [OPTIONS]
|
||||||
|
```
|
||||||
|
|
||||||
|
## Arguments
|
||||||
|
|
||||||
|
| Argument | Required | Description |
|
||||||
|
|----------|----------|-------------|
|
||||||
|
| `INPUT_FILE` | Yes | Path to the CSV or Excel file to deduplicate |
|
||||||
|
|
||||||
|
## Options
|
||||||
|
|
||||||
|
### Core
|
||||||
|
|
||||||
|
| Flag | Short | Default | Description |
|
||||||
|
|------|-------|---------|-------------|
|
||||||
|
| `--apply` | | `false` | Write output files. Without this flag, only a preview is shown. |
|
||||||
|
| `--output` | `-o` | `{input}_deduplicated.csv` | Output file path. |
|
||||||
|
|
||||||
|
### Column Selection
|
||||||
|
|
||||||
|
| Flag | Short | Default | Description |
|
||||||
|
|------|-------|---------|-------------|
|
||||||
|
| `--subset` | `-s` | auto-detect | Comma-separated columns to match on. When omitted, columns are auto-detected by name pattern (email, phone, name, address). |
|
||||||
|
| `--key` | `-k` | none | Comma-separated strong-key columns. Each becomes an independent exact-match strategy. Use for identifiers like `fb_id`, `ein`, `sku`. |
|
||||||
|
|
||||||
|
### Fuzzy Matching
|
||||||
|
|
||||||
|
| Flag | Short | Default | Description |
|
||||||
|
|------|-------|---------|-------------|
|
||||||
|
| `--fuzzy` | | none | Comma-separated columns to fuzzy-match. Other columns in the strategy use exact matching. |
|
||||||
|
| `--algorithm` | `-a` | `jaro_winkler` | Fuzzy algorithm: `levenshtein`, `jaro_winkler`, or `token_set_ratio`. |
|
||||||
|
| `--threshold` | `-t` | `85` | Similarity threshold 0-100. Lower values find more matches but increase false positives. |
|
||||||
|
|
||||||
|
### Normalization
|
||||||
|
|
||||||
|
| Flag | Short | Default | Description |
|
||||||
|
|------|-------|---------|-------------|
|
||||||
|
| `--normalize` | | auto-detect | Column normalizers as `col:type` pairs, comma-separated. Types: `email`, `phone`, `name`, `address`, `string`. |
|
||||||
|
|
||||||
|
**Normalizer details:**
|
||||||
|
|
||||||
|
| Type | What it does | Example |
|
||||||
|
|------|-------------|---------|
|
||||||
|
| `email` | Lowercase, strip Gmail dots, strip `+tag` suffixes | `John.Doe+tag@gmail.com` → `johndoe@gmail.com` |
|
||||||
|
| `phone` | Parse to E.164 format; fallback: digits only | `(555) 123-4567` → `+15551234567` |
|
||||||
|
| `name` | Strip titles (Dr., Mr.) and suffixes (Jr., PhD), case-fold | `Dr. John Smith Jr.` → `john smith` |
|
||||||
|
| `address` | USPS abbreviations (Street→St, Avenue→Ave), case-fold | `123 Main Street, Suite 4` → `123 main st ste 4` |
|
||||||
|
| `string` | Trim, collapse whitespace, case-fold | ` HELLO WORLD ` → `hello world` |
|
||||||
|
|
||||||
|
### Survivor Selection
|
||||||
|
|
||||||
|
| Flag | Short | Default | Description |
|
||||||
|
|------|-------|---------|-------------|
|
||||||
|
| `--survivor` | | `first` | Which row to keep per duplicate group. |
|
||||||
|
| `--date-column` | | none | Date column for the `most-recent` rule. |
|
||||||
|
| `--merge` | | `false` | Fill missing fields in the surviving row from removed duplicates. |
|
||||||
|
|
||||||
|
**Survivor rules:**
|
||||||
|
|
||||||
|
| Rule | Behavior |
|
||||||
|
|------|----------|
|
||||||
|
| `first` | Keep the first row encountered (lowest row number) |
|
||||||
|
| `last` | Keep the last row encountered (highest row number) |
|
||||||
|
| `most-complete` | Keep the row with the fewest blank/empty cells |
|
||||||
|
| `most-recent` | Keep the row with the latest date (requires `--date-column`) |
|
||||||
|
|
||||||
|
### Interactive Review
|
||||||
|
|
||||||
|
| Flag | Short | Default | Description |
|
||||||
|
|------|-------|---------|-------------|
|
||||||
|
| `--review` | | `false` | Interactively review each match group. For each group, choose: merge (y), keep both (n), or skip remaining (s). |
|
||||||
|
|
||||||
|
### Configuration
|
||||||
|
|
||||||
|
| Flag | Short | Default | Description |
|
||||||
|
|------|-------|---------|-------------|
|
||||||
|
| `--config` | | none | Load all settings from a saved JSON config file. |
|
||||||
|
| `--save-config` | | none | Save current settings to a JSON config file for reuse. |
|
||||||
|
|
||||||
|
### File Handling
|
||||||
|
|
||||||
|
| Flag | Short | Default | Description |
|
||||||
|
|------|-------|---------|-------------|
|
||||||
|
| `--sheet` | | first sheet | Excel sheet name or 0-based index. Ignored for CSV files. |
|
||||||
|
| `--encoding` | | auto-detect | Override auto-detected file encoding (e.g., `utf-8`, `windows-1252`). |
|
||||||
|
| `--header-row` | | auto-detect | 0-based row index for the header row. |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Recipes
|
||||||
|
|
||||||
|
### 1. Basic Dedup (Auto-Detect)
|
||||||
|
|
||||||
|
Let the engine detect email, phone, name, and address columns automatically.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Preview
|
||||||
|
python -m src.cli customers.csv
|
||||||
|
|
||||||
|
# Apply
|
||||||
|
python -m src.cli customers.csv --apply
|
||||||
|
```
|
||||||
|
|
||||||
|
The engine scans column names for patterns like `email`, `phone`, `name`, `address` and builds strategies automatically. Strong keys (email, phone) become standalone strategies; weak keys (name, address) are paired with strong keys.
|
||||||
|
|
||||||
|
### 2. Fuzzy Name Matching
|
||||||
|
|
||||||
|
Match rows where names are similar but not identical — catches typos, nickname variations, and formatting differences.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Fuzzy-match on the "name" column at 80% similarity
|
||||||
|
python -m src.cli customers.csv --fuzzy name --threshold 80 --apply
|
||||||
|
|
||||||
|
# Fuzzy-match on multiple columns
|
||||||
|
python -m src.cli customers.csv --fuzzy name,address --threshold 85 --apply
|
||||||
|
|
||||||
|
# Use Levenshtein distance instead of Jaro-Winkler
|
||||||
|
python -m src.cli customers.csv --fuzzy name --algorithm levenshtein --threshold 80 --apply
|
||||||
|
```
|
||||||
|
|
||||||
|
**Algorithm comparison:**
|
||||||
|
- `jaro_winkler` (default) — best for short strings like names; weights early characters more heavily
|
||||||
|
- `levenshtein` — edit-distance ratio; works well for typos and transpositions
|
||||||
|
- `token_set_ratio` — best for addresses and long strings; ignores word order
|
||||||
|
|
||||||
|
### 3. Custom Strong Keys
|
||||||
|
|
||||||
|
Use specific identifier columns to find exact duplicates.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Deduplicate by Facebook ID
|
||||||
|
python -m src.cli donors.csv --key fb_id --apply
|
||||||
|
|
||||||
|
# Multiple strong keys (each is independent — matched with OR)
|
||||||
|
python -m src.cli donors.csv --key fb_id,ein --apply
|
||||||
|
```
|
||||||
|
|
||||||
|
Strong keys are OR'd: a match on `fb_id` alone OR `ein` alone marks rows as duplicates.
|
||||||
|
|
||||||
|
### 4. Merge Mode
|
||||||
|
|
||||||
|
Keep the most complete row and fill any remaining blanks from the duplicates.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Most complete row + merge missing fields
|
||||||
|
python -m src.cli contacts.csv --survivor most-complete --merge --apply
|
||||||
|
|
||||||
|
# Keep most recent row and merge
|
||||||
|
python -m src.cli contacts.csv --survivor most-recent --date-column updated_at --merge --apply
|
||||||
|
```
|
||||||
|
|
||||||
|
**How merge works:** The survivor row keeps all its non-empty fields. For any blank/null fields, the engine fills from the removed rows (in row order). The result is a single row with maximum data retention.
|
||||||
|
|
||||||
|
### 5. Multi-Column Subset
|
||||||
|
|
||||||
|
Match on a specific combination of columns rather than auto-detecting.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Exact match on email + phone only
|
||||||
|
python -m src.cli customers.csv --subset email,phone --apply
|
||||||
|
|
||||||
|
# Mix exact and fuzzy within a subset
|
||||||
|
python -m src.cli customers.csv --subset email,name --fuzzy name --threshold 85 --apply
|
||||||
|
```
|
||||||
|
|
||||||
|
When using `--subset`, all listed columns must match (AND logic) for a pair to be considered duplicates.
|
||||||
|
|
||||||
|
### 6. Save and Load Config Profiles
|
||||||
|
|
||||||
|
Save your settings for repeatable runs on similar files.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Save settings to a file
|
||||||
|
python -m src.cli customers.csv --fuzzy name --threshold 80 --merge \
|
||||||
|
--survivor most-complete --save-config customer_dedup.json
|
||||||
|
|
||||||
|
# Load saved settings
|
||||||
|
python -m src.cli new_customers.csv --config customer_dedup.json --apply
|
||||||
|
```
|
||||||
|
|
||||||
|
Config files are JSON. Example:
|
||||||
|
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"strategies": [],
|
||||||
|
"survivor_rule": "most_complete",
|
||||||
|
"merge": true,
|
||||||
|
"default_algorithm": "jaro_winkler",
|
||||||
|
"default_threshold": 80.0,
|
||||||
|
"fuzzy_columns": ["name"]
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
### 7. Interactive Review
|
||||||
|
|
||||||
|
Step through each match group and decide whether to merge.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
python -m src.cli customers.csv --review --apply
|
||||||
|
```
|
||||||
|
|
||||||
|
For each group, the CLI displays both rows side-by-side and prompts:
|
||||||
|
|
||||||
|
```
|
||||||
|
============================================================
|
||||||
|
Match Group 1 — Confidence: 92.3%
|
||||||
|
Matched on: name, phone
|
||||||
|
============================================================
|
||||||
|
|
||||||
|
Row 1:
|
||||||
|
name: John Smith
|
||||||
|
email: john@example.com
|
||||||
|
phone: (555) 123-4567
|
||||||
|
|
||||||
|
Row 2:
|
||||||
|
name: Jon Smith
|
||||||
|
email:
|
||||||
|
phone: 555-123-4567
|
||||||
|
|
||||||
|
[y] Merge [n] Keep both [s] Skip remaining:
|
||||||
|
```
|
||||||
|
|
||||||
|
- **y** — accept the match; merge/remove duplicate
|
||||||
|
- **n** — reject the match; keep both rows
|
||||||
|
- **s** — skip all remaining groups (keep both for all)
|
||||||
|
|
||||||
|
### 8. Excel Files and Multi-Sheet
|
||||||
|
|
||||||
|
Work with Excel files directly — no CSV conversion needed.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Deduplicate first sheet (default)
|
||||||
|
python -m src.cli data.xlsx --apply
|
||||||
|
|
||||||
|
# Specify sheet by name
|
||||||
|
python -m src.cli data.xlsx --sheet "Sales Data" --apply
|
||||||
|
|
||||||
|
# Specify sheet by index (0-based)
|
||||||
|
python -m src.cli data.xlsx --sheet 1 --apply
|
||||||
|
```
|
||||||
|
|
||||||
|
Output is always CSV by default. To write Excel output, use `-o`:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
python -m src.cli data.xlsx -o cleaned.xlsx --apply
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Auto-Detection Details
|
||||||
|
|
||||||
|
When no `--subset` or `--fuzzy` flags are provided, the engine scans column names and builds strategies:
|
||||||
|
|
||||||
|
| Column pattern | Detection regex | Algorithm | Threshold | Normalizer | Key type |
|
||||||
|
|---------------|----------------|-----------|-----------|------------|----------|
|
||||||
|
| Email | `e[-_]?mail` | exact | 100% | email | strong |
|
||||||
|
| Phone | `phone\|telephone\|mobile\|cell` | exact | 100% | phone | strong |
|
||||||
|
| Name | `^(name\|full_name\|customer_name\|...)$` | jaro_winkler | 85% | name | weak |
|
||||||
|
| Address | `address\|street\|addr` | token_set_ratio | 80% | address | weak |
|
||||||
|
|
||||||
|
**Strategy building rules:**
|
||||||
|
- Strong keys → standalone OR strategies (email match alone is enough)
|
||||||
|
- Weak keys → paired with each strong key via AND (name match requires email or phone match too)
|
||||||
|
- No strong keys found → weak keys promoted to standalone
|
||||||
|
- No patterns matched → exact match on all columns (equivalent to `drop_duplicates`)
|
||||||
|
|
||||||
|
## Output Files
|
||||||
|
|
||||||
|
When `--apply` is set, three files are written:
|
||||||
|
|
||||||
|
| File | Description |
|
||||||
|
|------|-------------|
|
||||||
|
| `{stem}_deduplicated.csv` | Cleaned DataFrame with duplicates removed |
|
||||||
|
| `{stem}_removed.csv` | Rows that were removed |
|
||||||
|
| `{stem}_match_groups.csv` | Audit trail with `_group_id`, `_is_survivor`, `_confidence`, `_matched_on`, `_original_row`, plus all original columns |
|
||||||
|
|
||||||
|
## Logging
|
||||||
|
|
||||||
|
Every run writes a timestamped log to `logs/dedup_YYYYMMDD_HHMMSS.log` with full debug-level details: strategies used, pair comparisons, survivor decisions, and merge actions.
|
||||||
282
docs/DEVELOPER.md
Normal file
282
docs/DEVELOPER.md
Normal file
@@ -0,0 +1,282 @@
|
|||||||
|
# Developer Guide
|
||||||
|
|
||||||
|
Architecture, data flow, and extension guide for the DataTools Deduplicator.
|
||||||
|
|
||||||
|
## Architecture
|
||||||
|
|
||||||
|
```
|
||||||
|
CLI (src/cli.py) GUI (src/gui/app.py)
|
||||||
|
│ │
|
||||||
|
│ flags → strategies │ widgets → strategies
|
||||||
|
│ _interactive_review() │ match_group_card()
|
||||||
|
│ tqdm progress bar │ st.progress()
|
||||||
|
│ │
|
||||||
|
└──────────┐ ┌────────────────┘
|
||||||
|
│ │
|
||||||
|
▼ ▼
|
||||||
|
┌─────────────────┐
|
||||||
|
│ core.dedup │
|
||||||
|
│ deduplicate() │
|
||||||
|
└────────┬────────┘
|
||||||
|
│
|
||||||
|
┌────────────┼────────────┐
|
||||||
|
▼ ▼ ▼
|
||||||
|
core.io core.normalizers core.config
|
||||||
|
read/write normalize_*() save/load JSON
|
||||||
|
```
|
||||||
|
|
||||||
|
**Key principle:** All business logic lives in `src/core/`. The CLI and GUI are thin wrappers that translate user input into `deduplicate()` arguments and display the `DeduplicationResult`.
|
||||||
|
|
||||||
|
## File-by-File Reference
|
||||||
|
|
||||||
|
### src/core/dedup.py — Deduplication Engine
|
||||||
|
|
||||||
|
The central module. Contains:
|
||||||
|
|
||||||
|
- **Enums:** `Algorithm` (4 fuzzy algorithms), `SurvivorRule` (4 selection rules)
|
||||||
|
- **Data classes:** `ColumnMatchStrategy`, `MatchStrategy`, `MatchResult`, `DeduplicationResult`
|
||||||
|
- **`deduplicate()`** — main entry point. Takes a DataFrame + optional strategies/rules, returns a `DeduplicationResult` with deduplicated DataFrame, removed rows, match groups, and log entries.
|
||||||
|
- **`build_default_strategies()`** — scans column names with regex patterns to auto-detect email, phone, name, and address columns. Builds strong/weak key strategies with appropriate algorithms and normalizers.
|
||||||
|
- **`_UnionFind`** — disjoint-set data structure for transitive closure. If A matches B and B matches C, all three end up in one group.
|
||||||
|
- **`_find_match_groups()`** — O(n^2) pairwise comparison. For each pair, tries all strategies (OR semantics). Feeds matches into union-find. Returns match groups with confidence scores.
|
||||||
|
- **`_select_survivor()`** — picks the row to keep based on the survivor rule.
|
||||||
|
- **`_merge_group()`** — fills blank fields in the survivor from loser rows.
|
||||||
|
|
||||||
|
### src/core/normalizers.py — Text Normalization
|
||||||
|
|
||||||
|
Five normalizer functions, each `str → str`, idempotent, None-safe:
|
||||||
|
|
||||||
|
- **`normalize_email()`** — lowercase, strip Gmail dots, strip `+tag` suffixes
|
||||||
|
- **`normalize_phone()`** — parse with `phonenumbers` to E.164; fallback to digits-only
|
||||||
|
- **`normalize_name()`** — strip title prefixes (Dr., Mr.) and suffixes (Jr., PhD), case-fold
|
||||||
|
- **`normalize_address()`** — USPS abbreviations (Street→St, Avenue→Ave), case-fold
|
||||||
|
- **`normalize_string()`** — trim, collapse whitespace, case-fold
|
||||||
|
|
||||||
|
The `get_normalizer()` registry function maps `NormalizerType` enum values to functions.
|
||||||
|
|
||||||
|
### src/core/io.py — File I/O
|
||||||
|
|
||||||
|
Auto-detection stack:
|
||||||
|
|
||||||
|
1. **`detect_encoding()`** — checks BOM, then uses `charset-normalizer` heuristics
|
||||||
|
2. **`detect_delimiter()`** — uses `csv.Sniffer` on first 20 lines
|
||||||
|
3. **`detect_header_row()`** — finds first row where all cells look like column names
|
||||||
|
|
||||||
|
Main functions:
|
||||||
|
- **`read_file()`** — reads CSV/TSV/Excel with full auto-detection. Returns a DataFrame.
|
||||||
|
- **`write_file()`** — writes DataFrame to CSV or Excel. Uses `utf-8-sig` by default for Windows Excel compatibility.
|
||||||
|
- **`list_sheets()`** — returns sheet names from an Excel workbook.
|
||||||
|
|
||||||
|
### src/core/config.py — Configuration Profiles
|
||||||
|
|
||||||
|
Save/load deduplication settings as JSON:
|
||||||
|
|
||||||
|
- **`DeduplicationConfig`** — flat dataclass with all settings: strategies, survivor rule, merge flag, algorithm, threshold, normalizer map.
|
||||||
|
- **`.to_file()` / `.from_file()`** — JSON serialization
|
||||||
|
- **`.to_strategies()`** — converts config back to `MatchStrategy` objects for the engine
|
||||||
|
- **`.to_survivor_rule()`** — converts string to `SurvivorRule` enum
|
||||||
|
|
||||||
|
### src/cli.py — Command-Line Interface
|
||||||
|
|
||||||
|
Typer-based CLI with 17 options. Key responsibilities:
|
||||||
|
|
||||||
|
- Parse flags into strategies, survivor rule, and other config
|
||||||
|
- Set up logging (timestamped log files in `logs/`)
|
||||||
|
- Column name validation with fuzzy suggestions on typos
|
||||||
|
- `_interactive_review()` — side-by-side row display with y/n/s prompts
|
||||||
|
- Progress bar via `tqdm` for files > 10,000 rows
|
||||||
|
- Output formatting and file writing
|
||||||
|
|
||||||
|
### src/gui/app.py — Streamlit GUI
|
||||||
|
|
||||||
|
Single-page layout:
|
||||||
|
- File upload with instant preview
|
||||||
|
- Advanced options expander (column selection, fuzzy, normalizers, survivor rule, merge, config profiles)
|
||||||
|
- Find Duplicates button → runs `deduplicate()` with `progress_callback`
|
||||||
|
- Interactive review: expandable match group cards with merge/keep/skip buttons
|
||||||
|
- Download buttons for deduplicated CSV, removed rows, and match groups report
|
||||||
|
|
||||||
|
### src/gui/components.py — Reusable GUI Widgets
|
||||||
|
|
||||||
|
- **`match_group_card()`** — expandable card showing side-by-side row comparison with diff highlighting
|
||||||
|
- **`config_panel()`** — the advanced options expander, returns a `DeduplicationConfig`
|
||||||
|
- **`results_summary()`** — summary stats and download buttons
|
||||||
|
|
||||||
|
## Data Flow
|
||||||
|
|
||||||
|
```
|
||||||
|
Input File
|
||||||
|
│
|
||||||
|
▼
|
||||||
|
read_file() ← auto-detect encoding, delimiter, header
|
||||||
|
│
|
||||||
|
▼
|
||||||
|
DataFrame
|
||||||
|
│
|
||||||
|
▼
|
||||||
|
build_default_strategies() ← (if no explicit strategies)
|
||||||
|
│ scan column names → regex patterns
|
||||||
|
│ strong keys: email, phone (standalone OR)
|
||||||
|
│ weak keys: name, address (AND with strong)
|
||||||
|
▼
|
||||||
|
_apply_normalizations() ← add _norm_* shadow columns
|
||||||
|
│ normalize_email(), normalize_phone(), etc.
|
||||||
|
▼
|
||||||
|
_find_match_groups() ← O(n²) pairwise comparison
|
||||||
|
│ for each pair: try all strategies (OR)
|
||||||
|
│ _compute_similarity() per column
|
||||||
|
│ union-find for transitive closure
|
||||||
|
▼
|
||||||
|
[review_callback()] ← optional: interactive review per group
|
||||||
|
│ True=accept, False=reject, None=skip
|
||||||
|
▼
|
||||||
|
_select_survivor() ← per group: first/last/most-complete/most-recent
|
||||||
|
│
|
||||||
|
▼
|
||||||
|
[_merge_group()] ← optional: fill blanks from losers
|
||||||
|
│
|
||||||
|
▼
|
||||||
|
DeduplicationResult
|
||||||
|
├── deduplicated_df ← cleaned DataFrame (shadow cols dropped)
|
||||||
|
├── removed_df ← rows that were removed
|
||||||
|
├── match_groups ← list of MatchResult with confidence, columns
|
||||||
|
└── log_entries ← human-readable audit log
|
||||||
|
```
|
||||||
|
|
||||||
|
## How to Add a Normalizer
|
||||||
|
|
||||||
|
1. **Add the function** in `src/core/normalizers.py`:
|
||||||
|
|
||||||
|
```python
|
||||||
|
def normalize_company(value: Optional[str]) -> str:
|
||||||
|
"""Strip legal suffixes (Inc, LLC, Corp), case-fold."""
|
||||||
|
if not value or not isinstance(value, str):
|
||||||
|
return ""
|
||||||
|
name = value.strip().casefold()
|
||||||
|
# Strip common suffixes
|
||||||
|
for suffix in ("inc", "llc", "corp", "ltd", "co"):
|
||||||
|
name = re.sub(rf"\b{suffix}\.?\s*$", "", name).strip()
|
||||||
|
return name
|
||||||
|
```
|
||||||
|
|
||||||
|
2. **Register it** in the same file:
|
||||||
|
|
||||||
|
```python
|
||||||
|
class NormalizerType(str, Enum):
|
||||||
|
# ... existing types ...
|
||||||
|
COMPANY = "company" # ← add enum value
|
||||||
|
|
||||||
|
_NORMALIZER_MAP: dict[NormalizerType, Callable[[str], str]] = {
|
||||||
|
# ... existing entries ...
|
||||||
|
NormalizerType.COMPANY: normalize_company, # ← add mapping
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
3. **Add auto-detection pattern** in `src/core/dedup.py` (optional):
|
||||||
|
|
||||||
|
```python
|
||||||
|
_COLUMN_TYPE_PATTERNS = [
|
||||||
|
# ... existing patterns ...
|
||||||
|
(re.compile(r"company|organization|org_name", re.I),
|
||||||
|
NormalizerType.COMPANY, Algorithm.TOKEN_SET_RATIO, 85.0, False),
|
||||||
|
]
|
||||||
|
```
|
||||||
|
|
||||||
|
## How to Add a Matching Algorithm
|
||||||
|
|
||||||
|
1. **Add the enum value** in `src/core/dedup.py`:
|
||||||
|
|
||||||
|
```python
|
||||||
|
class Algorithm(str, Enum):
|
||||||
|
# ... existing values ...
|
||||||
|
SOUNDEX = "soundex"
|
||||||
|
```
|
||||||
|
|
||||||
|
2. **Add the computation** in `_compute_similarity()`:
|
||||||
|
|
||||||
|
```python
|
||||||
|
def _compute_similarity(val_a: str, val_b: str, algorithm: Algorithm) -> float:
|
||||||
|
# ... existing cases ...
|
||||||
|
if algorithm == Algorithm.SOUNDEX:
|
||||||
|
return 100.0 if _soundex(val_a) == _soundex(val_b) else 0.0
|
||||||
|
```
|
||||||
|
|
||||||
|
3. **Add the CLI flag value** in `src/cli.py` help text for `--algorithm`.
|
||||||
|
|
||||||
|
## How to Add a Survivor Strategy
|
||||||
|
|
||||||
|
1. **Add the enum value** in `src/core/dedup.py`:
|
||||||
|
|
||||||
|
```python
|
||||||
|
class SurvivorRule(str, Enum):
|
||||||
|
# ... existing values ...
|
||||||
|
KEEP_LONGEST = "longest"
|
||||||
|
```
|
||||||
|
|
||||||
|
2. **Add the logic** in `_select_survivor()`:
|
||||||
|
|
||||||
|
```python
|
||||||
|
if rule == SurvivorRule.KEEP_LONGEST:
|
||||||
|
return max(indices, key=lambda i: len(str(df.iloc[i].to_dict())))
|
||||||
|
```
|
||||||
|
|
||||||
|
3. **Add to the CLI** survivor map in `src/cli.py`.
|
||||||
|
|
||||||
|
## Testing
|
||||||
|
|
||||||
|
### Run Tests
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# All tests
|
||||||
|
pytest tests/ -q
|
||||||
|
|
||||||
|
# Specific module
|
||||||
|
pytest tests/test_dedup.py -q
|
||||||
|
pytest tests/test_normalizers.py -q
|
||||||
|
pytest tests/test_io.py -q
|
||||||
|
pytest tests/test_config.py -q
|
||||||
|
pytest tests/test_cli.py -q
|
||||||
|
|
||||||
|
# Verbose with output
|
||||||
|
pytest tests/ -v
|
||||||
|
|
||||||
|
# Stop on first failure
|
||||||
|
pytest tests/ -x
|
||||||
|
```
|
||||||
|
|
||||||
|
### Test Structure
|
||||||
|
|
||||||
|
```
|
||||||
|
tests/
|
||||||
|
├── conftest.py # Shared fixtures
|
||||||
|
│ ├── sample_csv_path # Path to samples/messy_sales.csv
|
||||||
|
│ ├── sample_df # Loaded sample CSV as DataFrame
|
||||||
|
│ ├── simple_df # Small 5-row DataFrame with obvious duplicates
|
||||||
|
│ ├── merge_df # DataFrame with partial records
|
||||||
|
│ └── tmp_csv # Temporary CSV from simple_df
|
||||||
|
├── test_dedup.py # Engine tests: similarity, union-find, pairs, integration
|
||||||
|
├── test_normalizers.py # Normalizer tests: all 5 types with edge cases
|
||||||
|
├── test_io.py # I/O tests: encoding, delimiter, header, read/write
|
||||||
|
├── test_config.py # Config tests: serialization round-trip
|
||||||
|
└── test_cli.py # CLI tests: argument parsing, file handling
|
||||||
|
```
|
||||||
|
|
||||||
|
### Writing Tests
|
||||||
|
|
||||||
|
Follow existing patterns. Tests use pytest fixtures from `conftest.py`:
|
||||||
|
|
||||||
|
```python
|
||||||
|
def test_my_feature(simple_df):
|
||||||
|
"""Test description."""
|
||||||
|
result = deduplicate(simple_df, ...)
|
||||||
|
assert len(result.match_groups) == expected
|
||||||
|
assert result.deduplicated_df.shape[0] == expected_rows
|
||||||
|
```
|
||||||
|
|
||||||
|
## Known Limitations
|
||||||
|
|
||||||
|
- **O(n^2) pairwise comparison** — no blocking or indexing. Works well up to ~50,000 rows. Beyond that, performance degrades quadratically. Future optimization: add blocking (partition by first letter, zip code prefix, etc.) to reduce comparison space.
|
||||||
|
- **No multi-sheet dedup** — each Excel sheet is processed independently. Cross-sheet deduplication is not supported.
|
||||||
|
- **Phone normalization requires valid-length numbers** — the `phonenumbers` library rejects numbers that are too short or too long for the detected region. Fallback is digits-only, which may produce false negatives for international numbers without country codes.
|
||||||
|
- **Single-threaded** — no parallel comparison. Could benefit from `multiprocessing` for large files.
|
||||||
|
- **Memory-bound** — entire file is loaded into a pandas DataFrame. Files larger than available RAM will fail. Chunked reading exists but is not integrated with the dedup engine.
|
||||||
2
requirements-dev.txt
Normal file
2
requirements-dev.txt
Normal file
@@ -0,0 +1,2 @@
|
|||||||
|
pytest>=8.0,<9
|
||||||
|
pytest-cov>=5.0,<6
|
||||||
10
requirements.txt
Normal file
10
requirements.txt
Normal file
@@ -0,0 +1,10 @@
|
|||||||
|
pandas>=2.2,<3
|
||||||
|
openpyxl>=3.1,<4
|
||||||
|
numpy>=1.26,<3
|
||||||
|
rapidfuzz>=3.6,<4
|
||||||
|
charset-normalizer>=3.3,<4
|
||||||
|
loguru>=0.7,<1
|
||||||
|
tqdm>=4.66,<5
|
||||||
|
typer>=0.12,<1
|
||||||
|
phonenumbers>=8.13,<9
|
||||||
|
streamlit>=1.35,<2
|
||||||
51
samples/messy_sales.csv
Normal file
51
samples/messy_sales.csv
Normal file
@@ -0,0 +1,51 @@
|
|||||||
|
order_id,customer_name,email,phone,address,product,amount,date
|
||||||
|
1001,John Smith,john.smith@gmail.com,(555) 123-4567,123 Main Street Apt 4,Widget Pro,49.99,2024-01-15
|
||||||
|
1002,John Smith,johnsmith@gmail.com,555-123-4567,123 Main St Apt 4,Widget Pro,49.99,2024-01-15
|
||||||
|
1003,Mr. John Smith,john.smith+promo@gmail.com,+1 555-123-4567,123 Main St. Apartment 4,Widget Pro,,2024-01-15
|
||||||
|
1004,Sarah Johnson,sarah.j@example.com,(555) 234-5678,456 Oak Avenue,Gadget X,89.99,2024-02-01
|
||||||
|
1005,Sara Johnson,sarah.j@example.com,555.234.5678,456 Oak Ave,Gadget X,89.99,2024-02-03
|
||||||
|
1006,Sarah M. Johnson,sarahj@example.com,5552345678,456 Oak Ave Suite 2,,89.99,2024-02-05
|
||||||
|
1007,Michael Williams,mike.w@company.org,(555) 345-6789,789 Pine Boulevard,Super Tool,129.99,2024-02-10
|
||||||
|
1008,Mike Williams,mike.w@company.org,555-345-6789,789 Pine Blvd,Super Tool,129.99,2024-02-10
|
||||||
|
1009,Dr. Michael Williams III,michael.williams@company.org,+15553456789,789 Pine Blvd,,129.99,2024-02-12
|
||||||
|
1010,Emily Davis,emily.davis@mail.com,(555) 456-7890,321 Elm Drive,Basic Set,29.99,2024-03-01
|
||||||
|
1011,Emily Davis,emily.davis@mail.com,(555) 456-7890,321 Elm Drive,Basic Set,29.99,2024-03-01
|
||||||
|
1012,Robert Brown,r.brown@email.net,(555) 567-8901,654 Cedar Lane,Premium Kit,199.99,2024-03-15
|
||||||
|
1013,Bob Brown,r.brown@email.net,555-567-8901,654 Cedar Ln,Premium Kit,199.99,2024-03-15
|
||||||
|
1014,Robert J. Brown Jr.,rj.brown@email.net,5555678901,654 Cedar Lane,Premium Kit,,2024-03-17
|
||||||
|
1015,Jennifer Wilson,jen.wilson@web.io,(555) 678-9012,987 Birch Court,Starter Pack,19.99,2024-04-01
|
||||||
|
1016,Jennifer Wilson,jen.wilson@web.io,(555) 678-9012,987 Birch Ct,Starter Pack,19.99,2024-04-01
|
||||||
|
1017,David Martinez,david.m@service.com,(555) 789-0123,246 Maple Road,Deluxe Widget,79.99,2024-04-15
|
||||||
|
1018,David Martinez,d.martinez@service.com,555-789-0123,246 Maple Rd,Deluxe Widget,79.99,2024-04-15
|
||||||
|
1019,Lisa Anderson,lisa.a@domain.com,(555) 890-1234,135 Walnut Way,Mini Tool,14.99,2024-05-01
|
||||||
|
1020,Lisa M Anderson,lisa.a@domain.com,555-890-1234,135 Walnut Way,Mini Tool,14.99,2024-05-01
|
||||||
|
1021,Thomas Taylor,tom.taylor@inbox.com,(555) 901-2345,864 Spruce Trail,Combo Pack,59.99,2024-05-15
|
||||||
|
1022,Tom Taylor,tom.taylor@inbox.com,555-901-2345,864 Spruce Trl,Combo Pack,59.99,2024-05-15
|
||||||
|
1023,Prof. Thomas R. Taylor,thomas.taylor@inbox.com,+1-555-901-2345,864 Spruce Trail Unit 3,,59.99,2024-05-18
|
||||||
|
1024,Karen White,karen.w@test.com,(555) 012-3456,579 Ash Circle,Widget Pro,49.99,2024-06-01
|
||||||
|
1025,Karen White,karen.w+newsletter@test.com,555-012-3456,579 Ash Cir,Widget Pro,49.99,2024-06-01
|
||||||
|
1026,James Clark,j.clark@biz.co,(555) 111-2222,111 First Street North,Enterprise License,499.99,2024-06-15
|
||||||
|
1027,James Clark,j.clark@biz.co,555-111-2222,111 1st St N,Enterprise License,499.99,2024-06-15
|
||||||
|
1028,Patricia Lee,pat.lee@mail.com,(555) 222-3333,222 Second Avenue South,Basic Set,29.99,2024-07-01
|
||||||
|
1029,Pat Lee,pat.lee@mail.com,555-222-3333,222 2nd Ave S,Basic Set,29.99,2024-07-01
|
||||||
|
1030,Christopher Hall,chris.hall@email.org,(555) 333-4444,333 Third Boulevard East,Gadget X,89.99,2024-07-15
|
||||||
|
1031,Chris Hall,chris.hall@email.org,555-333-4444,333 3rd Blvd E,Gadget X,89.99,2024-07-15
|
||||||
|
1032,Amanda Young,amanda.y@web.net,(555) 444-5555,444 Oak Parkway,Super Tool,129.99,2024-08-01
|
||||||
|
1033,Amanda Young,amanda.y@web.net,(555) 444-5555,444 Oak Pkwy,Super Tool,129.99,2024-08-01
|
||||||
|
1034,Daniel King,dan.king@corp.io,(555) 555-6666,555 Elm Square,Premium Kit,199.99,2024-08-15
|
||||||
|
1035,Dan King,dan.king@corp.io,555.555.6666,555 Elm Sq,Premium Kit,199.99,2024-08-15
|
||||||
|
1036,Michelle Scott,m.scott@place.com,(555) 666-7777,666 Pine Highway,Deluxe Widget,79.99,2024-09-01
|
||||||
|
1037,Michelle Scott,m.scott@place.com,(555) 666-7777,666 Pine Hwy,Deluxe Widget,79.99,2024-09-01
|
||||||
|
1038,Kevin Adams,kevin.a@shop.biz,(555) 777-8888,777 Cedar Terrace,Starter Pack,19.99,2024-09-15
|
||||||
|
1039,Kevin Adams,kevin.a+deals@shop.biz,555-777-8888,777 Cedar Ter,Starter Pack,19.99,2024-09-15
|
||||||
|
1040,Nancy Wright,nancy.w@home.net,(555) 888-9999,888 Maple Place,Mini Tool,14.99,2024-10-01
|
||||||
|
1041,Nancy Wright,nancy.w@home.net,(555) 888-9999,888 Maple Pl,Mini Tool,14.99,2024-10-01
|
||||||
|
1042,George Lopez,g.lopez@firma.com,(555) 999-0000,999 Walnut Expressway,Combo Pack,59.99,2024-10-15
|
||||||
|
1043,George Lopez,g.lopez@firma.com,555-999-0000,999 Walnut Expy,Combo Pack,59.99,2024-10-15
|
||||||
|
1044,Sandra Hill,s.hill@provider.co,(555) 101-2020,1010 Spruce Crossing,Widget Pro,49.99,2024-11-01
|
||||||
|
1045,Sandra Hill,s.hill@provider.co,555-101-2020,1010 Spruce Xing,Widget Pro,49.99,2024-11-01
|
||||||
|
1046,Mark Robinson,mark.r@office.com,(555) 202-3030,2020 Ash Heights,Enterprise License,499.99,2024-11-15
|
||||||
|
1047,Mark Robinson,mark.r@office.com,555-202-3030,2020 Ash Hts,Enterprise License,499.99,2024-11-15
|
||||||
|
1048,Unique Customer One,unique1@solo.com,(555) 000-0001,1 Unique Road,Widget Pro,49.99,2024-12-01
|
||||||
|
1049,Unique Customer Two,unique2@solo.com,(555) 000-0002,2 Unique Road,Gadget X,89.99,2024-12-02
|
||||||
|
1050,Unique Customer Three,unique3@solo.com,(555) 000-0003,3 Unique Road,Super Tool,129.99,2024-12-03
|
||||||
|
50
samples/messy_sales_match_groups.csv
Normal file
50
samples/messy_sales_match_groups.csv
Normal file
@@ -0,0 +1,50 @@
|
|||||||
|
_group_id,_is_survivor,_confidence,_matched_on,_original_row,order_id,customer_name,email,phone,address,product,amount,date
|
||||||
|
1,True,83.81,customer_name,1,1001,John Smith,john.smith@gmail.com,(555) 123-4567,123 Main Street Apt 4,Widget Pro,49.99,2024-01-15
|
||||||
|
1,False,83.81,customer_name,2,1002,John Smith,johnsmith@gmail.com,555-123-4567,123 Main St Apt 4,Widget Pro,49.99,2024-01-15
|
||||||
|
1,False,83.81,customer_name,3,1003,Mr. John Smith,john.smith+promo@gmail.com,+1 555-123-4567,123 Main St. Apartment 4,Widget Pro,,2024-01-15
|
||||||
|
2,True,91.67,customer_name,4,1004,Sarah Johnson,sarah.j@example.com,(555) 234-5678,456 Oak Avenue,Gadget X,89.99,2024-02-01
|
||||||
|
2,False,91.67,customer_name,5,1005,Sara Johnson,sarah.j@example.com,555.234.5678,456 Oak Ave,Gadget X,89.99,2024-02-03
|
||||||
|
2,False,91.67,customer_name,6,1006,Sarah M. Johnson,sarahj@example.com,5552345678,456 Oak Ave Suite 2,,89.99,2024-02-05
|
||||||
|
3,True,80.56,customer_name,7,1007,Michael Williams,mike.w@company.org,(555) 345-6789,789 Pine Boulevard,Super Tool,129.99,2024-02-10
|
||||||
|
3,False,80.56,customer_name,8,1008,Mike Williams,mike.w@company.org,555-345-6789,789 Pine Blvd,Super Tool,129.99,2024-02-10
|
||||||
|
3,False,80.56,customer_name,9,1009,Dr. Michael Williams III,michael.williams@company.org,+15553456789,789 Pine Blvd,,129.99,2024-02-12
|
||||||
|
4,True,100.0,customer_name,10,1010,Emily Davis,emily.davis@mail.com,(555) 456-7890,321 Elm Drive,Basic Set,29.99,2024-03-01
|
||||||
|
4,False,100.0,customer_name,11,1011,Emily Davis,emily.davis@mail.com,(555) 456-7890,321 Elm Drive,Basic Set,29.99,2024-03-01
|
||||||
|
5,True,81.02,customer_name,12,1012,Robert Brown,r.brown@email.net,(555) 567-8901,654 Cedar Lane,Premium Kit,199.99,2024-03-15
|
||||||
|
5,False,81.02,customer_name,13,1013,Bob Brown,r.brown@email.net,555-567-8901,654 Cedar Ln,Premium Kit,199.99,2024-03-15
|
||||||
|
5,False,81.02,customer_name,14,1014,Robert J. Brown Jr.,rj.brown@email.net,5555678901,654 Cedar Lane,Premium Kit,,2024-03-17
|
||||||
|
6,True,100.0,customer_name,15,1015,Jennifer Wilson,jen.wilson@web.io,(555) 678-9012,987 Birch Court,Starter Pack,19.99,2024-04-01
|
||||||
|
6,False,100.0,customer_name,16,1016,Jennifer Wilson,jen.wilson@web.io,(555) 678-9012,987 Birch Ct,Starter Pack,19.99,2024-04-01
|
||||||
|
7,True,100.0,customer_name,17,1017,David Martinez,david.m@service.com,(555) 789-0123,246 Maple Road,Deluxe Widget,79.99,2024-04-15
|
||||||
|
7,False,100.0,customer_name,18,1018,David Martinez,d.martinez@service.com,555-789-0123,246 Maple Rd,Deluxe Widget,79.99,2024-04-15
|
||||||
|
8,True,97.33,customer_name,19,1019,Lisa Anderson,lisa.a@domain.com,(555) 890-1234,135 Walnut Way,Mini Tool,14.99,2024-05-01
|
||||||
|
8,False,97.33,customer_name,20,1020,Lisa M Anderson,lisa.a@domain.com,555-890-1234,135 Walnut Way,Mini Tool,14.99,2024-05-01
|
||||||
|
9,True,90.08,customer_name,21,1021,Thomas Taylor,tom.taylor@inbox.com,(555) 901-2345,864 Spruce Trail,Combo Pack,59.99,2024-05-15
|
||||||
|
9,False,90.08,customer_name,22,1022,Tom Taylor,tom.taylor@inbox.com,555-901-2345,864 Spruce Trl,Combo Pack,59.99,2024-05-15
|
||||||
|
10,True,100.0,customer_name,24,1024,Karen White,karen.w@test.com,(555) 012-3456,579 Ash Circle,Widget Pro,49.99,2024-06-01
|
||||||
|
10,False,100.0,customer_name,25,1025,Karen White,karen.w+newsletter@test.com,555-012-3456,579 Ash Cir,Widget Pro,49.99,2024-06-01
|
||||||
|
11,True,100.0,customer_name,26,1026,James Clark,j.clark@biz.co,(555) 111-2222,111 First Street North,Enterprise License,499.99,2024-06-15
|
||||||
|
11,False,100.0,customer_name,27,1027,James Clark,j.clark@biz.co,555-111-2222,111 1st St N,Enterprise License,499.99,2024-06-15
|
||||||
|
12,True,90.28,customer_name,28,1028,Patricia Lee,pat.lee@mail.com,(555) 222-3333,222 Second Avenue South,Basic Set,29.99,2024-07-01
|
||||||
|
12,False,90.28,customer_name,29,1029,Pat Lee,pat.lee@mail.com,555-222-3333,222 2nd Ave S,Basic Set,29.99,2024-07-01
|
||||||
|
13,True,92.5,customer_name,30,1030,Christopher Hall,chris.hall@email.org,(555) 333-4444,333 Third Boulevard East,Gadget X,89.99,2024-07-15
|
||||||
|
13,False,92.5,customer_name,31,1031,Chris Hall,chris.hall@email.org,555-333-4444,333 3rd Blvd E,Gadget X,89.99,2024-07-15
|
||||||
|
14,True,100.0,customer_name,32,1032,Amanda Young,amanda.y@web.net,(555) 444-5555,444 Oak Parkway,Super Tool,129.99,2024-08-01
|
||||||
|
14,False,100.0,customer_name,33,1033,Amanda Young,amanda.y@web.net,(555) 444-5555,444 Oak Pkwy,Super Tool,129.99,2024-08-01
|
||||||
|
15,True,90.72,customer_name,34,1034,Daniel King,dan.king@corp.io,(555) 555-6666,555 Elm Square,Premium Kit,199.99,2024-08-15
|
||||||
|
15,False,90.72,customer_name,35,1035,Dan King,dan.king@corp.io,555.555.6666,555 Elm Sq,Premium Kit,199.99,2024-08-15
|
||||||
|
16,True,100.0,customer_name,36,1036,Michelle Scott,m.scott@place.com,(555) 666-7777,666 Pine Highway,Deluxe Widget,79.99,2024-09-01
|
||||||
|
16,False,100.0,customer_name,37,1037,Michelle Scott,m.scott@place.com,(555) 666-7777,666 Pine Hwy,Deluxe Widget,79.99,2024-09-01
|
||||||
|
17,True,100.0,customer_name,38,1038,Kevin Adams,kevin.a@shop.biz,(555) 777-8888,777 Cedar Terrace,Starter Pack,19.99,2024-09-15
|
||||||
|
17,False,100.0,customer_name,39,1039,Kevin Adams,kevin.a+deals@shop.biz,555-777-8888,777 Cedar Ter,Starter Pack,19.99,2024-09-15
|
||||||
|
18,True,100.0,customer_name,40,1040,Nancy Wright,nancy.w@home.net,(555) 888-9999,888 Maple Place,Mini Tool,14.99,2024-10-01
|
||||||
|
18,False,100.0,customer_name,41,1041,Nancy Wright,nancy.w@home.net,(555) 888-9999,888 Maple Pl,Mini Tool,14.99,2024-10-01
|
||||||
|
19,True,100.0,customer_name,42,1042,George Lopez,g.lopez@firma.com,(555) 999-0000,999 Walnut Expressway,Combo Pack,59.99,2024-10-15
|
||||||
|
19,False,100.0,customer_name,43,1043,George Lopez,g.lopez@firma.com,555-999-0000,999 Walnut Expy,Combo Pack,59.99,2024-10-15
|
||||||
|
20,True,100.0,customer_name,44,1044,Sandra Hill,s.hill@provider.co,(555) 101-2020,1010 Spruce Crossing,Widget Pro,49.99,2024-11-01
|
||||||
|
20,False,100.0,customer_name,45,1045,Sandra Hill,s.hill@provider.co,555-101-2020,1010 Spruce Xing,Widget Pro,49.99,2024-11-01
|
||||||
|
21,True,100.0,customer_name,46,1046,Mark Robinson,mark.r@office.com,(555) 202-3030,2020 Ash Heights,Enterprise License,499.99,2024-11-15
|
||||||
|
21,False,100.0,customer_name,47,1047,Mark Robinson,mark.r@office.com,555-202-3030,2020 Ash Hts,Enterprise License,499.99,2024-11-15
|
||||||
|
22,True,93.68,customer_name,48,1048,Unique Customer One,unique1@solo.com,(555) 000-0001,1 Unique Road,Widget Pro,49.99,2024-12-01
|
||||||
|
22,False,93.68,customer_name,49,1049,Unique Customer Two,unique2@solo.com,(555) 000-0002,2 Unique Road,Gadget X,89.99,2024-12-02
|
||||||
|
22,False,93.68,customer_name,50,1050,Unique Customer Three,unique3@solo.com,(555) 000-0003,3 Unique Road,Super Tool,129.99,2024-12-03
|
||||||
|
28
samples/messy_sales_removed.csv
Normal file
28
samples/messy_sales_removed.csv
Normal file
@@ -0,0 +1,28 @@
|
|||||||
|
order_id,customer_name,email,phone,address,product,amount,date
|
||||||
|
1002,John Smith,johnsmith@gmail.com,555-123-4567,123 Main St Apt 4,Widget Pro,49.99,2024-01-15
|
||||||
|
1003,Mr. John Smith,john.smith+promo@gmail.com,+1 555-123-4567,123 Main St. Apartment 4,Widget Pro,,2024-01-15
|
||||||
|
1005,Sara Johnson,sarah.j@example.com,555.234.5678,456 Oak Ave,Gadget X,89.99,2024-02-03
|
||||||
|
1006,Sarah M. Johnson,sarahj@example.com,5552345678,456 Oak Ave Suite 2,,89.99,2024-02-05
|
||||||
|
1008,Mike Williams,mike.w@company.org,555-345-6789,789 Pine Blvd,Super Tool,129.99,2024-02-10
|
||||||
|
1009,Dr. Michael Williams III,michael.williams@company.org,+15553456789,789 Pine Blvd,,129.99,2024-02-12
|
||||||
|
1011,Emily Davis,emily.davis@mail.com,(555) 456-7890,321 Elm Drive,Basic Set,29.99,2024-03-01
|
||||||
|
1013,Bob Brown,r.brown@email.net,555-567-8901,654 Cedar Ln,Premium Kit,199.99,2024-03-15
|
||||||
|
1014,Robert J. Brown Jr.,rj.brown@email.net,5555678901,654 Cedar Lane,Premium Kit,,2024-03-17
|
||||||
|
1016,Jennifer Wilson,jen.wilson@web.io,(555) 678-9012,987 Birch Ct,Starter Pack,19.99,2024-04-01
|
||||||
|
1018,David Martinez,d.martinez@service.com,555-789-0123,246 Maple Rd,Deluxe Widget,79.99,2024-04-15
|
||||||
|
1020,Lisa M Anderson,lisa.a@domain.com,555-890-1234,135 Walnut Way,Mini Tool,14.99,2024-05-01
|
||||||
|
1022,Tom Taylor,tom.taylor@inbox.com,555-901-2345,864 Spruce Trl,Combo Pack,59.99,2024-05-15
|
||||||
|
1025,Karen White,karen.w+newsletter@test.com,555-012-3456,579 Ash Cir,Widget Pro,49.99,2024-06-01
|
||||||
|
1027,James Clark,j.clark@biz.co,555-111-2222,111 1st St N,Enterprise License,499.99,2024-06-15
|
||||||
|
1029,Pat Lee,pat.lee@mail.com,555-222-3333,222 2nd Ave S,Basic Set,29.99,2024-07-01
|
||||||
|
1031,Chris Hall,chris.hall@email.org,555-333-4444,333 3rd Blvd E,Gadget X,89.99,2024-07-15
|
||||||
|
1033,Amanda Young,amanda.y@web.net,(555) 444-5555,444 Oak Pkwy,Super Tool,129.99,2024-08-01
|
||||||
|
1035,Dan King,dan.king@corp.io,555.555.6666,555 Elm Sq,Premium Kit,199.99,2024-08-15
|
||||||
|
1037,Michelle Scott,m.scott@place.com,(555) 666-7777,666 Pine Hwy,Deluxe Widget,79.99,2024-09-01
|
||||||
|
1039,Kevin Adams,kevin.a+deals@shop.biz,555-777-8888,777 Cedar Ter,Starter Pack,19.99,2024-09-15
|
||||||
|
1041,Nancy Wright,nancy.w@home.net,(555) 888-9999,888 Maple Pl,Mini Tool,14.99,2024-10-01
|
||||||
|
1043,George Lopez,g.lopez@firma.com,555-999-0000,999 Walnut Expy,Combo Pack,59.99,2024-10-15
|
||||||
|
1045,Sandra Hill,s.hill@provider.co,555-101-2020,1010 Spruce Xing,Widget Pro,49.99,2024-11-01
|
||||||
|
1047,Mark Robinson,mark.r@office.com,555-202-3030,2020 Ash Hts,Enterprise License,499.99,2024-11-15
|
||||||
|
1049,Unique Customer Two,unique2@solo.com,(555) 000-0002,2 Unique Road,Gadget X,89.99,2024-12-02
|
||||||
|
1050,Unique Customer Three,unique3@solo.com,(555) 000-0003,3 Unique Road,Super Tool,129.99,2024-12-03
|
||||||
|
0
src/__init__.py
Normal file
0
src/__init__.py
Normal file
4
src/__main__.py
Normal file
4
src/__main__.py
Normal file
@@ -0,0 +1,4 @@
|
|||||||
|
"""Allow running as ``python -m src``."""
|
||||||
|
from src.cli import main
|
||||||
|
|
||||||
|
main()
|
||||||
502
src/cli.py
Normal file
502
src/cli.py
Normal file
@@ -0,0 +1,502 @@
|
|||||||
|
"""CLI for the DataTools deduplicator.
|
||||||
|
|
||||||
|
Usage:
|
||||||
|
python -m src.cli input.csv # dry-run preview
|
||||||
|
python -m src.cli input.csv --apply # write deduplicated output
|
||||||
|
python -m src.cli input.csv --fuzzy name --merge # fuzzy match + merge
|
||||||
|
python -m src.cli --help # full help
|
||||||
|
"""
|
||||||
|
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import sys
|
||||||
|
from datetime import datetime
|
||||||
|
from pathlib import Path
|
||||||
|
from typing import Optional
|
||||||
|
|
||||||
|
import typer
|
||||||
|
from loguru import logger
|
||||||
|
from rapidfuzz import process as rf_process
|
||||||
|
|
||||||
|
app = typer.Typer(
|
||||||
|
name="dedup",
|
||||||
|
help=(
|
||||||
|
"Find and remove duplicate rows in CSV and Excel files.\n\n"
|
||||||
|
"By default, runs in preview mode — shows what would change without "
|
||||||
|
"modifying anything. Add --apply to write the output.\n\n"
|
||||||
|
"Examples:\n\n"
|
||||||
|
" # Preview duplicates in a CSV file\n"
|
||||||
|
" python -m src.cli customers.csv\n\n"
|
||||||
|
" # Remove duplicates and save the result\n"
|
||||||
|
" python -m src.cli customers.csv --apply\n\n"
|
||||||
|
" # Fuzzy-match on the 'name' column with 80% threshold\n"
|
||||||
|
" python -m src.cli customers.csv --fuzzy name --threshold 80 --apply\n\n"
|
||||||
|
" # Match on specific columns only\n"
|
||||||
|
" python -m src.cli customers.csv --subset email,phone --apply\n\n"
|
||||||
|
" # Keep the most complete row and merge missing fields\n"
|
||||||
|
" python -m src.cli customers.csv --survivor most-complete --merge --apply\n"
|
||||||
|
),
|
||||||
|
add_completion=False,
|
||||||
|
no_args_is_help=True,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Helpers
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
def _setup_logging(log_dir: Path) -> Path:
|
||||||
|
"""Configure loguru to write a timestamped log file. Returns the log path."""
|
||||||
|
log_dir.mkdir(parents=True, exist_ok=True)
|
||||||
|
ts = datetime.now().strftime("%Y%m%d_%H%M%S")
|
||||||
|
log_path = log_dir / f"dedup_{ts}.log"
|
||||||
|
logger.remove() # remove default stderr handler
|
||||||
|
logger.add(sys.stderr, level="WARNING", format="{message}")
|
||||||
|
logger.add(str(log_path), level="DEBUG",
|
||||||
|
format="{time:YYYY-MM-DD HH:mm:ss} | {level:<8} | {message}")
|
||||||
|
return log_path
|
||||||
|
|
||||||
|
|
||||||
|
def _suggest_column(name: str, available: list[str]) -> str:
|
||||||
|
"""Return a helpful error message when a column is not found."""
|
||||||
|
cols_str = ", ".join(available)
|
||||||
|
matches = rf_process.extract(name, available, limit=1, score_cutoff=50)
|
||||||
|
if matches:
|
||||||
|
suggestion = matches[0][0]
|
||||||
|
return (
|
||||||
|
f"Column '{name}' not found. "
|
||||||
|
f"Available columns: {cols_str}. "
|
||||||
|
f"Did you mean '{suggestion}'?"
|
||||||
|
)
|
||||||
|
return f"Column '{name}' not found. Available columns: {cols_str}."
|
||||||
|
|
||||||
|
|
||||||
|
def _validate_columns(requested: list[str], available: list[str]) -> None:
|
||||||
|
"""Raise typer.BadParameter if any requested column doesn't exist."""
|
||||||
|
for col in requested:
|
||||||
|
if col not in available:
|
||||||
|
raise typer.BadParameter(_suggest_column(col, available))
|
||||||
|
|
||||||
|
|
||||||
|
def _parse_normalize_map(raw: Optional[str]) -> dict[str, str]:
|
||||||
|
"""Parse 'col:type,col:type' into a dict."""
|
||||||
|
if not raw:
|
||||||
|
return {}
|
||||||
|
result = {}
|
||||||
|
for pair in raw.split(","):
|
||||||
|
pair = pair.strip()
|
||||||
|
if ":" not in pair:
|
||||||
|
raise typer.BadParameter(
|
||||||
|
f"Invalid normalize format: '{pair}'. "
|
||||||
|
f"Expected 'column:type' (e.g., 'email:email,phone:phone')."
|
||||||
|
)
|
||||||
|
col, ntype = pair.split(":", 1)
|
||||||
|
result[col.strip()] = ntype.strip()
|
||||||
|
return result
|
||||||
|
|
||||||
|
|
||||||
|
def _interactive_review(group, df) -> Optional[bool]:
|
||||||
|
"""Side-by-side CLI review for a match group. Returns True/False/None."""
|
||||||
|
from src.core.dedup import MatchResult
|
||||||
|
group: MatchResult
|
||||||
|
|
||||||
|
print(f"\n{'='*60}")
|
||||||
|
print(f"Match Group {group.group_id + 1} — Confidence: {group.confidence:.1f}%")
|
||||||
|
print(f"Matched on: {', '.join(group.matched_on)}")
|
||||||
|
print(f"{'='*60}")
|
||||||
|
|
||||||
|
display_cols = [c for c in df.columns if not str(c).startswith("_norm_")]
|
||||||
|
for idx in group.row_indices:
|
||||||
|
print(f"\n Row {idx + 1}:")
|
||||||
|
for col in display_cols:
|
||||||
|
val = df.iloc[idx].get(col, "")
|
||||||
|
if str(val).strip():
|
||||||
|
print(f" {col}: {val}")
|
||||||
|
|
||||||
|
while True:
|
||||||
|
choice = input("\n [y] Merge [n] Keep both [s] Skip remaining: ").strip().lower()
|
||||||
|
if choice == "y":
|
||||||
|
return True
|
||||||
|
if choice == "n":
|
||||||
|
return False
|
||||||
|
if choice == "s":
|
||||||
|
return None
|
||||||
|
print(" Please enter y, n, or s.")
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Main command
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
@app.command()
|
||||||
|
def dedup(
|
||||||
|
input_file: str = typer.Argument(
|
||||||
|
...,
|
||||||
|
help="Path to the CSV or Excel file to deduplicate.",
|
||||||
|
),
|
||||||
|
output: Optional[str] = typer.Option(
|
||||||
|
None, "--output", "-o",
|
||||||
|
help="Output file path. Default: {input}_deduplicated.csv",
|
||||||
|
),
|
||||||
|
apply: bool = typer.Option(
|
||||||
|
False, "--apply",
|
||||||
|
help="Write the output file. Without this flag, only a preview is shown.",
|
||||||
|
),
|
||||||
|
key: Optional[str] = typer.Option(
|
||||||
|
None, "--key", "-k",
|
||||||
|
help="Comma-separated strong-key columns (e.g., 'fb_id,ein'). Each is an independent exact-match dedup key.",
|
||||||
|
),
|
||||||
|
subset: Optional[str] = typer.Option(
|
||||||
|
None, "--subset", "-s",
|
||||||
|
help="Comma-separated columns to match on (default: auto-detect).",
|
||||||
|
),
|
||||||
|
fuzzy: Optional[str] = typer.Option(
|
||||||
|
None, "--fuzzy",
|
||||||
|
help="Comma-separated columns to fuzzy-match (others use exact match).",
|
||||||
|
),
|
||||||
|
algorithm: str = typer.Option(
|
||||||
|
"jaro_winkler", "--algorithm", "-a",
|
||||||
|
help="Fuzzy algorithm: levenshtein, jaro_winkler, or token_set_ratio.",
|
||||||
|
),
|
||||||
|
threshold: int = typer.Option(
|
||||||
|
85, "--threshold", "-t",
|
||||||
|
help="Similarity threshold 0-100 for fuzzy matching.",
|
||||||
|
),
|
||||||
|
normalize: Optional[str] = typer.Option(
|
||||||
|
None, "--normalize",
|
||||||
|
help="Column normalizers as 'col:type' pairs (e.g., 'email:email,phone:phone').",
|
||||||
|
),
|
||||||
|
survivor: str = typer.Option(
|
||||||
|
"first", "--survivor",
|
||||||
|
help="Survivor rule: first, last, most-complete, or most-recent.",
|
||||||
|
),
|
||||||
|
date_column: Optional[str] = typer.Option(
|
||||||
|
None, "--date-column",
|
||||||
|
help="Date column for most-recent survivor rule.",
|
||||||
|
),
|
||||||
|
merge: bool = typer.Option(
|
||||||
|
False, "--merge",
|
||||||
|
help="Fill missing fields in the surviving row from removed duplicates.",
|
||||||
|
),
|
||||||
|
review: bool = typer.Option(
|
||||||
|
False, "--review",
|
||||||
|
help="Interactively review each match group before merging.",
|
||||||
|
),
|
||||||
|
config: Optional[str] = typer.Option(
|
||||||
|
None, "--config",
|
||||||
|
help="Load settings from a saved JSON config file.",
|
||||||
|
),
|
||||||
|
save_config: Optional[str] = typer.Option(
|
||||||
|
None, "--save-config",
|
||||||
|
help="Save current settings to a JSON config file.",
|
||||||
|
),
|
||||||
|
sheet: Optional[str] = typer.Option(
|
||||||
|
None, "--sheet",
|
||||||
|
help="Excel sheet name or index (default: first sheet).",
|
||||||
|
),
|
||||||
|
encoding_override: Optional[str] = typer.Option(
|
||||||
|
None, "--encoding",
|
||||||
|
help="Override auto-detected file encoding.",
|
||||||
|
),
|
||||||
|
header_row: Optional[int] = typer.Option(
|
||||||
|
None, "--header-row",
|
||||||
|
help="0-based row index for the header (default: auto-detect).",
|
||||||
|
),
|
||||||
|
):
|
||||||
|
"""Find and remove duplicate rows in CSV and Excel files."""
|
||||||
|
from src.core.io import read_file, write_file, list_sheets
|
||||||
|
from src.core.dedup import (
|
||||||
|
Algorithm, ColumnMatchStrategy, MatchStrategy, SurvivorRule,
|
||||||
|
build_default_strategies, deduplicate,
|
||||||
|
)
|
||||||
|
from src.core.normalizers import NormalizerType
|
||||||
|
from src.core.config import DeduplicationConfig
|
||||||
|
|
||||||
|
# Setup
|
||||||
|
input_path = Path(input_file)
|
||||||
|
if not input_path.exists():
|
||||||
|
typer.echo(f"Error: File not found: {input_path}", err=True)
|
||||||
|
raise typer.Exit(1)
|
||||||
|
|
||||||
|
log_path = _setup_logging(Path("logs"))
|
||||||
|
|
||||||
|
# Load config if provided
|
||||||
|
cfg: Optional[DeduplicationConfig] = None
|
||||||
|
if config:
|
||||||
|
config_path = Path(config)
|
||||||
|
if not config_path.exists():
|
||||||
|
typer.echo(f"Error: Config file not found: {config_path}", err=True)
|
||||||
|
raise typer.Exit(1)
|
||||||
|
cfg = DeduplicationConfig.from_file(config_path)
|
||||||
|
logger.info("Loaded config from {}", config_path)
|
||||||
|
|
||||||
|
# Read input
|
||||||
|
typer.echo(f"Reading {input_path.name}...")
|
||||||
|
try:
|
||||||
|
sheet_arg: str | int | None = None
|
||||||
|
if sheet is not None:
|
||||||
|
try:
|
||||||
|
sheet_arg = int(sheet)
|
||||||
|
except ValueError:
|
||||||
|
sheet_arg = sheet
|
||||||
|
|
||||||
|
df = read_file(
|
||||||
|
input_path,
|
||||||
|
encoding=encoding_override,
|
||||||
|
header_row=header_row,
|
||||||
|
sheet_name=sheet_arg if sheet_arg is not None else 0,
|
||||||
|
)
|
||||||
|
if not isinstance(df, __import__("pandas").DataFrame):
|
||||||
|
# chunked reading returns generator — materialise for v1
|
||||||
|
import pandas as pd
|
||||||
|
df = pd.concat(list(df), ignore_index=True)
|
||||||
|
except Exception as e:
|
||||||
|
typer.echo(f"Error reading file: {e}", err=True)
|
||||||
|
raise typer.Exit(1)
|
||||||
|
|
||||||
|
typer.echo(f" {len(df)} rows, {len(df.columns)} columns")
|
||||||
|
available_columns = list(df.columns)
|
||||||
|
|
||||||
|
# Build strategies
|
||||||
|
strategies: Optional[list[MatchStrategy]] = None
|
||||||
|
|
||||||
|
if cfg and cfg.strategies:
|
||||||
|
strategies = cfg.to_strategies()
|
||||||
|
elif subset or fuzzy:
|
||||||
|
# Build from CLI flags
|
||||||
|
normalize_map = _parse_normalize_map(normalize)
|
||||||
|
strategies = []
|
||||||
|
|
||||||
|
fuzzy_cols = set(c.strip() for c in fuzzy.split(",")) if fuzzy else set()
|
||||||
|
if subset:
|
||||||
|
subset_cols = [c.strip() for c in subset.split(",")]
|
||||||
|
elif fuzzy_cols:
|
||||||
|
# When only --fuzzy is given, match on just those columns
|
||||||
|
subset_cols = list(fuzzy_cols)
|
||||||
|
else:
|
||||||
|
subset_cols = available_columns
|
||||||
|
|
||||||
|
_validate_columns(subset_cols, available_columns)
|
||||||
|
if fuzzy_cols:
|
||||||
|
_validate_columns(list(fuzzy_cols), available_columns)
|
||||||
|
|
||||||
|
col_strats: list[ColumnMatchStrategy] = []
|
||||||
|
for col in subset_cols:
|
||||||
|
norm = None
|
||||||
|
if col in normalize_map:
|
||||||
|
norm = NormalizerType(normalize_map[col])
|
||||||
|
|
||||||
|
if col in fuzzy_cols:
|
||||||
|
algo = Algorithm(algorithm)
|
||||||
|
thresh = float(threshold)
|
||||||
|
else:
|
||||||
|
algo = Algorithm.EXACT
|
||||||
|
thresh = 100.0
|
||||||
|
|
||||||
|
col_strats.append(ColumnMatchStrategy(
|
||||||
|
column=col, algorithm=algo, threshold=thresh, normalizer=norm,
|
||||||
|
))
|
||||||
|
|
||||||
|
strategies = [MatchStrategy(column_strategies=col_strats)]
|
||||||
|
|
||||||
|
# Apply normalizer overrides even with auto-detect
|
||||||
|
if normalize and strategies is None:
|
||||||
|
normalize_map = _parse_normalize_map(normalize)
|
||||||
|
auto_strats = build_default_strategies(df)
|
||||||
|
# Inject normalize_map into auto strategies
|
||||||
|
for strat in auto_strats:
|
||||||
|
for cs in strat.column_strategies:
|
||||||
|
if cs.column in normalize_map:
|
||||||
|
cs.normalizer = NormalizerType(normalize_map[cs.column])
|
||||||
|
strategies = auto_strats
|
||||||
|
|
||||||
|
# --key: add user-declared strong keys as standalone exact-match strategies
|
||||||
|
if key:
|
||||||
|
key_cols = [c.strip() for c in key.split(",")]
|
||||||
|
_validate_columns(key_cols, available_columns)
|
||||||
|
key_strats = [
|
||||||
|
MatchStrategy(column_strategies=[
|
||||||
|
ColumnMatchStrategy(column=col, algorithm=Algorithm.EXACT, threshold=100.0)
|
||||||
|
])
|
||||||
|
for col in key_cols
|
||||||
|
]
|
||||||
|
if strategies is None:
|
||||||
|
# Combine with auto-detect so user gets both
|
||||||
|
strategies = build_default_strategies(df) + key_strats
|
||||||
|
else:
|
||||||
|
strategies.extend(key_strats)
|
||||||
|
|
||||||
|
# Survivor rule
|
||||||
|
survivor_map = {
|
||||||
|
"first": SurvivorRule.KEEP_FIRST,
|
||||||
|
"last": SurvivorRule.KEEP_LAST,
|
||||||
|
"most-complete": SurvivorRule.KEEP_MOST_COMPLETE,
|
||||||
|
"most_complete": SurvivorRule.KEEP_MOST_COMPLETE,
|
||||||
|
"most-recent": SurvivorRule.KEEP_MOST_RECENT,
|
||||||
|
"most_recent": SurvivorRule.KEEP_MOST_RECENT,
|
||||||
|
}
|
||||||
|
if cfg:
|
||||||
|
surv_rule = cfg.to_survivor_rule()
|
||||||
|
do_merge = cfg.merge
|
||||||
|
dc = cfg.date_column
|
||||||
|
else:
|
||||||
|
surv_key = survivor.lower().replace("-", "_")
|
||||||
|
if surv_key not in {r.value for r in SurvivorRule} and surv_key not in survivor_map:
|
||||||
|
typer.echo(
|
||||||
|
f"Error: Unknown survivor rule '{survivor}'. "
|
||||||
|
f"Choose from: first, last, most-complete, most-recent.",
|
||||||
|
err=True,
|
||||||
|
)
|
||||||
|
raise typer.Exit(1)
|
||||||
|
surv_rule = survivor_map.get(survivor.lower(), SurvivorRule(surv_key))
|
||||||
|
do_merge = merge
|
||||||
|
dc = date_column
|
||||||
|
|
||||||
|
# Save config if requested
|
||||||
|
if save_config:
|
||||||
|
from src.core.config import DeduplicationConfig, StrategyConfig, ColumnStrategyConfig
|
||||||
|
save_cfg = DeduplicationConfig(
|
||||||
|
survivor_rule=surv_rule.value,
|
||||||
|
date_column=dc,
|
||||||
|
merge=do_merge,
|
||||||
|
subset_columns=[c.strip() for c in subset.split(",")] if subset else None,
|
||||||
|
fuzzy_columns=[c.strip() for c in fuzzy.split(",")] if fuzzy else None,
|
||||||
|
default_algorithm=algorithm,
|
||||||
|
default_threshold=float(threshold),
|
||||||
|
normalize_map=_parse_normalize_map(normalize),
|
||||||
|
)
|
||||||
|
if strategies:
|
||||||
|
save_cfg.strategies = [
|
||||||
|
StrategyConfig(columns=[
|
||||||
|
ColumnStrategyConfig(
|
||||||
|
column=cs.column,
|
||||||
|
algorithm=cs.algorithm.value,
|
||||||
|
threshold=cs.threshold,
|
||||||
|
normalizer=cs.normalizer.value if cs.normalizer else None,
|
||||||
|
)
|
||||||
|
for cs in s.column_strategies
|
||||||
|
])
|
||||||
|
for s in strategies
|
||||||
|
]
|
||||||
|
saved = save_cfg.to_file(save_config)
|
||||||
|
typer.echo(f"Config saved to {saved}")
|
||||||
|
|
||||||
|
# Progress bar
|
||||||
|
progress_cb = None
|
||||||
|
if len(df) > 10_000:
|
||||||
|
from tqdm import tqdm
|
||||||
|
pbar = tqdm(total=len(df) * (len(df) - 1) // 2, desc="Comparing rows",
|
||||||
|
unit="pairs", leave=False)
|
||||||
|
|
||||||
|
def _progress(current: int, total: int):
|
||||||
|
pbar.update(current - pbar.n)
|
||||||
|
if current >= total:
|
||||||
|
pbar.close()
|
||||||
|
|
||||||
|
progress_cb = _progress
|
||||||
|
|
||||||
|
# Review callback
|
||||||
|
review_cb = _interactive_review if review else None
|
||||||
|
|
||||||
|
# Run dedup
|
||||||
|
typer.echo("Finding duplicates...")
|
||||||
|
result = deduplicate(
|
||||||
|
df,
|
||||||
|
strategies=strategies,
|
||||||
|
survivor_rule=surv_rule,
|
||||||
|
date_column=dc,
|
||||||
|
merge=do_merge,
|
||||||
|
preview=not apply,
|
||||||
|
review_callback=review_cb,
|
||||||
|
progress_callback=progress_cb,
|
||||||
|
)
|
||||||
|
|
||||||
|
# Print results
|
||||||
|
_print_results(result, input_path)
|
||||||
|
|
||||||
|
# Write output files
|
||||||
|
if apply:
|
||||||
|
stem = input_path.stem
|
||||||
|
suffix = input_path.suffix
|
||||||
|
|
||||||
|
out_path = Path(output) if output else input_path.parent / f"{stem}_deduplicated.csv"
|
||||||
|
write_file(result.deduplicated_df, out_path)
|
||||||
|
typer.echo(f"\nDeduplicated file: {out_path}")
|
||||||
|
|
||||||
|
if not result.removed_df.empty:
|
||||||
|
removed_path = input_path.parent / f"{stem}_removed.csv"
|
||||||
|
write_file(result.removed_df, removed_path)
|
||||||
|
typer.echo(f"Removed rows: {removed_path}")
|
||||||
|
|
||||||
|
if result.match_groups:
|
||||||
|
groups_path = input_path.parent / f"{stem}_match_groups.csv"
|
||||||
|
_write_match_groups(result, df, groups_path)
|
||||||
|
typer.echo(f"Match groups: {groups_path}")
|
||||||
|
else:
|
||||||
|
typer.echo("\nThis was a preview. Add --apply to write the output files.")
|
||||||
|
|
||||||
|
typer.echo(f"Log: {log_path}")
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Output formatting
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
def _print_results(result, input_path: Path) -> None:
|
||||||
|
"""Print a human-readable summary."""
|
||||||
|
removed = result.original_row_count - len(result.deduplicated_df)
|
||||||
|
typer.echo(f"\n{'─'*50}")
|
||||||
|
typer.echo(f" File: {input_path.name}")
|
||||||
|
typer.echo(f" Rows in: {result.original_row_count}")
|
||||||
|
typer.echo(f" Rows out: {len(result.deduplicated_df)}")
|
||||||
|
typer.echo(f" Removed: {removed}")
|
||||||
|
typer.echo(f" Groups: {len(result.match_groups)}")
|
||||||
|
typer.echo(f"{'─'*50}")
|
||||||
|
|
||||||
|
if result.match_groups:
|
||||||
|
typer.echo("\nMatch groups:")
|
||||||
|
for g in result.match_groups[:20]: # cap display
|
||||||
|
rows_str = ", ".join(str(i + 1) for i in g.row_indices)
|
||||||
|
surv = g.survivor_index + 1
|
||||||
|
typer.echo(
|
||||||
|
f" Group {g.group_id + 1}: rows [{rows_str}] "
|
||||||
|
f"→ keep row {surv} "
|
||||||
|
f"(confidence: {g.confidence:.1f}%, "
|
||||||
|
f"matched on: {', '.join(g.matched_on)})"
|
||||||
|
)
|
||||||
|
if len(result.match_groups) > 20:
|
||||||
|
typer.echo(f" ... and {len(result.match_groups) - 20} more groups")
|
||||||
|
|
||||||
|
|
||||||
|
def _write_match_groups(result, original_df, path: Path) -> None:
|
||||||
|
"""Write match groups to a CSV for audit."""
|
||||||
|
import pandas as pd
|
||||||
|
from src.core.io import write_file
|
||||||
|
|
||||||
|
rows = []
|
||||||
|
for g in result.match_groups:
|
||||||
|
for idx in g.row_indices:
|
||||||
|
row_data = {"_group_id": g.group_id + 1}
|
||||||
|
row_data["_is_survivor"] = idx == g.survivor_index
|
||||||
|
row_data["_confidence"] = g.confidence
|
||||||
|
row_data["_matched_on"] = ", ".join(g.matched_on)
|
||||||
|
row_data["_original_row"] = idx + 1
|
||||||
|
# Include original data
|
||||||
|
for col in original_df.columns:
|
||||||
|
row_data[col] = original_df.iloc[idx].get(col, "")
|
||||||
|
rows.append(row_data)
|
||||||
|
|
||||||
|
groups_df = pd.DataFrame(rows)
|
||||||
|
write_file(groups_df, path)
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# __main__ support
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
def main():
|
||||||
|
app()
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
main()
|
||||||
93
src/core/__init__.py
Normal file
93
src/core/__init__.py
Normal file
@@ -0,0 +1,93 @@
|
|||||||
|
"""DataTools deduplication engine.
|
||||||
|
|
||||||
|
Public API
|
||||||
|
----------
|
||||||
|
Core:
|
||||||
|
deduplicate(df, ...) -> DeduplicationResult
|
||||||
|
build_default_strategies(df) -> list[MatchStrategy]
|
||||||
|
|
||||||
|
Types:
|
||||||
|
Algorithm, SurvivorRule, ColumnMatchStrategy, MatchStrategy
|
||||||
|
MatchResult, DeduplicationResult
|
||||||
|
|
||||||
|
Normalizers:
|
||||||
|
get_normalizer(type) -> Callable
|
||||||
|
NormalizerType
|
||||||
|
normalize_email, normalize_phone, normalize_name,
|
||||||
|
normalize_address, normalize_string
|
||||||
|
|
||||||
|
I/O:
|
||||||
|
read_file(path, ...) -> DataFrame
|
||||||
|
write_file(df, path, ...)
|
||||||
|
list_sheets(path) -> list[str]
|
||||||
|
detect_encoding, detect_delimiter, detect_header_row
|
||||||
|
|
||||||
|
Configuration:
|
||||||
|
DeduplicationConfig.from_file(path) -> DeduplicationConfig
|
||||||
|
DeduplicationConfig.to_file(path)
|
||||||
|
"""
|
||||||
|
|
||||||
|
from .dedup import (
|
||||||
|
Algorithm,
|
||||||
|
ColumnMatchStrategy,
|
||||||
|
DeduplicationResult,
|
||||||
|
MatchResult,
|
||||||
|
MatchStrategy,
|
||||||
|
SurvivorRule,
|
||||||
|
build_default_strategies,
|
||||||
|
deduplicate,
|
||||||
|
)
|
||||||
|
from .normalizers import (
|
||||||
|
NormalizerType,
|
||||||
|
get_normalizer,
|
||||||
|
normalize_address,
|
||||||
|
normalize_email,
|
||||||
|
normalize_name,
|
||||||
|
normalize_phone,
|
||||||
|
normalize_string,
|
||||||
|
)
|
||||||
|
from .io import (
|
||||||
|
detect_delimiter,
|
||||||
|
detect_encoding,
|
||||||
|
detect_header_row,
|
||||||
|
list_sheets,
|
||||||
|
read_file,
|
||||||
|
write_file,
|
||||||
|
)
|
||||||
|
from .config import (
|
||||||
|
ColumnStrategyConfig,
|
||||||
|
DeduplicationConfig,
|
||||||
|
StrategyConfig,
|
||||||
|
)
|
||||||
|
|
||||||
|
__all__ = [
|
||||||
|
# Core
|
||||||
|
"deduplicate",
|
||||||
|
"build_default_strategies",
|
||||||
|
# Types
|
||||||
|
"Algorithm",
|
||||||
|
"SurvivorRule",
|
||||||
|
"ColumnMatchStrategy",
|
||||||
|
"MatchStrategy",
|
||||||
|
"MatchResult",
|
||||||
|
"DeduplicationResult",
|
||||||
|
# Normalizers
|
||||||
|
"NormalizerType",
|
||||||
|
"get_normalizer",
|
||||||
|
"normalize_email",
|
||||||
|
"normalize_phone",
|
||||||
|
"normalize_name",
|
||||||
|
"normalize_address",
|
||||||
|
"normalize_string",
|
||||||
|
# I/O
|
||||||
|
"read_file",
|
||||||
|
"write_file",
|
||||||
|
"list_sheets",
|
||||||
|
"detect_encoding",
|
||||||
|
"detect_delimiter",
|
||||||
|
"detect_header_row",
|
||||||
|
# Config
|
||||||
|
"DeduplicationConfig",
|
||||||
|
"StrategyConfig",
|
||||||
|
"ColumnStrategyConfig",
|
||||||
|
]
|
||||||
117
src/core/config.py
Normal file
117
src/core/config.py
Normal file
@@ -0,0 +1,117 @@
|
|||||||
|
"""Configuration profiles: save/load deduplication settings as JSON."""
|
||||||
|
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import json
|
||||||
|
from dataclasses import dataclass, field, asdict
|
||||||
|
from pathlib import Path
|
||||||
|
from typing import Optional
|
||||||
|
|
||||||
|
from .dedup import (
|
||||||
|
Algorithm,
|
||||||
|
ColumnMatchStrategy,
|
||||||
|
MatchStrategy,
|
||||||
|
NormalizerType,
|
||||||
|
SurvivorRule,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class ColumnStrategyConfig:
|
||||||
|
"""JSON-serializable mirror of ColumnMatchStrategy."""
|
||||||
|
column: str
|
||||||
|
algorithm: str = "exact"
|
||||||
|
threshold: float = 100.0
|
||||||
|
normalizer: Optional[str] = None
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class StrategyConfig:
|
||||||
|
"""JSON-serializable mirror of MatchStrategy."""
|
||||||
|
columns: list[ColumnStrategyConfig] = field(default_factory=list)
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class DeduplicationConfig:
|
||||||
|
"""All deduplication settings as a flat JSON-serializable structure."""
|
||||||
|
|
||||||
|
strategies: list[StrategyConfig] = field(default_factory=list)
|
||||||
|
survivor_rule: str = "first"
|
||||||
|
date_column: Optional[str] = None
|
||||||
|
merge: bool = False
|
||||||
|
subset_columns: Optional[list[str]] = None
|
||||||
|
fuzzy_columns: Optional[list[str]] = None
|
||||||
|
default_algorithm: str = "jaro_winkler"
|
||||||
|
default_threshold: float = 85.0
|
||||||
|
normalize_map: Optional[dict[str, str]] = None # column -> normalizer type
|
||||||
|
|
||||||
|
# -----------------------------------------------------------------------
|
||||||
|
# Serialisation
|
||||||
|
# -----------------------------------------------------------------------
|
||||||
|
|
||||||
|
def to_dict(self) -> dict:
|
||||||
|
return asdict(self)
|
||||||
|
|
||||||
|
def to_file(self, path: str | Path) -> Path:
|
||||||
|
"""Save configuration to a JSON file."""
|
||||||
|
out = Path(path)
|
||||||
|
out.write_text(json.dumps(self.to_dict(), indent=2))
|
||||||
|
return out
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def from_dict(cls, data: dict) -> DeduplicationConfig:
|
||||||
|
strategies = []
|
||||||
|
for s in data.get("strategies", []):
|
||||||
|
cols = [ColumnStrategyConfig(**c) for c in s.get("columns", [])]
|
||||||
|
strategies.append(StrategyConfig(columns=cols))
|
||||||
|
return cls(
|
||||||
|
strategies=strategies,
|
||||||
|
survivor_rule=data.get("survivor_rule", "first"),
|
||||||
|
date_column=data.get("date_column"),
|
||||||
|
merge=data.get("merge", False),
|
||||||
|
subset_columns=data.get("subset_columns"),
|
||||||
|
fuzzy_columns=data.get("fuzzy_columns"),
|
||||||
|
default_algorithm=data.get("default_algorithm", "jaro_winkler"),
|
||||||
|
default_threshold=data.get("default_threshold", 85.0),
|
||||||
|
normalize_map=data.get("normalize_map"),
|
||||||
|
)
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def from_file(cls, path: str | Path) -> DeduplicationConfig:
|
||||||
|
"""Load configuration from a JSON file."""
|
||||||
|
data = json.loads(Path(path).read_text())
|
||||||
|
return cls.from_dict(data)
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def default(cls) -> DeduplicationConfig:
|
||||||
|
"""Return sensible defaults (auto-detect strategies at runtime)."""
|
||||||
|
return cls()
|
||||||
|
|
||||||
|
# -----------------------------------------------------------------------
|
||||||
|
# Convert to engine objects
|
||||||
|
# -----------------------------------------------------------------------
|
||||||
|
|
||||||
|
def to_strategies(self) -> Optional[list[MatchStrategy]]:
|
||||||
|
"""Convert the config back to MatchStrategy objects.
|
||||||
|
|
||||||
|
Returns None if no explicit strategies are configured
|
||||||
|
(the engine will auto-detect).
|
||||||
|
"""
|
||||||
|
if not self.strategies:
|
||||||
|
return None
|
||||||
|
|
||||||
|
result: list[MatchStrategy] = []
|
||||||
|
for sc in self.strategies:
|
||||||
|
col_strats = []
|
||||||
|
for cc in sc.columns:
|
||||||
|
col_strats.append(ColumnMatchStrategy(
|
||||||
|
column=cc.column,
|
||||||
|
algorithm=Algorithm(cc.algorithm),
|
||||||
|
threshold=cc.threshold,
|
||||||
|
normalizer=NormalizerType(cc.normalizer) if cc.normalizer else None,
|
||||||
|
))
|
||||||
|
result.append(MatchStrategy(column_strategies=col_strats))
|
||||||
|
return result
|
||||||
|
|
||||||
|
def to_survivor_rule(self) -> SurvivorRule:
|
||||||
|
return SurvivorRule(self.survivor_rule)
|
||||||
568
src/core/dedup.py
Normal file
568
src/core/dedup.py
Normal file
@@ -0,0 +1,568 @@
|
|||||||
|
"""Deduplication engine: matching, survivor selection, and merge.
|
||||||
|
|
||||||
|
Core algorithm:
|
||||||
|
1. Normalise columns → shadow ``_norm_*`` columns (computed once).
|
||||||
|
2. Pairwise comparison within each strategy → candidate pairs.
|
||||||
|
3. Union-find for transitive closure (A~B, B~C ⇒ one group).
|
||||||
|
4. Multi-strategy OR: feed all pairs from all strategies into the same union-find.
|
||||||
|
5. Survivor selection per group + optional field merge.
|
||||||
|
"""
|
||||||
|
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import re
|
||||||
|
from dataclasses import dataclass, field
|
||||||
|
from enum import Enum
|
||||||
|
from typing import Callable, Optional
|
||||||
|
|
||||||
|
import pandas as pd
|
||||||
|
from loguru import logger
|
||||||
|
from rapidfuzz import fuzz as rf_fuzz
|
||||||
|
from rapidfuzz import distance as rf_distance
|
||||||
|
|
||||||
|
from .normalizers import NormalizerType, get_normalizer
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Enums & data structures
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
class Algorithm(str, Enum):
|
||||||
|
EXACT = "exact"
|
||||||
|
LEVENSHTEIN = "levenshtein"
|
||||||
|
JARO_WINKLER = "jaro_winkler"
|
||||||
|
TOKEN_SET_RATIO = "token_set_ratio"
|
||||||
|
|
||||||
|
|
||||||
|
class SurvivorRule(str, Enum):
|
||||||
|
KEEP_FIRST = "first"
|
||||||
|
KEEP_LAST = "last"
|
||||||
|
KEEP_MOST_COMPLETE = "most_complete"
|
||||||
|
KEEP_MOST_RECENT = "most_recent"
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class ColumnMatchStrategy:
|
||||||
|
"""How to match on a single column."""
|
||||||
|
column: str
|
||||||
|
algorithm: Algorithm = Algorithm.EXACT
|
||||||
|
threshold: float = 100.0 # 0-100 scale
|
||||||
|
normalizer: Optional[NormalizerType] = None
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class MatchStrategy:
|
||||||
|
"""A set of column strategies combined with AND.
|
||||||
|
|
||||||
|
Multiple ``MatchStrategy`` instances are combined with OR at the top level.
|
||||||
|
"""
|
||||||
|
column_strategies: list[ColumnMatchStrategy]
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class MatchResult:
|
||||||
|
"""One group of duplicate rows."""
|
||||||
|
group_id: int
|
||||||
|
row_indices: list[int]
|
||||||
|
confidence: float # min confidence across pairs in the group
|
||||||
|
matched_on: list[str] # column names that contributed to the match
|
||||||
|
survivor_index: int # index of the row to keep
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class DeduplicationResult:
|
||||||
|
"""Full result of a deduplication run."""
|
||||||
|
original_row_count: int
|
||||||
|
deduplicated_df: pd.DataFrame
|
||||||
|
removed_df: pd.DataFrame
|
||||||
|
match_groups: list[MatchResult]
|
||||||
|
log_entries: list[str] = field(default_factory=list)
|
||||||
|
is_preview: bool = True
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Union-Find
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
class _UnionFind:
|
||||||
|
"""Disjoint-set / union-find for transitive closure of match pairs."""
|
||||||
|
|
||||||
|
def __init__(self, n: int):
|
||||||
|
self._parent = list(range(n))
|
||||||
|
self._rank = [0] * n
|
||||||
|
|
||||||
|
def find(self, x: int) -> int:
|
||||||
|
while self._parent[x] != x:
|
||||||
|
self._parent[x] = self._parent[self._parent[x]] # path halving
|
||||||
|
x = self._parent[x]
|
||||||
|
return x
|
||||||
|
|
||||||
|
def union(self, a: int, b: int) -> None:
|
||||||
|
ra, rb = self.find(a), self.find(b)
|
||||||
|
if ra == rb:
|
||||||
|
return
|
||||||
|
if self._rank[ra] < self._rank[rb]:
|
||||||
|
ra, rb = rb, ra
|
||||||
|
self._parent[rb] = ra
|
||||||
|
if self._rank[ra] == self._rank[rb]:
|
||||||
|
self._rank[ra] += 1
|
||||||
|
|
||||||
|
def groups(self) -> dict[int, list[int]]:
|
||||||
|
"""Return {root: [members]} for all non-singleton groups."""
|
||||||
|
from collections import defaultdict
|
||||||
|
g: dict[int, list[int]] = defaultdict(list)
|
||||||
|
for i in range(len(self._parent)):
|
||||||
|
g[self.find(i)].append(i)
|
||||||
|
return {root: members for root, members in g.items() if len(members) > 1}
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Similarity computation
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
def _compute_similarity(val_a: str, val_b: str, algorithm: Algorithm) -> float:
|
||||||
|
"""Return similarity score on a 0-100 scale."""
|
||||||
|
if algorithm == Algorithm.EXACT:
|
||||||
|
return 100.0 if val_a == val_b else 0.0
|
||||||
|
if algorithm == Algorithm.LEVENSHTEIN:
|
||||||
|
return rf_fuzz.ratio(val_a, val_b)
|
||||||
|
if algorithm == Algorithm.JARO_WINKLER:
|
||||||
|
# rapidfuzz jaro_winkler_similarity returns 0-100
|
||||||
|
return rf_distance.JaroWinkler.similarity(val_a, val_b) * 100
|
||||||
|
if algorithm == Algorithm.TOKEN_SET_RATIO:
|
||||||
|
return rf_fuzz.token_set_ratio(val_a, val_b)
|
||||||
|
raise ValueError(f"Unknown algorithm: {algorithm}")
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Pair comparison
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
def _compare_pair(
|
||||||
|
row_a: pd.Series,
|
||||||
|
row_b: pd.Series,
|
||||||
|
strategy: MatchStrategy,
|
||||||
|
norm_prefix: str = "_norm_",
|
||||||
|
) -> tuple[bool, float, list[str]]:
|
||||||
|
"""Compare two rows using a single MatchStrategy (AND of column strategies).
|
||||||
|
|
||||||
|
Returns ``(is_match, confidence, matched_columns)``.
|
||||||
|
"""
|
||||||
|
min_score = 100.0
|
||||||
|
matched_cols: list[str] = []
|
||||||
|
|
||||||
|
for cs in strategy.column_strategies:
|
||||||
|
col = f"{norm_prefix}{cs.column}" if cs.normalizer else cs.column
|
||||||
|
va = str(row_a.get(col, ""))
|
||||||
|
vb = str(row_b.get(col, ""))
|
||||||
|
|
||||||
|
# Skip if both empty
|
||||||
|
if not va and not vb:
|
||||||
|
continue
|
||||||
|
# If one empty and one not — no match for this column
|
||||||
|
if not va or not vb:
|
||||||
|
return False, 0.0, []
|
||||||
|
|
||||||
|
score = _compute_similarity(va, vb, cs.algorithm)
|
||||||
|
if score < cs.threshold:
|
||||||
|
return False, 0.0, []
|
||||||
|
min_score = min(min_score, score)
|
||||||
|
matched_cols.append(cs.column)
|
||||||
|
|
||||||
|
if not matched_cols:
|
||||||
|
return False, 0.0, []
|
||||||
|
|
||||||
|
return True, min_score, matched_cols
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Match-group finding
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
def _find_match_groups(
|
||||||
|
df: pd.DataFrame,
|
||||||
|
strategies: list[MatchStrategy],
|
||||||
|
*,
|
||||||
|
progress_callback: Optional[Callable[[int, int], None]] = None,
|
||||||
|
) -> tuple[list[MatchResult], dict[tuple[int, int], tuple[float, list[str]]]]:
|
||||||
|
"""Pairwise comparison + union-find for transitive closure.
|
||||||
|
|
||||||
|
Returns ``(match_groups, pair_info)`` where *pair_info* maps
|
||||||
|
``(i, j)`` → ``(confidence, matched_columns)`` for logging.
|
||||||
|
"""
|
||||||
|
n = len(df)
|
||||||
|
uf = _UnionFind(n)
|
||||||
|
pair_info: dict[tuple[int, int], tuple[float, list[str]]] = {}
|
||||||
|
total_pairs = n * (n - 1) // 2
|
||||||
|
checked = 0
|
||||||
|
|
||||||
|
for i in range(n):
|
||||||
|
for j in range(i + 1, n):
|
||||||
|
for strategy in strategies:
|
||||||
|
is_match, confidence, cols = _compare_pair(
|
||||||
|
df.iloc[i], df.iloc[j], strategy
|
||||||
|
)
|
||||||
|
if is_match:
|
||||||
|
uf.union(i, j)
|
||||||
|
key = (i, j)
|
||||||
|
# Keep the highest-confidence match for this pair
|
||||||
|
if key not in pair_info or confidence > pair_info[key][0]:
|
||||||
|
pair_info[key] = (confidence, cols)
|
||||||
|
break # OR logic: one strategy match is enough
|
||||||
|
|
||||||
|
checked += 1
|
||||||
|
if progress_callback and checked % 1000 == 0:
|
||||||
|
progress_callback(checked, total_pairs)
|
||||||
|
|
||||||
|
if progress_callback:
|
||||||
|
progress_callback(total_pairs, total_pairs)
|
||||||
|
|
||||||
|
# Build MatchResult objects (survivor not yet selected)
|
||||||
|
raw_groups = uf.groups()
|
||||||
|
match_groups: list[MatchResult] = []
|
||||||
|
for gid, (root, members) in enumerate(sorted(raw_groups.items())):
|
||||||
|
# Confidence = min across all pairs in the group
|
||||||
|
group_confidence = 100.0
|
||||||
|
group_cols: set[str] = set()
|
||||||
|
for idx_a, m in enumerate(members):
|
||||||
|
for idx_b in range(idx_a + 1, len(members)):
|
||||||
|
key = (min(m, members[idx_b]), max(m, members[idx_b]))
|
||||||
|
if key in pair_info:
|
||||||
|
conf, cols = pair_info[key]
|
||||||
|
group_confidence = min(group_confidence, conf)
|
||||||
|
group_cols.update(cols)
|
||||||
|
|
||||||
|
match_groups.append(MatchResult(
|
||||||
|
group_id=gid,
|
||||||
|
row_indices=members,
|
||||||
|
confidence=round(group_confidence, 2),
|
||||||
|
matched_on=sorted(group_cols),
|
||||||
|
survivor_index=members[0], # placeholder
|
||||||
|
))
|
||||||
|
|
||||||
|
return match_groups, pair_info
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Survivor selection
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
def _select_survivor(
|
||||||
|
group: MatchResult,
|
||||||
|
df: pd.DataFrame,
|
||||||
|
rule: SurvivorRule,
|
||||||
|
date_column: Optional[str] = None,
|
||||||
|
) -> int:
|
||||||
|
"""Choose the survivor row index within a match group."""
|
||||||
|
indices = group.row_indices
|
||||||
|
|
||||||
|
if rule == SurvivorRule.KEEP_FIRST:
|
||||||
|
return indices[0]
|
||||||
|
|
||||||
|
if rule == SurvivorRule.KEEP_LAST:
|
||||||
|
return indices[-1]
|
||||||
|
|
||||||
|
if rule == SurvivorRule.KEEP_MOST_COMPLETE:
|
||||||
|
# Fewest empty/blank cells wins
|
||||||
|
best_idx = indices[0]
|
||||||
|
best_empty = _count_empty(df.iloc[indices[0]])
|
||||||
|
for idx in indices[1:]:
|
||||||
|
empty = _count_empty(df.iloc[idx])
|
||||||
|
if empty < best_empty:
|
||||||
|
best_empty = empty
|
||||||
|
best_idx = idx
|
||||||
|
return best_idx
|
||||||
|
|
||||||
|
if rule == SurvivorRule.KEEP_MOST_RECENT:
|
||||||
|
if not date_column or date_column not in df.columns:
|
||||||
|
logger.warning("date_column '{}' not found; falling back to keep_first", date_column)
|
||||||
|
return indices[0]
|
||||||
|
best_idx = indices[0]
|
||||||
|
best_date = _parse_date(df.iloc[indices[0]].get(date_column, ""))
|
||||||
|
for idx in indices[1:]:
|
||||||
|
d = _parse_date(df.iloc[idx].get(date_column, ""))
|
||||||
|
if d is not None and (best_date is None or d > best_date):
|
||||||
|
best_date = d
|
||||||
|
best_idx = idx
|
||||||
|
return best_idx
|
||||||
|
|
||||||
|
return indices[0]
|
||||||
|
|
||||||
|
|
||||||
|
def _count_empty(row: pd.Series) -> int:
|
||||||
|
"""Count empty/blank cells in a row, ignoring internal shadow columns."""
|
||||||
|
count = 0
|
||||||
|
for col, val in row.items():
|
||||||
|
if isinstance(col, str) and col.startswith("_norm_"):
|
||||||
|
continue
|
||||||
|
if pd.isna(val) or str(val).strip() == "":
|
||||||
|
count += 1
|
||||||
|
return count
|
||||||
|
|
||||||
|
|
||||||
|
def _parse_date(value) -> Optional[pd.Timestamp]:
|
||||||
|
try:
|
||||||
|
return pd.to_datetime(value)
|
||||||
|
except Exception:
|
||||||
|
return None
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Merge mode
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
def _merge_group(df: pd.DataFrame, survivor_idx: int, loser_indices: list[int]) -> pd.Series:
|
||||||
|
"""Fill missing fields in survivor from losers (ordered by position)."""
|
||||||
|
survivor = df.iloc[survivor_idx].copy()
|
||||||
|
for col in survivor.index:
|
||||||
|
if isinstance(col, str) and col.startswith("_norm_"):
|
||||||
|
continue
|
||||||
|
val = survivor[col]
|
||||||
|
if pd.isna(val) or str(val).strip() == "":
|
||||||
|
for loser_idx in loser_indices:
|
||||||
|
candidate = df.iloc[loser_idx][col]
|
||||||
|
if not pd.isna(candidate) and str(candidate).strip() != "":
|
||||||
|
survivor[col] = candidate
|
||||||
|
break
|
||||||
|
return survivor
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Auto-detect strategies
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
# (pattern, normalizer, algorithm, threshold, is_strong_key)
|
||||||
|
# Strong keys (email, phone) can be standalone strategies.
|
||||||
|
# Weak keys (name, address) must be combined with a strong key via AND.
|
||||||
|
_COLUMN_TYPE_PATTERNS: list[tuple[re.Pattern, NormalizerType, Algorithm, float, bool]] = [
|
||||||
|
(re.compile(r"e[-_]?mail", re.I), NormalizerType.EMAIL, Algorithm.EXACT, 100.0, True),
|
||||||
|
(re.compile(r"phone|telephone|mobile|cell", re.I), NormalizerType.PHONE, Algorithm.EXACT, 100.0, True),
|
||||||
|
(re.compile(r"^(name|full_name|customer_name|first_name|last_name|contact_name|respondent_name)$", re.I),
|
||||||
|
NormalizerType.NAME, Algorithm.JARO_WINKLER, 85.0, False),
|
||||||
|
(re.compile(r"address|street|addr", re.I), NormalizerType.ADDRESS, Algorithm.TOKEN_SET_RATIO, 80.0, False),
|
||||||
|
]
|
||||||
|
|
||||||
|
|
||||||
|
def build_default_strategies(df: pd.DataFrame) -> list[MatchStrategy]:
|
||||||
|
"""Auto-detect column types and build match strategies.
|
||||||
|
|
||||||
|
Strategy logic:
|
||||||
|
- Strong keys (email, phone): each gets its own standalone OR strategy.
|
||||||
|
- Weak keys (name, address): combined with each strong key via AND to
|
||||||
|
form additional strategies. Weak keys never stand alone (too many
|
||||||
|
false positives — "John" ≈ "Jon" at 93 % Jaro-Winkler).
|
||||||
|
- If only weak keys are found (no strong keys), they're promoted to
|
||||||
|
standalone strategies as a fallback.
|
||||||
|
- If no columns match, exact match on all columns (drop_duplicates
|
||||||
|
equivalent).
|
||||||
|
"""
|
||||||
|
strong_cols: list[ColumnMatchStrategy] = []
|
||||||
|
weak_cols: list[ColumnMatchStrategy] = []
|
||||||
|
|
||||||
|
for col in df.columns:
|
||||||
|
if col.startswith("_norm_"):
|
||||||
|
continue
|
||||||
|
for pattern, norm_type, algo, threshold, is_strong in _COLUMN_TYPE_PATTERNS:
|
||||||
|
if pattern.search(col):
|
||||||
|
cs = ColumnMatchStrategy(
|
||||||
|
column=col, algorithm=algo,
|
||||||
|
threshold=threshold, normalizer=norm_type,
|
||||||
|
)
|
||||||
|
if is_strong:
|
||||||
|
strong_cols.append(cs)
|
||||||
|
else:
|
||||||
|
weak_cols.append(cs)
|
||||||
|
break
|
||||||
|
|
||||||
|
strategies: list[MatchStrategy] = []
|
||||||
|
|
||||||
|
if strong_cols:
|
||||||
|
# Each strong key is a standalone strategy (OR)
|
||||||
|
for sc in strong_cols:
|
||||||
|
strategies.append(MatchStrategy(column_strategies=[sc]))
|
||||||
|
|
||||||
|
# Each weak key is paired with each strong key (AND) for extra recall
|
||||||
|
for wc in weak_cols:
|
||||||
|
for sc in strong_cols:
|
||||||
|
strategies.append(MatchStrategy(column_strategies=[wc, sc]))
|
||||||
|
elif weak_cols:
|
||||||
|
# No strong keys — promote weak to standalone (best effort)
|
||||||
|
for wc in weak_cols:
|
||||||
|
strategies.append(MatchStrategy(column_strategies=[wc]))
|
||||||
|
|
||||||
|
if strategies:
|
||||||
|
return strategies
|
||||||
|
|
||||||
|
# Fallback: exact match on all columns (equivalent to drop_duplicates)
|
||||||
|
logger.info("No column patterns matched; using exact match on all columns")
|
||||||
|
all_cols = [
|
||||||
|
ColumnMatchStrategy(column=c, algorithm=Algorithm.EXACT, threshold=100.0)
|
||||||
|
for c in df.columns
|
||||||
|
]
|
||||||
|
return [MatchStrategy(column_strategies=all_cols)]
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Normalisation pass
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
def _apply_normalizations(df: pd.DataFrame, strategies: list[MatchStrategy]) -> pd.DataFrame:
|
||||||
|
"""Add ``_norm_*`` shadow columns for every column that has a normalizer."""
|
||||||
|
df = df.copy()
|
||||||
|
seen: set[str] = set()
|
||||||
|
for strategy in strategies:
|
||||||
|
for cs in strategy.column_strategies:
|
||||||
|
if cs.normalizer and cs.column not in seen and cs.column in df.columns:
|
||||||
|
seen.add(cs.column)
|
||||||
|
norm_fn = get_normalizer(cs.normalizer)
|
||||||
|
norm_col = f"_norm_{cs.column}"
|
||||||
|
df[norm_col] = df[cs.column].apply(
|
||||||
|
lambda v, fn=norm_fn: fn(str(v)) if pd.notna(v) and str(v).strip() else ""
|
||||||
|
)
|
||||||
|
return df
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Main entry point
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
def deduplicate(
|
||||||
|
df: pd.DataFrame,
|
||||||
|
*,
|
||||||
|
strategies: Optional[list[MatchStrategy]] = None,
|
||||||
|
survivor_rule: SurvivorRule = SurvivorRule.KEEP_FIRST,
|
||||||
|
date_column: Optional[str] = None,
|
||||||
|
merge: bool = False,
|
||||||
|
preview: bool = True,
|
||||||
|
review_callback: Optional[Callable] = None,
|
||||||
|
progress_callback: Optional[Callable[[int, int], None]] = None,
|
||||||
|
) -> DeduplicationResult:
|
||||||
|
"""Run the full deduplication pipeline.
|
||||||
|
|
||||||
|
Parameters
|
||||||
|
----------
|
||||||
|
df : input DataFrame
|
||||||
|
strategies : matching strategies (auto-detected if None)
|
||||||
|
survivor_rule : which row to keep per group
|
||||||
|
date_column : used with ``KEEP_MOST_RECENT``
|
||||||
|
merge : fill missing fields in survivor from losers
|
||||||
|
preview : if True, result is informational only (no writes)
|
||||||
|
review_callback : ``(group: MatchResult, df: DataFrame) -> bool|None``
|
||||||
|
Called for each match group. Return True to accept, False to reject,
|
||||||
|
None to skip (keep both rows). Used for interactive review.
|
||||||
|
progress_callback : ``(current: int, total: int) -> None``
|
||||||
|
Called periodically during pairwise comparison.
|
||||||
|
|
||||||
|
Returns a ``DeduplicationResult``.
|
||||||
|
"""
|
||||||
|
log_entries: list[str] = []
|
||||||
|
original_count = len(df)
|
||||||
|
|
||||||
|
if strategies is None:
|
||||||
|
strategies = build_default_strategies(df)
|
||||||
|
log_entries.append(f"Auto-detected {len(strategies)} match strategies")
|
||||||
|
|
||||||
|
# Log strategies
|
||||||
|
for i, s in enumerate(strategies):
|
||||||
|
cols_desc = ", ".join(
|
||||||
|
f"{cs.column}({cs.algorithm.value}@{cs.threshold})"
|
||||||
|
for cs in s.column_strategies
|
||||||
|
)
|
||||||
|
log_entries.append(f"Strategy {i}: {cols_desc}")
|
||||||
|
logger.info("Strategy {}: {}", i, cols_desc)
|
||||||
|
|
||||||
|
# Normalise
|
||||||
|
df_work = _apply_normalizations(df, strategies)
|
||||||
|
|
||||||
|
# Find matches
|
||||||
|
match_groups, pair_info = _find_match_groups(
|
||||||
|
df_work, strategies, progress_callback=progress_callback
|
||||||
|
)
|
||||||
|
log_entries.append(f"Found {len(match_groups)} duplicate groups")
|
||||||
|
logger.info("Found {} duplicate groups from {} rows", len(match_groups), original_count)
|
||||||
|
|
||||||
|
# Interactive review
|
||||||
|
if review_callback and match_groups:
|
||||||
|
reviewed_groups: list[MatchResult] = []
|
||||||
|
for group in match_groups:
|
||||||
|
decision = review_callback(group, df_work)
|
||||||
|
if decision is True:
|
||||||
|
reviewed_groups.append(group)
|
||||||
|
log_entries.append(f"Group {group.group_id}: accepted by reviewer")
|
||||||
|
elif decision is False:
|
||||||
|
log_entries.append(f"Group {group.group_id}: rejected by reviewer")
|
||||||
|
else:
|
||||||
|
log_entries.append(f"Group {group.group_id}: skipped by reviewer")
|
||||||
|
match_groups = reviewed_groups
|
||||||
|
|
||||||
|
# Survivor selection
|
||||||
|
for group in match_groups:
|
||||||
|
group.survivor_index = _select_survivor(group, df_work, survivor_rule, date_column)
|
||||||
|
log_entries.append(
|
||||||
|
f"Group {group.group_id}: survivor=row {group.survivor_index} "
|
||||||
|
f"(rule={survivor_rule.value}, confidence={group.confidence}%)"
|
||||||
|
)
|
||||||
|
|
||||||
|
# Build result dataframes
|
||||||
|
remove_indices: set[int] = set()
|
||||||
|
merged_rows: dict[int, pd.Series] = {}
|
||||||
|
|
||||||
|
for group in match_groups:
|
||||||
|
survivor_idx = group.survivor_index
|
||||||
|
losers = [i for i in group.row_indices if i != survivor_idx]
|
||||||
|
remove_indices.update(losers)
|
||||||
|
|
||||||
|
if merge and losers:
|
||||||
|
merged = _merge_group(df_work, survivor_idx, losers)
|
||||||
|
merged_rows[survivor_idx] = merged
|
||||||
|
# Log merged fields
|
||||||
|
original = df_work.iloc[survivor_idx]
|
||||||
|
for col in original.index:
|
||||||
|
if isinstance(col, str) and col.startswith("_norm_"):
|
||||||
|
continue
|
||||||
|
orig_val = str(original[col]).strip()
|
||||||
|
new_val = str(merged[col]).strip()
|
||||||
|
if orig_val != new_val and not orig_val:
|
||||||
|
log_entries.append(
|
||||||
|
f"Group {group.group_id}: merged '{col}' "
|
||||||
|
f"into survivor from losers: '{new_val}'"
|
||||||
|
)
|
||||||
|
|
||||||
|
# Build output DataFrames
|
||||||
|
keep_indices = [i for i in range(len(df_work)) if i not in remove_indices]
|
||||||
|
|
||||||
|
if merged_rows:
|
||||||
|
rows = []
|
||||||
|
for i in keep_indices:
|
||||||
|
if i in merged_rows:
|
||||||
|
rows.append(merged_rows[i])
|
||||||
|
else:
|
||||||
|
rows.append(df_work.iloc[i])
|
||||||
|
deduplicated_df = pd.DataFrame(rows)
|
||||||
|
else:
|
||||||
|
deduplicated_df = df_work.iloc[keep_indices].copy()
|
||||||
|
|
||||||
|
removed_df = df_work.iloc[sorted(remove_indices)].copy() if remove_indices else pd.DataFrame()
|
||||||
|
|
||||||
|
# Drop shadow columns from output
|
||||||
|
norm_cols = [c for c in deduplicated_df.columns if str(c).startswith("_norm_")]
|
||||||
|
deduplicated_df = deduplicated_df.drop(columns=norm_cols, errors="ignore")
|
||||||
|
if not removed_df.empty:
|
||||||
|
removed_df = removed_df.drop(columns=norm_cols, errors="ignore")
|
||||||
|
|
||||||
|
# Reset index
|
||||||
|
deduplicated_df = deduplicated_df.reset_index(drop=True)
|
||||||
|
if not removed_df.empty:
|
||||||
|
removed_df = removed_df.reset_index(drop=True)
|
||||||
|
|
||||||
|
removed_count = original_count - len(deduplicated_df)
|
||||||
|
log_entries.append(f"Result: {original_count} → {len(deduplicated_df)} rows ({removed_count} removed)")
|
||||||
|
|
||||||
|
return DeduplicationResult(
|
||||||
|
original_row_count=original_count,
|
||||||
|
deduplicated_df=deduplicated_df,
|
||||||
|
removed_df=removed_df,
|
||||||
|
match_groups=match_groups,
|
||||||
|
log_entries=log_entries,
|
||||||
|
is_preview=preview,
|
||||||
|
)
|
||||||
247
src/core/io.py
Normal file
247
src/core/io.py
Normal file
@@ -0,0 +1,247 @@
|
|||||||
|
"""File I/O: encoding/delimiter detection, CSV/Excel reading, output writing."""
|
||||||
|
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import csv
|
||||||
|
import io
|
||||||
|
from pathlib import Path
|
||||||
|
from typing import Generator, Optional
|
||||||
|
|
||||||
|
import pandas as pd
|
||||||
|
from charset_normalizer import from_bytes
|
||||||
|
from loguru import logger
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Encoding detection
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
def detect_encoding(path: Path, sample_bytes: int = 65_536) -> str:
|
||||||
|
"""Detect file encoding by reading the first *sample_bytes*.
|
||||||
|
|
||||||
|
Returns the best-guess encoding name (e.g. ``utf-8``, ``windows-1252``).
|
||||||
|
Falls back to ``utf-8`` when detection is inconclusive.
|
||||||
|
"""
|
||||||
|
raw = Path(path).read_bytes()[:sample_bytes]
|
||||||
|
if not raw:
|
||||||
|
return "utf-8"
|
||||||
|
|
||||||
|
# Check BOM first
|
||||||
|
if raw[:3] == b"\xef\xbb\xbf":
|
||||||
|
return "utf-8-sig"
|
||||||
|
if raw[:2] in (b"\xff\xfe", b"\xfe\xff"):
|
||||||
|
return "utf-16"
|
||||||
|
|
||||||
|
result = from_bytes(raw).best()
|
||||||
|
if result is None:
|
||||||
|
return "utf-8"
|
||||||
|
enc = result.encoding.lower()
|
||||||
|
# Normalise common aliases
|
||||||
|
if enc in ("ascii", "us-ascii"):
|
||||||
|
enc = "utf-8"
|
||||||
|
return enc
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Delimiter detection
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
_COMMON_DELIMITERS = [",", "\t", ";", "|"]
|
||||||
|
|
||||||
|
|
||||||
|
def detect_delimiter(path: Path, encoding: str = "utf-8") -> str:
|
||||||
|
"""Sniff the delimiter from the first 20 lines of a text file.
|
||||||
|
|
||||||
|
Falls back to comma if csv.Sniffer cannot decide.
|
||||||
|
"""
|
||||||
|
raw_path = Path(path)
|
||||||
|
lines: list[str] = []
|
||||||
|
with raw_path.open("r", encoding=encoding, errors="replace") as fh:
|
||||||
|
for _ in range(20):
|
||||||
|
line = fh.readline()
|
||||||
|
if not line:
|
||||||
|
break
|
||||||
|
lines.append(line)
|
||||||
|
|
||||||
|
if not lines:
|
||||||
|
return ","
|
||||||
|
|
||||||
|
sample = "".join(lines)
|
||||||
|
try:
|
||||||
|
dialect = csv.Sniffer().sniff(sample, delimiters="".join(_COMMON_DELIMITERS))
|
||||||
|
return dialect.delimiter
|
||||||
|
except csv.Error:
|
||||||
|
return ","
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Header-row detection
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
def detect_header_row(path: Path, encoding: str = "utf-8", delimiter: str = ",",
|
||||||
|
max_scan: int = 20) -> int:
|
||||||
|
"""Return the 0-based index of the likely header row.
|
||||||
|
|
||||||
|
Heuristic: the first row where *every* cell looks like a column name
|
||||||
|
(non-numeric, non-empty string). Falls back to 0.
|
||||||
|
"""
|
||||||
|
raw_path = Path(path)
|
||||||
|
with raw_path.open("r", encoding=encoding, errors="replace") as fh:
|
||||||
|
reader = csv.reader(fh, delimiter=delimiter)
|
||||||
|
for idx, row in enumerate(reader):
|
||||||
|
if idx >= max_scan:
|
||||||
|
break
|
||||||
|
if not row:
|
||||||
|
continue
|
||||||
|
# All cells must be non-empty, non-numeric strings
|
||||||
|
if all(_looks_like_header(cell) for cell in row if cell.strip()):
|
||||||
|
return idx
|
||||||
|
return 0
|
||||||
|
|
||||||
|
|
||||||
|
def _looks_like_header(value: str) -> bool:
|
||||||
|
"""True if *value* looks like a column header, not a data value."""
|
||||||
|
v = value.strip()
|
||||||
|
if not v:
|
||||||
|
return False
|
||||||
|
# Pure numbers are not headers
|
||||||
|
try:
|
||||||
|
float(v.replace(",", ""))
|
||||||
|
return False
|
||||||
|
except ValueError:
|
||||||
|
pass
|
||||||
|
return True
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Excel helpers
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
def list_sheets(path: Path) -> list[str]:
|
||||||
|
"""Return sheet names from an Excel workbook."""
|
||||||
|
xl = pd.ExcelFile(path, engine="openpyxl")
|
||||||
|
return xl.sheet_names
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Reading
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
def read_file(
|
||||||
|
path: str | Path,
|
||||||
|
*,
|
||||||
|
encoding: Optional[str] = None,
|
||||||
|
delimiter: Optional[str] = None,
|
||||||
|
header_row: Optional[int] = None,
|
||||||
|
sheet_name: Optional[str | int] = 0,
|
||||||
|
chunk_size: Optional[int] = None,
|
||||||
|
) -> pd.DataFrame | Generator[pd.DataFrame, None, None]:
|
||||||
|
"""Read a CSV, TSV, or Excel file into a DataFrame.
|
||||||
|
|
||||||
|
Parameters
|
||||||
|
----------
|
||||||
|
path : file path
|
||||||
|
encoding : override detected encoding (CSV only)
|
||||||
|
delimiter : override detected delimiter (CSV only)
|
||||||
|
header_row : 0-based row index for the header; auto-detected if *None*
|
||||||
|
sheet_name : Excel sheet (name or 0-based index). Ignored for CSV.
|
||||||
|
chunk_size : if set, return a generator of DataFrames (CSV only).
|
||||||
|
|
||||||
|
Returns a DataFrame (or generator when *chunk_size* is set).
|
||||||
|
"""
|
||||||
|
filepath = Path(path)
|
||||||
|
if not filepath.exists():
|
||||||
|
raise FileNotFoundError(f"File not found: {filepath}")
|
||||||
|
|
||||||
|
suffix = filepath.suffix.lower()
|
||||||
|
if suffix in (".xlsx", ".xls"):
|
||||||
|
return _read_excel(filepath, header_row=header_row, sheet_name=sheet_name)
|
||||||
|
else:
|
||||||
|
return _read_csv(
|
||||||
|
filepath,
|
||||||
|
encoding=encoding,
|
||||||
|
delimiter=delimiter,
|
||||||
|
header_row=header_row,
|
||||||
|
chunk_size=chunk_size,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def _read_csv(
|
||||||
|
path: Path,
|
||||||
|
*,
|
||||||
|
encoding: Optional[str] = None,
|
||||||
|
delimiter: Optional[str] = None,
|
||||||
|
header_row: Optional[int] = None,
|
||||||
|
chunk_size: Optional[int] = None,
|
||||||
|
) -> pd.DataFrame | Generator[pd.DataFrame, None, None]:
|
||||||
|
enc = encoding or detect_encoding(path)
|
||||||
|
delim = delimiter or detect_delimiter(path, enc)
|
||||||
|
hdr = header_row if header_row is not None else detect_header_row(path, enc, delim)
|
||||||
|
|
||||||
|
logger.debug("Reading CSV {} (encoding={}, delimiter={!r}, header_row={})",
|
||||||
|
path.name, enc, delim, hdr)
|
||||||
|
|
||||||
|
kwargs: dict = dict(
|
||||||
|
filepath_or_buffer=path,
|
||||||
|
encoding=enc,
|
||||||
|
delimiter=delim,
|
||||||
|
header=hdr,
|
||||||
|
dtype=str,
|
||||||
|
keep_default_na=False,
|
||||||
|
on_bad_lines="warn",
|
||||||
|
)
|
||||||
|
|
||||||
|
if chunk_size:
|
||||||
|
return pd.read_csv(**kwargs, chunksize=chunk_size)
|
||||||
|
|
||||||
|
return pd.read_csv(**kwargs)
|
||||||
|
|
||||||
|
|
||||||
|
def _read_excel(
|
||||||
|
path: Path,
|
||||||
|
*,
|
||||||
|
header_row: Optional[int] = None,
|
||||||
|
sheet_name: Optional[str | int] = 0,
|
||||||
|
) -> pd.DataFrame:
|
||||||
|
hdr = header_row if header_row is not None else 0
|
||||||
|
logger.debug("Reading Excel {} (sheet={}, header_row={})", path.name, sheet_name, hdr)
|
||||||
|
return pd.read_excel(
|
||||||
|
path,
|
||||||
|
sheet_name=sheet_name,
|
||||||
|
header=hdr,
|
||||||
|
dtype=str,
|
||||||
|
keep_default_na=False,
|
||||||
|
engine="openpyxl",
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Writing
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
def write_file(
|
||||||
|
df: pd.DataFrame,
|
||||||
|
path: str | Path,
|
||||||
|
*,
|
||||||
|
file_format: Optional[str] = None,
|
||||||
|
encoding: str = "utf-8-sig",
|
||||||
|
) -> Path:
|
||||||
|
"""Write a DataFrame to CSV or Excel.
|
||||||
|
|
||||||
|
Parameters
|
||||||
|
----------
|
||||||
|
df : DataFrame to write
|
||||||
|
path : output file path
|
||||||
|
file_format : ``"csv"`` or ``"xlsx"``; auto-detected from *path* suffix if *None*
|
||||||
|
encoding : output encoding (default ``utf-8-sig`` for Windows Excel compat)
|
||||||
|
|
||||||
|
Returns the resolved output Path.
|
||||||
|
"""
|
||||||
|
out = Path(path)
|
||||||
|
fmt = file_format or out.suffix.lstrip(".").lower()
|
||||||
|
if fmt in ("xlsx", "xls"):
|
||||||
|
df.to_excel(out, index=False, engine="openpyxl")
|
||||||
|
else:
|
||||||
|
df.to_csv(out, index=False, encoding=encoding)
|
||||||
|
logger.info("Wrote {} rows to {}", len(df), out)
|
||||||
|
return out
|
||||||
224
src/core/normalizers.py
Normal file
224
src/core/normalizers.py
Normal file
@@ -0,0 +1,224 @@
|
|||||||
|
"""Per-column normalization functions for deduplication matching.
|
||||||
|
|
||||||
|
Every normalizer is ``str -> str``, handles None/empty gracefully, and is
|
||||||
|
idempotent (applying it twice yields the same result as once).
|
||||||
|
"""
|
||||||
|
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import re
|
||||||
|
from enum import Enum
|
||||||
|
from typing import Callable, Optional
|
||||||
|
|
||||||
|
import phonenumbers
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Types
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
class NormalizerType(str, Enum):
|
||||||
|
EMAIL = "email"
|
||||||
|
PHONE = "phone"
|
||||||
|
NAME = "name"
|
||||||
|
ADDRESS = "address"
|
||||||
|
STRING = "string"
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# String normalizer (base)
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
def normalize_string(value: Optional[str]) -> str:
|
||||||
|
"""Trim, collapse internal whitespace, case-fold."""
|
||||||
|
if not value or not isinstance(value, str):
|
||||||
|
return ""
|
||||||
|
return re.sub(r"\s+", " ", value.strip()).casefold()
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Email normalizer
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
_GMAIL_DOMAINS = {"gmail.com", "googlemail.com"}
|
||||||
|
|
||||||
|
|
||||||
|
def normalize_email(value: Optional[str]) -> str:
|
||||||
|
"""Lowercase, strip whitespace, strip Gmail dots, strip +tag suffixes."""
|
||||||
|
if not value or not isinstance(value, str):
|
||||||
|
return ""
|
||||||
|
email = value.strip().lower()
|
||||||
|
if "@" not in email:
|
||||||
|
return email
|
||||||
|
|
||||||
|
local, domain = email.rsplit("@", 1)
|
||||||
|
|
||||||
|
# Strip +tag suffix
|
||||||
|
if "+" in local:
|
||||||
|
local = local.split("+", 1)[0]
|
||||||
|
|
||||||
|
# Strip dots for Gmail addresses
|
||||||
|
if domain in _GMAIL_DOMAINS:
|
||||||
|
local = local.replace(".", "")
|
||||||
|
|
||||||
|
return f"{local}@{domain}"
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Phone normalizer
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
def normalize_phone(value: Optional[str], default_region: str = "US") -> str:
|
||||||
|
"""Parse with phonenumbers lib, return E.164. Fallback: digits-only."""
|
||||||
|
if not value or not isinstance(value, str):
|
||||||
|
return ""
|
||||||
|
stripped = value.strip()
|
||||||
|
if not stripped:
|
||||||
|
return ""
|
||||||
|
|
||||||
|
try:
|
||||||
|
parsed = phonenumbers.parse(stripped, default_region)
|
||||||
|
if phonenumbers.is_possible_number(parsed):
|
||||||
|
return phonenumbers.format_number(parsed, phonenumbers.PhoneNumberFormat.E164)
|
||||||
|
except phonenumbers.NumberParseException:
|
||||||
|
pass
|
||||||
|
|
||||||
|
# Fallback: digits only
|
||||||
|
digits = re.sub(r"\D", "", stripped)
|
||||||
|
return digits
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Name normalizer
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
_TITLE_PREFIXES = {
|
||||||
|
"mr", "mrs", "ms", "miss", "dr", "prof", "professor",
|
||||||
|
"sir", "madam", "rev", "reverend", "hon", "honorable",
|
||||||
|
}
|
||||||
|
_NAME_SUFFIXES = {
|
||||||
|
"jr", "sr", "ii", "iii", "iv", "v",
|
||||||
|
"phd", "md", "esq", "dds", "rn",
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def normalize_name(value: Optional[str]) -> str:
|
||||||
|
"""Strip titles/suffixes, collapse whitespace, case-fold."""
|
||||||
|
if not value or not isinstance(value, str):
|
||||||
|
return ""
|
||||||
|
name = value.strip()
|
||||||
|
if not name:
|
||||||
|
return ""
|
||||||
|
|
||||||
|
# Case-fold first for matching
|
||||||
|
name = name.casefold()
|
||||||
|
|
||||||
|
# Remove periods and commas that are part of titles/suffixes
|
||||||
|
name = name.replace(".", " ").replace(",", " ")
|
||||||
|
|
||||||
|
parts = name.split()
|
||||||
|
|
||||||
|
# Strip leading titles
|
||||||
|
while parts and parts[0].rstrip(".") in _TITLE_PREFIXES:
|
||||||
|
parts.pop(0)
|
||||||
|
|
||||||
|
# Strip trailing suffixes
|
||||||
|
while parts and parts[-1].rstrip(".") in _NAME_SUFFIXES:
|
||||||
|
parts.pop()
|
||||||
|
|
||||||
|
return " ".join(parts)
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Address normalizer
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
_USPS_ABBREVIATIONS: dict[str, str] = {
|
||||||
|
"street": "st",
|
||||||
|
"avenue": "ave",
|
||||||
|
"boulevard": "blvd",
|
||||||
|
"drive": "dr",
|
||||||
|
"lane": "ln",
|
||||||
|
"road": "rd",
|
||||||
|
"court": "ct",
|
||||||
|
"place": "pl",
|
||||||
|
"circle": "cir",
|
||||||
|
"trail": "trl",
|
||||||
|
"way": "way",
|
||||||
|
"terrace": "ter",
|
||||||
|
"parkway": "pkwy",
|
||||||
|
"highway": "hwy",
|
||||||
|
"expressway": "expy",
|
||||||
|
"freeway": "fwy",
|
||||||
|
"square": "sq",
|
||||||
|
"loop": "loop",
|
||||||
|
"alley": "aly",
|
||||||
|
"crossing": "xing",
|
||||||
|
"point": "pt",
|
||||||
|
"north": "n",
|
||||||
|
"south": "s",
|
||||||
|
"east": "e",
|
||||||
|
"west": "w",
|
||||||
|
"northeast": "ne",
|
||||||
|
"northwest": "nw",
|
||||||
|
"southeast": "se",
|
||||||
|
"southwest": "sw",
|
||||||
|
"apartment": "apt",
|
||||||
|
"suite": "ste",
|
||||||
|
"building": "bldg",
|
||||||
|
"floor": "fl",
|
||||||
|
"room": "rm",
|
||||||
|
"unit": "unit",
|
||||||
|
"number": "#",
|
||||||
|
"saint": "st",
|
||||||
|
"fort": "ft",
|
||||||
|
"mount": "mt",
|
||||||
|
"heights": "hts",
|
||||||
|
"springs": "spgs",
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def normalize_address(value: Optional[str]) -> str:
|
||||||
|
"""USPS abbreviation normalization, collapse whitespace, case-fold."""
|
||||||
|
if not value or not isinstance(value, str):
|
||||||
|
return ""
|
||||||
|
addr = value.strip()
|
||||||
|
if not addr:
|
||||||
|
return ""
|
||||||
|
|
||||||
|
# Case-fold and clean punctuation (keep #)
|
||||||
|
addr = addr.casefold()
|
||||||
|
addr = addr.replace(".", " ").replace(",", " ")
|
||||||
|
|
||||||
|
parts = addr.split()
|
||||||
|
normalized_parts = []
|
||||||
|
for part in parts:
|
||||||
|
normalized_parts.append(_USPS_ABBREVIATIONS.get(part, part))
|
||||||
|
|
||||||
|
return " ".join(normalized_parts)
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Registry
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
_NORMALIZER_MAP: dict[NormalizerType, Callable[[str], str]] = {
|
||||||
|
NormalizerType.EMAIL: normalize_email,
|
||||||
|
NormalizerType.PHONE: normalize_phone,
|
||||||
|
NormalizerType.NAME: normalize_name,
|
||||||
|
NormalizerType.ADDRESS: normalize_address,
|
||||||
|
NormalizerType.STRING: normalize_string,
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def get_normalizer(normalizer_type: NormalizerType | str) -> Callable[[str], str]:
|
||||||
|
"""Return the normalizer function for the given type.
|
||||||
|
|
||||||
|
Accepts both ``NormalizerType`` enum values and plain strings.
|
||||||
|
"""
|
||||||
|
if isinstance(normalizer_type, str):
|
||||||
|
normalizer_type = NormalizerType(normalizer_type.lower())
|
||||||
|
func = _NORMALIZER_MAP.get(normalizer_type)
|
||||||
|
if func is None:
|
||||||
|
raise ValueError(f"Unknown normalizer type: {normalizer_type}")
|
||||||
|
return func
|
||||||
1
src/gui/__init__.py
Normal file
1
src/gui/__init__.py
Normal file
@@ -0,0 +1 @@
|
|||||||
|
"""Streamlit GUI for the DataTools Deduplicator."""
|
||||||
8
src/gui/__main__.py
Normal file
8
src/gui/__main__.py
Normal file
@@ -0,0 +1,8 @@
|
|||||||
|
"""Allow running as ``python -m src.gui``."""
|
||||||
|
|
||||||
|
import subprocess
|
||||||
|
import sys
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
app_path = Path(__file__).parent / "app.py"
|
||||||
|
subprocess.run([sys.executable, "-m", "streamlit", "run", str(app_path)])
|
||||||
287
src/gui/app.py
Normal file
287
src/gui/app.py
Normal file
@@ -0,0 +1,287 @@
|
|||||||
|
"""DataTools Deduplicator — Streamlit GUI.
|
||||||
|
|
||||||
|
Launch:
|
||||||
|
streamlit run src/gui/app.py
|
||||||
|
"""
|
||||||
|
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import io
|
||||||
|
import sys
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
import pandas as pd
|
||||||
|
import streamlit as st
|
||||||
|
|
||||||
|
# Ensure project root is on sys.path so `src.core` imports work
|
||||||
|
_project_root = Path(__file__).resolve().parent.parent.parent
|
||||||
|
if str(_project_root) not in sys.path:
|
||||||
|
sys.path.insert(0, str(_project_root))
|
||||||
|
|
||||||
|
from src.core.dedup import deduplicate, build_default_strategies, DeduplicationResult
|
||||||
|
from src.core.io import read_file, list_sheets
|
||||||
|
from src.core.config import DeduplicationConfig
|
||||||
|
from src.gui.components import config_panel, match_group_card, results_summary
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Page config
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
st.set_page_config(
|
||||||
|
page_title="DataTools Deduplicator",
|
||||||
|
page_icon="🔍",
|
||||||
|
layout="wide",
|
||||||
|
)
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Session state defaults
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
_DEFAULTS = {
|
||||||
|
"df": None,
|
||||||
|
"result": None,
|
||||||
|
"review_decisions": {},
|
||||||
|
"config": None,
|
||||||
|
"file_name": "",
|
||||||
|
"sheet_names": [],
|
||||||
|
}
|
||||||
|
for key, default in _DEFAULTS.items():
|
||||||
|
if key not in st.session_state:
|
||||||
|
st.session_state[key] = default
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Header
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
st.title("DataTools Deduplicator")
|
||||||
|
st.caption("Find and remove duplicate rows in CSV and Excel files.")
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# File upload
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
uploaded = st.file_uploader(
|
||||||
|
"Upload CSV or Excel file",
|
||||||
|
type=["csv", "tsv", "xlsx", "xls"],
|
||||||
|
help="Supports CSV, TSV, and Excel files. Encoding and delimiters are auto-detected.",
|
||||||
|
)
|
||||||
|
|
||||||
|
if uploaded is not None:
|
||||||
|
# Detect if file changed
|
||||||
|
if uploaded.name != st.session_state["file_name"]:
|
||||||
|
st.session_state["file_name"] = uploaded.name
|
||||||
|
st.session_state["result"] = None
|
||||||
|
st.session_state["review_decisions"] = {}
|
||||||
|
|
||||||
|
# Read the file
|
||||||
|
try:
|
||||||
|
# Write to a temp file for read_file() which needs a path
|
||||||
|
import tempfile
|
||||||
|
suffix = Path(uploaded.name).suffix
|
||||||
|
with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
|
||||||
|
tmp.write(uploaded.getvalue())
|
||||||
|
tmp_path = Path(tmp.name)
|
||||||
|
|
||||||
|
# Check for Excel sheets
|
||||||
|
if suffix.lower() in (".xlsx", ".xls"):
|
||||||
|
st.session_state["sheet_names"] = list_sheets(tmp_path)
|
||||||
|
else:
|
||||||
|
st.session_state["sheet_names"] = []
|
||||||
|
|
||||||
|
df = read_file(tmp_path)
|
||||||
|
if not isinstance(df, pd.DataFrame):
|
||||||
|
df = pd.concat(list(df), ignore_index=True)
|
||||||
|
|
||||||
|
st.session_state["df"] = df
|
||||||
|
|
||||||
|
# Clean up temp file
|
||||||
|
tmp_path.unlink(missing_ok=True)
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
st.error(f"Failed to read file: {e}")
|
||||||
|
st.session_state["df"] = None
|
||||||
|
|
||||||
|
df = st.session_state["df"]
|
||||||
|
|
||||||
|
if df is not None:
|
||||||
|
# Sheet selector for Excel files
|
||||||
|
if st.session_state["sheet_names"] and len(st.session_state["sheet_names"]) > 1:
|
||||||
|
sheet = st.selectbox(
|
||||||
|
"Select sheet",
|
||||||
|
st.session_state["sheet_names"],
|
||||||
|
)
|
||||||
|
if sheet != st.session_state.get("_current_sheet"):
|
||||||
|
st.session_state["_current_sheet"] = sheet
|
||||||
|
suffix = Path(uploaded.name).suffix
|
||||||
|
import tempfile
|
||||||
|
with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
|
||||||
|
tmp.write(uploaded.getvalue())
|
||||||
|
tmp_path = Path(tmp.name)
|
||||||
|
df = read_file(tmp_path, sheet_name=sheet)
|
||||||
|
if not isinstance(df, pd.DataFrame):
|
||||||
|
df = pd.concat(list(df), ignore_index=True)
|
||||||
|
st.session_state["df"] = df
|
||||||
|
st.session_state["result"] = None
|
||||||
|
st.session_state["review_decisions"] = {}
|
||||||
|
tmp_path.unlink(missing_ok=True)
|
||||||
|
|
||||||
|
# Preview
|
||||||
|
st.subheader(f"Preview: {uploaded.name}")
|
||||||
|
st.caption(f"{len(df)} rows, {len(df.columns)} columns")
|
||||||
|
st.dataframe(df.head(10), use_container_width=True)
|
||||||
|
|
||||||
|
# Advanced options
|
||||||
|
settings = config_panel(df)
|
||||||
|
|
||||||
|
# Apply loaded config if present
|
||||||
|
loaded_cfg = st.session_state.get("loaded_config")
|
||||||
|
if loaded_cfg is not None:
|
||||||
|
settings["strategies"] = loaded_cfg.to_strategies()
|
||||||
|
settings["survivor_rule"] = loaded_cfg.to_survivor_rule()
|
||||||
|
settings["date_column"] = loaded_cfg.date_column
|
||||||
|
settings["merge"] = loaded_cfg.merge
|
||||||
|
# Clear so it doesn't override on every rerun
|
||||||
|
del st.session_state["loaded_config"]
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Find Duplicates button
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
st.divider()
|
||||||
|
|
||||||
|
if st.button("Find Duplicates", type="primary", use_container_width=True):
|
||||||
|
progress_bar = st.progress(0, text="Comparing rows...")
|
||||||
|
|
||||||
|
def _gui_progress(current: int, total: int) -> None:
|
||||||
|
if total > 0:
|
||||||
|
pct = min(current / total, 1.0)
|
||||||
|
progress_bar.progress(pct, text=f"Comparing rows... {current:,}/{total:,}")
|
||||||
|
|
||||||
|
with st.spinner("Running deduplication..."):
|
||||||
|
result = deduplicate(
|
||||||
|
df,
|
||||||
|
strategies=settings["strategies"],
|
||||||
|
survivor_rule=settings["survivor_rule"],
|
||||||
|
date_column=settings["date_column"],
|
||||||
|
merge=settings["merge"],
|
||||||
|
preview=False,
|
||||||
|
progress_callback=_gui_progress,
|
||||||
|
)
|
||||||
|
|
||||||
|
progress_bar.empty()
|
||||||
|
st.session_state["result"] = result
|
||||||
|
st.session_state["review_decisions"] = {}
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Results
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
result: DeduplicationResult | None = st.session_state["result"]
|
||||||
|
|
||||||
|
if result is not None:
|
||||||
|
st.divider()
|
||||||
|
st.subheader("Results")
|
||||||
|
|
||||||
|
# Summary + download buttons
|
||||||
|
results_summary(result, df)
|
||||||
|
|
||||||
|
# Match group review
|
||||||
|
if result.match_groups:
|
||||||
|
st.divider()
|
||||||
|
st.subheader("Match Groups")
|
||||||
|
|
||||||
|
# Batch actions
|
||||||
|
action_left, action_mid, action_right = st.columns(3)
|
||||||
|
with action_left:
|
||||||
|
if st.button("Accept All"):
|
||||||
|
for g in result.match_groups:
|
||||||
|
st.session_state["review_decisions"][g.group_id] = True
|
||||||
|
st.rerun()
|
||||||
|
with action_mid:
|
||||||
|
if st.button("Reject All"):
|
||||||
|
for g in result.match_groups:
|
||||||
|
st.session_state["review_decisions"][g.group_id] = False
|
||||||
|
st.rerun()
|
||||||
|
with action_right:
|
||||||
|
if st.button("Clear Decisions"):
|
||||||
|
st.session_state["review_decisions"] = {}
|
||||||
|
st.rerun()
|
||||||
|
|
||||||
|
# Individual group cards
|
||||||
|
decisions = st.session_state["review_decisions"]
|
||||||
|
for i, group in enumerate(result.match_groups):
|
||||||
|
decision = match_group_card(group, df, group_num=i + 1)
|
||||||
|
if decision is not None:
|
||||||
|
decisions[group.group_id] = decision
|
||||||
|
st.session_state["review_decisions"] = decisions
|
||||||
|
st.rerun()
|
||||||
|
|
||||||
|
# Show decision summary
|
||||||
|
if decisions:
|
||||||
|
st.divider()
|
||||||
|
accepted = sum(1 for v in decisions.values() if v is True)
|
||||||
|
rejected = sum(1 for v in decisions.values() if v is False)
|
||||||
|
pending = len(result.match_groups) - len(decisions)
|
||||||
|
st.caption(
|
||||||
|
f"Decisions: {accepted} merged, {rejected} kept both, "
|
||||||
|
f"{pending} pending"
|
||||||
|
)
|
||||||
|
|
||||||
|
# Re-run dedup with review decisions applied
|
||||||
|
if st.button(
|
||||||
|
"Apply Review Decisions & Download",
|
||||||
|
type="primary",
|
||||||
|
use_container_width=True,
|
||||||
|
):
|
||||||
|
def _review_callback(group, _df):
|
||||||
|
gid = group.group_id
|
||||||
|
if gid in decisions:
|
||||||
|
return decisions[gid]
|
||||||
|
return True # default: accept
|
||||||
|
|
||||||
|
reviewed_result = deduplicate(
|
||||||
|
df,
|
||||||
|
strategies=settings["strategies"],
|
||||||
|
survivor_rule=settings["survivor_rule"],
|
||||||
|
date_column=settings["date_column"],
|
||||||
|
merge=settings["merge"],
|
||||||
|
preview=False,
|
||||||
|
review_callback=_review_callback,
|
||||||
|
)
|
||||||
|
|
||||||
|
# Update result and show downloads
|
||||||
|
st.session_state["result"] = reviewed_result
|
||||||
|
|
||||||
|
csv_bytes = reviewed_result.deduplicated_df.to_csv(
|
||||||
|
index=False
|
||||||
|
).encode("utf-8-sig")
|
||||||
|
st.download_button(
|
||||||
|
"Download Reviewed & Deduplicated CSV",
|
||||||
|
data=csv_bytes,
|
||||||
|
file_name="deduplicated_reviewed.csv",
|
||||||
|
mime="text/csv",
|
||||||
|
key="reviewed_download",
|
||||||
|
)
|
||||||
|
|
||||||
|
# Log entries
|
||||||
|
if result.log_entries:
|
||||||
|
with st.expander("Processing Log"):
|
||||||
|
st.code("\n".join(result.log_entries))
|
||||||
|
|
||||||
|
else:
|
||||||
|
# No file uploaded — show placeholder
|
||||||
|
st.info("Upload a CSV or Excel file to get started.")
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Footer
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
st.divider()
|
||||||
|
st.caption(
|
||||||
|
"Runs locally. Your data never leaves this computer. "
|
||||||
|
"| DataTools Deduplicator v1.0"
|
||||||
|
)
|
||||||
413
src/gui/components.py
Normal file
413
src/gui/components.py
Normal file
@@ -0,0 +1,413 @@
|
|||||||
|
"""Reusable Streamlit widgets for the deduplicator GUI."""
|
||||||
|
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import io
|
||||||
|
from typing import Optional
|
||||||
|
|
||||||
|
import pandas as pd
|
||||||
|
import streamlit as st
|
||||||
|
|
||||||
|
from src.core.dedup import (
|
||||||
|
Algorithm,
|
||||||
|
ColumnMatchStrategy,
|
||||||
|
DeduplicationResult,
|
||||||
|
MatchResult,
|
||||||
|
MatchStrategy,
|
||||||
|
SurvivorRule,
|
||||||
|
)
|
||||||
|
from src.core.config import (
|
||||||
|
ColumnStrategyConfig,
|
||||||
|
DeduplicationConfig,
|
||||||
|
StrategyConfig,
|
||||||
|
)
|
||||||
|
from src.core.normalizers import NormalizerType
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Config panel (advanced options)
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
def config_panel(df: pd.DataFrame) -> dict:
|
||||||
|
"""Render the Advanced Options expander. Returns a settings dict.
|
||||||
|
|
||||||
|
Keys returned:
|
||||||
|
strategies: list[MatchStrategy] | None
|
||||||
|
survivor_rule: SurvivorRule
|
||||||
|
date_column: str | None
|
||||||
|
merge: bool
|
||||||
|
"""
|
||||||
|
columns = list(df.columns)
|
||||||
|
|
||||||
|
with st.expander("Advanced Options"):
|
||||||
|
col_left, col_right = st.columns(2)
|
||||||
|
|
||||||
|
with col_left:
|
||||||
|
subset_cols = st.multiselect(
|
||||||
|
"Match on columns",
|
||||||
|
columns,
|
||||||
|
default=[],
|
||||||
|
help="Leave empty to auto-detect based on column names.",
|
||||||
|
)
|
||||||
|
key_cols = st.multiselect(
|
||||||
|
"Strong keys",
|
||||||
|
columns,
|
||||||
|
default=[],
|
||||||
|
help="Columns that uniquely identify records (e.g., EIN, SKU). Each is an independent exact-match strategy.",
|
||||||
|
)
|
||||||
|
fuzzy_cols = st.multiselect(
|
||||||
|
"Fuzzy columns",
|
||||||
|
columns,
|
||||||
|
default=[],
|
||||||
|
help="Columns to fuzzy-match. Others use exact matching.",
|
||||||
|
)
|
||||||
|
|
||||||
|
with col_right:
|
||||||
|
algorithm = st.selectbox(
|
||||||
|
"Fuzzy algorithm",
|
||||||
|
["jaro_winkler", "levenshtein", "token_set_ratio"],
|
||||||
|
index=0,
|
||||||
|
help="jaro_winkler: best for names. levenshtein: best for typos. token_set_ratio: best for addresses.",
|
||||||
|
)
|
||||||
|
threshold = st.slider(
|
||||||
|
"Similarity threshold",
|
||||||
|
min_value=50,
|
||||||
|
max_value=100,
|
||||||
|
value=85,
|
||||||
|
help="Lower = more matches but more false positives.",
|
||||||
|
)
|
||||||
|
survivor = st.selectbox(
|
||||||
|
"Survivor rule",
|
||||||
|
["first", "last", "most-complete", "most-recent"],
|
||||||
|
index=0,
|
||||||
|
help="Which row to keep when duplicates are found.",
|
||||||
|
)
|
||||||
|
|
||||||
|
# Second row of options
|
||||||
|
col_a, col_b = st.columns(2)
|
||||||
|
|
||||||
|
with col_a:
|
||||||
|
normalize_options = {c: "auto" for c in columns}
|
||||||
|
normalizer_types = ["auto", "email", "phone", "name", "address", "string", "none"]
|
||||||
|
|
||||||
|
normalize_map: dict[str, str] = {}
|
||||||
|
if fuzzy_cols or subset_cols:
|
||||||
|
target_cols = fuzzy_cols or subset_cols
|
||||||
|
st.markdown("**Per-column normalizers**")
|
||||||
|
for col_name in target_cols:
|
||||||
|
norm = st.selectbox(
|
||||||
|
f"Normalizer for '{col_name}'",
|
||||||
|
normalizer_types,
|
||||||
|
index=0,
|
||||||
|
key=f"norm_{col_name}",
|
||||||
|
)
|
||||||
|
if norm not in ("auto", "none"):
|
||||||
|
normalize_map[col_name] = norm
|
||||||
|
|
||||||
|
with col_b:
|
||||||
|
merge = st.checkbox(
|
||||||
|
"Merge mode",
|
||||||
|
value=False,
|
||||||
|
help="Fill missing fields in the surviving row from removed duplicates.",
|
||||||
|
)
|
||||||
|
date_column: Optional[str] = None
|
||||||
|
if survivor == "most-recent":
|
||||||
|
date_column = st.selectbox(
|
||||||
|
"Date column",
|
||||||
|
columns,
|
||||||
|
help="Required for most-recent survivor rule.",
|
||||||
|
)
|
||||||
|
|
||||||
|
# Config save/load
|
||||||
|
st.divider()
|
||||||
|
cfg_left, cfg_right = st.columns(2)
|
||||||
|
|
||||||
|
with cfg_left:
|
||||||
|
config_file = st.file_uploader(
|
||||||
|
"Load config profile",
|
||||||
|
type=["json"],
|
||||||
|
help="Load previously saved settings.",
|
||||||
|
key="config_upload",
|
||||||
|
)
|
||||||
|
if config_file is not None:
|
||||||
|
import json
|
||||||
|
try:
|
||||||
|
data = json.loads(config_file.read())
|
||||||
|
loaded = DeduplicationConfig.from_dict(data)
|
||||||
|
st.session_state["loaded_config"] = loaded
|
||||||
|
st.success("Config loaded.")
|
||||||
|
except Exception as e:
|
||||||
|
st.error(f"Failed to load config: {e}")
|
||||||
|
|
||||||
|
with cfg_right:
|
||||||
|
if st.button("Save current settings"):
|
||||||
|
cfg = _build_config(
|
||||||
|
subset_cols, key_cols, fuzzy_cols,
|
||||||
|
algorithm, threshold, normalize_map,
|
||||||
|
survivor, date_column, merge,
|
||||||
|
)
|
||||||
|
cfg_json = cfg.to_dict()
|
||||||
|
import json
|
||||||
|
st.download_button(
|
||||||
|
"Download config JSON",
|
||||||
|
data=json.dumps(cfg_json, indent=2),
|
||||||
|
file_name="dedup_config.json",
|
||||||
|
mime="application/json",
|
||||||
|
)
|
||||||
|
|
||||||
|
# Build strategies from selections
|
||||||
|
strategies = _build_strategies(
|
||||||
|
subset_cols, key_cols, fuzzy_cols,
|
||||||
|
algorithm, threshold, normalize_map,
|
||||||
|
)
|
||||||
|
|
||||||
|
# Survivor rule mapping
|
||||||
|
survivor_map = {
|
||||||
|
"first": SurvivorRule.KEEP_FIRST,
|
||||||
|
"last": SurvivorRule.KEEP_LAST,
|
||||||
|
"most-complete": SurvivorRule.KEEP_MOST_COMPLETE,
|
||||||
|
"most-recent": SurvivorRule.KEEP_MOST_RECENT,
|
||||||
|
}
|
||||||
|
|
||||||
|
return {
|
||||||
|
"strategies": strategies,
|
||||||
|
"survivor_rule": survivor_map[survivor],
|
||||||
|
"date_column": date_column,
|
||||||
|
"merge": merge,
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def _build_strategies(
|
||||||
|
subset_cols: list[str],
|
||||||
|
key_cols: list[str],
|
||||||
|
fuzzy_cols: list[str],
|
||||||
|
algorithm: str,
|
||||||
|
threshold: int,
|
||||||
|
normalize_map: dict[str, str],
|
||||||
|
) -> Optional[list[MatchStrategy]]:
|
||||||
|
"""Build MatchStrategy list from GUI selections. Returns None for auto-detect."""
|
||||||
|
strategies: list[MatchStrategy] = []
|
||||||
|
|
||||||
|
# If user selected columns explicitly, build from those
|
||||||
|
if subset_cols or fuzzy_cols:
|
||||||
|
target_cols = subset_cols if subset_cols else fuzzy_cols
|
||||||
|
fuzzy_set = set(fuzzy_cols)
|
||||||
|
col_strats: list[ColumnMatchStrategy] = []
|
||||||
|
for col in target_cols:
|
||||||
|
norm = None
|
||||||
|
if col in normalize_map:
|
||||||
|
norm = NormalizerType(normalize_map[col])
|
||||||
|
if col in fuzzy_set:
|
||||||
|
algo = Algorithm(algorithm)
|
||||||
|
thresh = float(threshold)
|
||||||
|
else:
|
||||||
|
algo = Algorithm.EXACT
|
||||||
|
thresh = 100.0
|
||||||
|
col_strats.append(ColumnMatchStrategy(
|
||||||
|
column=col, algorithm=algo, threshold=thresh, normalizer=norm,
|
||||||
|
))
|
||||||
|
strategies.append(MatchStrategy(column_strategies=col_strats))
|
||||||
|
|
||||||
|
# Add strong key strategies
|
||||||
|
if key_cols:
|
||||||
|
for col in key_cols:
|
||||||
|
strategies.append(MatchStrategy(column_strategies=[
|
||||||
|
ColumnMatchStrategy(column=col, algorithm=Algorithm.EXACT, threshold=100.0)
|
||||||
|
]))
|
||||||
|
|
||||||
|
return strategies if strategies else None
|
||||||
|
|
||||||
|
|
||||||
|
def _build_config(
|
||||||
|
subset_cols, key_cols, fuzzy_cols,
|
||||||
|
algorithm, threshold, normalize_map,
|
||||||
|
survivor, date_column, merge,
|
||||||
|
) -> DeduplicationConfig:
|
||||||
|
"""Build a DeduplicationConfig from GUI state."""
|
||||||
|
cfg = DeduplicationConfig(
|
||||||
|
survivor_rule=survivor.replace("-", "_"),
|
||||||
|
date_column=date_column,
|
||||||
|
merge=merge,
|
||||||
|
subset_columns=subset_cols or None,
|
||||||
|
fuzzy_columns=fuzzy_cols or None,
|
||||||
|
default_algorithm=algorithm,
|
||||||
|
default_threshold=float(threshold),
|
||||||
|
normalize_map=normalize_map or None,
|
||||||
|
)
|
||||||
|
strategies = _build_strategies(
|
||||||
|
subset_cols, key_cols, fuzzy_cols,
|
||||||
|
algorithm, threshold, normalize_map,
|
||||||
|
)
|
||||||
|
if strategies:
|
||||||
|
cfg.strategies = [
|
||||||
|
StrategyConfig(columns=[
|
||||||
|
ColumnStrategyConfig(
|
||||||
|
column=cs.column,
|
||||||
|
algorithm=cs.algorithm.value,
|
||||||
|
threshold=cs.threshold,
|
||||||
|
normalizer=cs.normalizer.value if cs.normalizer else None,
|
||||||
|
)
|
||||||
|
for cs in s.column_strategies
|
||||||
|
])
|
||||||
|
for s in strategies
|
||||||
|
]
|
||||||
|
return cfg
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Match group review card
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
def match_group_card(
|
||||||
|
group: MatchResult,
|
||||||
|
df: pd.DataFrame,
|
||||||
|
group_num: int,
|
||||||
|
) -> Optional[bool]:
|
||||||
|
"""Render an expandable match group card with side-by-side diff.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
True — user clicked Merge (accept match)
|
||||||
|
False — user clicked Keep Both (reject match)
|
||||||
|
None — no decision yet
|
||||||
|
"""
|
||||||
|
confidence = group.confidence
|
||||||
|
auto_expand = confidence < 95.0
|
||||||
|
matched_on = ", ".join(group.matched_on)
|
||||||
|
n_rows = len(group.row_indices)
|
||||||
|
|
||||||
|
label = (
|
||||||
|
f"Group {group_num}: {n_rows} rows "
|
||||||
|
f"(confidence: {confidence:.0f}%) "
|
||||||
|
f"[{matched_on}]"
|
||||||
|
)
|
||||||
|
|
||||||
|
with st.expander(label, expanded=auto_expand):
|
||||||
|
# Build comparison DataFrame
|
||||||
|
display_cols = [c for c in df.columns if not str(c).startswith("_norm_")]
|
||||||
|
rows_data = []
|
||||||
|
for idx in group.row_indices:
|
||||||
|
row = {"_row": idx + 1}
|
||||||
|
for col in display_cols:
|
||||||
|
row[col] = df.iloc[idx].get(col, "")
|
||||||
|
rows_data.append(row)
|
||||||
|
|
||||||
|
compare_df = pd.DataFrame(rows_data)
|
||||||
|
compare_df = compare_df.set_index("_row")
|
||||||
|
|
||||||
|
# Highlight differences
|
||||||
|
def _highlight_diffs(s: pd.Series) -> list[str]:
|
||||||
|
"""Highlight cells that differ from the first row."""
|
||||||
|
styles = []
|
||||||
|
first_val = str(s.iloc[0]).strip() if len(s) > 0 else ""
|
||||||
|
for val in s:
|
||||||
|
val_str = str(val).strip()
|
||||||
|
if val_str != first_val and val_str and first_val:
|
||||||
|
styles.append("background-color: rgba(245, 166, 35, 0.2)")
|
||||||
|
elif not val_str and first_val:
|
||||||
|
styles.append("background-color: rgba(240, 82, 82, 0.1)")
|
||||||
|
else:
|
||||||
|
styles.append("")
|
||||||
|
return styles
|
||||||
|
|
||||||
|
styled = compare_df.style.apply(_highlight_diffs, axis=0)
|
||||||
|
st.dataframe(styled, use_container_width=True)
|
||||||
|
|
||||||
|
# Action buttons
|
||||||
|
btn_left, btn_mid, btn_right = st.columns(3)
|
||||||
|
merge_key = f"merge_{group.group_id}"
|
||||||
|
keep_key = f"keep_{group.group_id}"
|
||||||
|
|
||||||
|
with btn_left:
|
||||||
|
if st.button("Merge", key=merge_key, type="primary"):
|
||||||
|
return True
|
||||||
|
with btn_mid:
|
||||||
|
if st.button("Keep Both", key=keep_key):
|
||||||
|
return False
|
||||||
|
|
||||||
|
# Check session state for previous decisions
|
||||||
|
decisions = st.session_state.get("review_decisions", {})
|
||||||
|
if group.group_id in decisions:
|
||||||
|
decision = decisions[group.group_id]
|
||||||
|
if decision is True:
|
||||||
|
st.success("Decision: Merge")
|
||||||
|
elif decision is False:
|
||||||
|
st.info("Decision: Keep Both")
|
||||||
|
|
||||||
|
return None
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Results summary + downloads
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
def results_summary(
|
||||||
|
result: DeduplicationResult,
|
||||||
|
original_df: pd.DataFrame,
|
||||||
|
) -> None:
|
||||||
|
"""Render summary stats and download buttons."""
|
||||||
|
removed = result.original_row_count - len(result.deduplicated_df)
|
||||||
|
|
||||||
|
# Summary metrics
|
||||||
|
col1, col2, col3, col4 = st.columns(4)
|
||||||
|
col1.metric("Rows In", result.original_row_count)
|
||||||
|
col2.metric("Rows Out", len(result.deduplicated_df))
|
||||||
|
col3.metric("Removed", removed)
|
||||||
|
col4.metric("Groups", len(result.match_groups))
|
||||||
|
|
||||||
|
st.divider()
|
||||||
|
|
||||||
|
# Download buttons
|
||||||
|
dl_left, dl_mid, dl_right = st.columns(3)
|
||||||
|
|
||||||
|
with dl_left:
|
||||||
|
csv_bytes = result.deduplicated_df.to_csv(index=False).encode("utf-8-sig")
|
||||||
|
st.download_button(
|
||||||
|
"Download Deduplicated CSV",
|
||||||
|
data=csv_bytes,
|
||||||
|
file_name="deduplicated.csv",
|
||||||
|
mime="text/csv",
|
||||||
|
)
|
||||||
|
|
||||||
|
with dl_mid:
|
||||||
|
if not result.removed_df.empty:
|
||||||
|
removed_bytes = result.removed_df.to_csv(index=False).encode("utf-8-sig")
|
||||||
|
st.download_button(
|
||||||
|
"Download Removed Rows",
|
||||||
|
data=removed_bytes,
|
||||||
|
file_name="removed_rows.csv",
|
||||||
|
mime="text/csv",
|
||||||
|
)
|
||||||
|
|
||||||
|
with dl_right:
|
||||||
|
if result.match_groups:
|
||||||
|
groups_data = _build_match_groups_csv(result, original_df)
|
||||||
|
st.download_button(
|
||||||
|
"Download Match Groups Report",
|
||||||
|
data=groups_data,
|
||||||
|
file_name="match_groups.csv",
|
||||||
|
mime="text/csv",
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def _build_match_groups_csv(
|
||||||
|
result: DeduplicationResult,
|
||||||
|
original_df: pd.DataFrame,
|
||||||
|
) -> bytes:
|
||||||
|
"""Build the match groups audit CSV as bytes."""
|
||||||
|
rows = []
|
||||||
|
for g in result.match_groups:
|
||||||
|
for idx in g.row_indices:
|
||||||
|
row_data = {
|
||||||
|
"_group_id": g.group_id + 1,
|
||||||
|
"_is_survivor": idx == g.survivor_index,
|
||||||
|
"_confidence": g.confidence,
|
||||||
|
"_matched_on": ", ".join(g.matched_on),
|
||||||
|
"_original_row": idx + 1,
|
||||||
|
}
|
||||||
|
for col in original_df.columns:
|
||||||
|
if not str(col).startswith("_norm_"):
|
||||||
|
row_data[col] = original_df.iloc[idx].get(col, "") if idx < len(original_df) else ""
|
||||||
|
rows.append(row_data)
|
||||||
|
|
||||||
|
groups_df = pd.DataFrame(rows)
|
||||||
|
return groups_df.to_csv(index=False).encode("utf-8-sig")
|
||||||
11
test-cases/ec01_encoding_windows1252.csv
Normal file
11
test-cases/ec01_encoding_windows1252.csv
Normal file
@@ -0,0 +1,11 @@
|
|||||||
|
customer_id,name,email,city
|
||||||
|
E001,Björn Andersson,bjorn@example.com,Malmö
|
||||||
|
E002,Bjorn Andersson,bjorn@example.com,Malmo
|
||||||
|
E003,François Dupont,f.dupont@example.com,Montréal
|
||||||
|
E004,Francois Dupont,f.dupont@example.com,Montreal
|
||||||
|
E005,José García,jgarcia@example.com,México
|
||||||
|
E006,Jose Garcia,jgarcia@example.com,Mexico
|
||||||
|
E007,Müller GmbH,kontakt@muller.de,Köln
|
||||||
|
E008,Muller GmbH,kontakt@muller.de,Koln
|
||||||
|
E009,Unique Person,unique@example.com,London
|
||||||
|
E010,Another Unique,another@example.com,Paris
|
||||||
|
6
test-cases/ec02_delimiter_semicolon.csv
Normal file
6
test-cases/ec02_delimiter_semicolon.csv
Normal file
@@ -0,0 +1,6 @@
|
|||||||
|
id;name;email;amount
|
||||||
|
1;Test User;test1@example.com;1.234,56
|
||||||
|
2;Test User;TEST1@example.com;1.234,56
|
||||||
|
3;Other User;test2@example.com;987,00
|
||||||
|
4;Other User;test2@example.com;987,00
|
||||||
|
5;Unique;unique@example.com;500,00
|
||||||
|
6
test-cases/ec03_delimiter_tab.tsv
Normal file
6
test-cases/ec03_delimiter_tab.tsv
Normal file
@@ -0,0 +1,6 @@
|
|||||||
|
sku product price
|
||||||
|
A-100 Widget 9.99
|
||||||
|
A-100 Widget Standard 9.99
|
||||||
|
A-101 Widget Plus 12.99
|
||||||
|
A-100 WIDGET 9.99
|
||||||
|
B-200 Gadget 15.00
|
||||||
|
6
test-cases/ec04_utf8_bom.csv
Normal file
6
test-cases/ec04_utf8_bom.csv
Normal file
@@ -0,0 +1,6 @@
|
|||||||
|
customer_id,name,email
|
||||||
|
B001,Andersén,andersen@example.com
|
||||||
|
B002,Andersen,andersen@example.com
|
||||||
|
B003,Smith,smith@example.com
|
||||||
|
B004,SMITH,Smith@Example.com
|
||||||
|
B005,Lee,lee@example.com
|
||||||
|
16
test-cases/uc01_shopify_customer_list.csv
Normal file
16
test-cases/uc01_shopify_customer_list.csv
Normal file
@@ -0,0 +1,16 @@
|
|||||||
|
customer_id,first_name,last_name,email,phone,signup_date
|
||||||
|
C001,John,Smith,john.smith@gmail.com,(555) 123-4567,2025-01-15
|
||||||
|
C002,John,Smith,JohnSmith@Gmail.com,555-123-4567,2025-02-03
|
||||||
|
C003,John,Smith,j.o.h.n.smith+shop@gmail.com,+15551234567,2025-02-19
|
||||||
|
C004,Jon,Smith,jon.smith@yahoo.com,(555) 987-6543,2025-01-22
|
||||||
|
C005,Maria,Garcia,maria@example.com,555.222.3333,2025-03-01
|
||||||
|
C006,Maria,Garcia,maria@example.com,5552223333,2025-03-10
|
||||||
|
C007,Maria,Garcia,maria@example.com,+1 (555) 222-3333,2025-03-15
|
||||||
|
C008,David,Lee,david.lee@work.com,(555) 444-1111,2025-02-01
|
||||||
|
C009,David,Lee,dlee@personal.com,555-444-1111,2025-02-28
|
||||||
|
C010,Sarah,Jones,sarah@example.com ,(555) 666-7777,2025-01-08
|
||||||
|
C011,Sarah,Jones,sarah@example.com,(555) 666-7777,2025-01-09
|
||||||
|
C012,Alice,Wong,alice.wong@example.com,(555) 333-9999,2025-03-22
|
||||||
|
C013,Robert,Brown,rob.brown@example.com,(555) 888-2222,2025-03-25
|
||||||
|
C014,Catherine,Doe,cdoe@example.com,(555) 111-0000,2025-01-30
|
||||||
|
C015,Katherine,Doe,cdoe@example.com,(555) 111-0000,2025-04-02
|
||||||
|
16
test-cases/uc02_product_catalog.csv
Normal file
16
test-cases/uc02_product_catalog.csv
Normal file
@@ -0,0 +1,16 @@
|
|||||||
|
sku,product_name,price,stock,category
|
||||||
|
DOG-001,Dog Collar - Red - Large,12.99,45,Collars
|
||||||
|
DOG-001 ,Dog Collar Red Large,12.99,12,Collars
|
||||||
|
dog-001,"Dog Collar, Red, L",12.99,8,Collars
|
||||||
|
DOG-001 ,Red Dog Collar (Large),12.99,3,Collars
|
||||||
|
DOG-002,Dog Collar - Red - Medium,11.99,20,Collars
|
||||||
|
DOG-002,Dog Collar Red Medium,11.99,15,Collars
|
||||||
|
CAT-100,Cat Scratching Post 36 inch,29.99,10,Furniture
|
||||||
|
CAT-100,"Cat Scratching Post 36""",29.99,5,Furniture
|
||||||
|
CAT-100,Cat Scratch Post 36in,29.99,2,Furniture
|
||||||
|
CAT-101,Cat Scratching Post 36 inch,31.99,8,Furniture
|
||||||
|
BIRD-50,Parrot Cage Large,89.00,4,Cages
|
||||||
|
BIRD-50,Parrot Cage Large,89.00,1,Cages
|
||||||
|
FISH-22,Aquarium Filter 20gal,45.50,12,Aquatics
|
||||||
|
RABT-7,Rabbit Hutch Outdoor,199.99,3,Cages
|
||||||
|
DOG-555,Dog Bed Memory Foam XL,75.00,8,Beds
|
||||||
|
16
test-cases/uc03_abandoned_carts.csv
Normal file
16
test-cases/uc03_abandoned_carts.csv
Normal file
@@ -0,0 +1,16 @@
|
|||||||
|
cart_id,email,abandoned_at,cart_value,items_count
|
||||||
|
AC1001,buyer1@example.com,2026-04-10 14:23:00,87.50,3
|
||||||
|
AC1018,buyer1@example.com,2026-04-12 09:11:00,120.00,4
|
||||||
|
AC1042,BUYER1@example.com,2026-04-15 18:55:00,65.25,2
|
||||||
|
AC1003,buyer2@example.com,2026-04-10 10:00:00,45.00,1
|
||||||
|
AC1009,buyer2@example.com,2026-04-10 16:30:00,52.00,2
|
||||||
|
AC1011,j.doe@gmail.com,2026-04-11 12:00:00,200.00,5
|
||||||
|
AC1027,jdoe@gmail.com,2026-04-13 14:00:00,180.00,4
|
||||||
|
AC1004,alice@example.com,2026-04-10 11:00:00,30.00,1
|
||||||
|
AC1006,bob@example.com,2026-04-10 13:00:00,75.00,2
|
||||||
|
AC1019,carol@example.com,2026-04-12 15:00:00,90.00,3
|
||||||
|
AC1031,dan@example.com,2026-04-13 17:00:00,55.00,2
|
||||||
|
AC1050,eve@example.com,2026-04-16 08:00:00,112.00,3
|
||||||
|
AC1051,eve@example.com,2026-04-16 08:01:00,112.00,3
|
||||||
|
AC1099,frank@example.com,2026-01-05 10:00:00,40.00,1
|
||||||
|
AC1187,frank@example.com,2026-04-18 15:00:00,60.00,2
|
||||||
|
15
test-cases/uc04_orders_consolidated.csv
Normal file
15
test-cases/uc04_orders_consolidated.csv
Normal file
@@ -0,0 +1,15 @@
|
|||||||
|
source_order_id,source,order_date,customer_email,customer_name,total
|
||||||
|
SHOP-1001,shopify,2026-04-01,kara.miller@gmail.com,Kara Miller,45.99
|
||||||
|
AMZ-A789X,amazon,2026-04-03,kara.miller@gmail.com,K Miller,32.50
|
||||||
|
MAN-2026-04,manual,2026-04-05,kara@miller-design.com,Kara Miller,120.00
|
||||||
|
SHOP-1023,shopify,2026-04-08,tom.harris@example.com,Tom Harris,67.00
|
||||||
|
MAN-2026-09,manual,2026-04-10,tom.haris@example.com,Tom Harris,55.00
|
||||||
|
SHOP-1045,shopify,2026-04-12,jenny.l@example.com,Jenny Lee,89.00
|
||||||
|
AMZ-B221Y,amazon,2026-04-15,JENNY.L@EXAMPLE.COM,Jenny L.,44.50
|
||||||
|
SHOP-2001,shopify,2026-04-02,alex@example.com,Alex Park,33.00
|
||||||
|
SHOP-2002,shopify,2026-04-02,amy@example.com,Amy Park,33.00
|
||||||
|
SHOP-3001,shopify,2026-04-06,single1@example.com,Single Buyer One,78.00
|
||||||
|
AMZ-C100Z,amazon,2026-04-07,single2@example.com,Single Buyer Two,92.00
|
||||||
|
MAN-2026-12,manual,2026-04-09,single3@example.com,Single Buyer Three,150.00
|
||||||
|
SHOP-4001,shopify,2026-04-11,sales@bigcorp.com,Mike Johnson,200.00
|
||||||
|
SHOP-4012,shopify,2026-04-14,sales@bigcorp.com,Sarah Wright,175.00
|
||||||
|
16
test-cases/uc05_subscriber_list.csv
Normal file
16
test-cases/uc05_subscriber_list.csv
Normal file
@@ -0,0 +1,16 @@
|
|||||||
|
email,first_name,source,subscribed_date,tags
|
||||||
|
sub1@example.com,Pat,newsletter_signup,2025-08-15,newsletter
|
||||||
|
SUB1@example.com,Pat,facebook_lead_form,2025-09-01,fb_q3_2025
|
||||||
|
sub1@EXAMPLE.com,Patricia,checkout_optin,2025-10-12,customer
|
||||||
|
sub1@example.com ,Pat W,popup_form,2026-01-08,popup_homepage
|
||||||
|
mike.smith@gmail.com,Mike,import_2024,2024-11-20,legacy
|
||||||
|
mikesmith@gmail.com,Mike S,newsletter_signup,2025-03-15,newsletter
|
||||||
|
m.i.k.e.s.m.i.t.h@gmail.com,Michael,facebook_lead_form,2025-07-22,fb_q3_2025
|
||||||
|
promos+freebie@gmail.com,Sam,freebie_download,2025-12-01,lead_magnet
|
||||||
|
promos@gmail.com,Sam,newsletter_signup,2026-02-15,newsletter
|
||||||
|
unique1@example.com,Alpha,newsletter_signup,2025-09-10,newsletter
|
||||||
|
unique2@example.com,Beta,popup_form,2025-10-05,popup_homepage
|
||||||
|
unique3@example.com,Gamma,facebook_lead_form,2025-11-12,fb_q4_2025
|
||||||
|
unique4@example.com,Delta,checkout_optin,2026-01-20,customer
|
||||||
|
jdoe@company-a.com,Jane Doe,newsletter_signup,2025-08-01,newsletter
|
||||||
|
jdoe@company-b.com,Jane Doe,newsletter_signup,2025-08-02,newsletter
|
||||||
|
19
test-cases/uc06_bank_export_overlap.csv
Normal file
19
test-cases/uc06_bank_export_overlap.csv
Normal file
@@ -0,0 +1,19 @@
|
|||||||
|
txn_date,description,amount,balance_after,export_batch
|
||||||
|
2026-04-10,ACH DEPOSIT PAYROLL,2500.00,5230.45,exportA
|
||||||
|
2026-04-10,ACH Dep Payroll,2500.00,5230.45,exportB
|
||||||
|
2026-04-11,POS PURCHASE STARBUCKS #4421,-6.75,5223.70,exportA
|
||||||
|
2026-04-11,POS Purchase Starbucks 4421,-6.75,5223.70,exportB
|
||||||
|
2026-04-12,CHECK #1042,-450.00,4773.70,exportA
|
||||||
|
2026-04-12,CHECK 1042,-450.00,4773.70,exportB
|
||||||
|
2026-04-13,ATM WITHDRAWAL ATM0019,-100.00,4673.70,exportA
|
||||||
|
2026-04-13,ATM Withdrawal ATM0019,-100.00,4673.70,exportB
|
||||||
|
2026-04-15,ONLINE TRANSFER TO SAVINGS,-200.00,4473.70,exportA
|
||||||
|
2026-04-15,Online Transfer to Savings,-200.00,4473.70,exportB
|
||||||
|
2026-04-02,ACH DEBIT UTILITY CO,-145.00,2730.45,exportA
|
||||||
|
2026-04-05,POS PURCHASE GROCERY MART,-87.32,2643.13,exportA
|
||||||
|
2026-04-08,DEPOSIT MOBILE CHECK,200.00,2843.13,exportA
|
||||||
|
2026-04-20,ACH DEBIT INSURANCE CO,-220.00,4253.70,exportB
|
||||||
|
2026-04-22,POS PURCHASE GAS STATION,-45.10,4208.60,exportB
|
||||||
|
2026-04-25,INTEREST PAYMENT,0.85,4209.45,exportB
|
||||||
|
2026-04-11,POS PURCHASE STARBUCKS #4421,-6.75,5216.95,exportA
|
||||||
|
2026-04-11,POS PURCHASE PEET'S #1102,-6.75,5210.20,exportA
|
||||||
|
16
test-cases/uc07_vendor_consolidation.csv
Normal file
16
test-cases/uc07_vendor_consolidation.csv
Normal file
@@ -0,0 +1,16 @@
|
|||||||
|
vendor_name,ein,contact_email,phone,source
|
||||||
|
Acme Corp,12-3456789,billing@acme.com,(555) 100-2000,quickbooks
|
||||||
|
Acme Corporation,12-3456789,ap@acme.com,555-100-2000,spreadsheet
|
||||||
|
ACME CORP.,12-3456789,billing@acme.com,5551002000,email_extract
|
||||||
|
Beta Solutions LLC,98-7654321,info@betasolutions.com,(555) 200-3000,quickbooks
|
||||||
|
"Beta Solutions, LLC",,billing@betasolutions.com,555.200.3000,spreadsheet
|
||||||
|
beta solutions,98-7654321,,(555) 200-3000,email_extract
|
||||||
|
Smith Consulting Inc,11-1111111,ap@smithconsulting.com,(555) 300-4000,quickbooks
|
||||||
|
Smith Consulting LLC,22-2222222,ap@smith-consulting.com,(555) 300-4001,quickbooks
|
||||||
|
Gamma Industries Inc,33-3333333,ap@gamma.com,(555) 400-5000,quickbooks
|
||||||
|
Gamma Industries Inc DBA QuickPrint,33-3333333,billing@quickprint.com,(555) 400-5001,spreadsheet
|
||||||
|
"Delta Services, Inc. ",44-4444444,ap@delta.com,(555) 500-6000,email_extract
|
||||||
|
Delta Services Inc,44-4444444,ap@delta.com,(555) 500-6000,quickbooks
|
||||||
|
Unique Vendor One,55-5555555,u1@example.com,(555) 600-7000,quickbooks
|
||||||
|
Unique Vendor Two,66-6666666,u2@example.com,(555) 700-8000,spreadsheet
|
||||||
|
Unique Vendor Three,77-7777777,u3@example.com,(555) 800-9000,email_extract
|
||||||
|
15
test-cases/uc08_customer_master_merge.csv
Normal file
15
test-cases/uc08_customer_master_merge.csv
Normal file
@@ -0,0 +1,15 @@
|
|||||||
|
customer_id,name,email,phone,address,city,state,zip,last_purchase
|
||||||
|
CUST-A1,Linda Park,linda.park@example.com,,,,,,2025-03-15
|
||||||
|
CUST-A2,Linda Park,,(555) 123-9999,1234 Oak St,Portland,OR,97201,
|
||||||
|
CUST-A3,L. Park,linda.park@example.com,(555) 123-9999,1234 Oak Street,,,97201,2026-01-10
|
||||||
|
CUST-B1,James Wilson,jwilson@example.com,(555) 444-8888,,Austin,TX,,2025-11-22
|
||||||
|
CUST-B2,James Wilson,jwilson@example.com,,789 Pine Ave,Austin,TX,78701,
|
||||||
|
CUST-C1,Anna Chen,anna.chen@example.com,,,,,,2024-08-01
|
||||||
|
CUST-C2,Anna Chen,anna.chen@example.com,(555) 222-7777,,,,,2025-02-14
|
||||||
|
CUST-C3,Anna Chen,,(555) 222-7777,456 Elm Dr,Seattle,WA,,2025-09-30
|
||||||
|
CUST-C4,A Chen,anna.chen@example.com,(555) 222-7777,456 Elm Drive,Seattle,WA,98101,2026-03-20
|
||||||
|
CUST-D1,Ricardo Lopez,rlopez@example.com,(555) 666-1212,999 Maple Ln,Denver,CO,80202,2026-02-05
|
||||||
|
CUST-E1,John A Brown,jabrown@example.com,(555) 111-3333,111 First St,Boston,MA,02101,2026-01-05
|
||||||
|
CUST-E2,John B Brown,jbbrown@example.com,(555) 111-4444,222 Second St,Boston,MA,02102,2026-01-06
|
||||||
|
CUST-F1,Maria Costa,,,333 Bay Rd,Miami,FL,33101,
|
||||||
|
CUST-F2,Maria Costa,mcosta@example.com,(555) 777-2222,333 Bay Rd,Miami,FL,33101,2026-04-01
|
||||||
|
15
test-cases/uc09_expense_reports.csv
Normal file
15
test-cases/uc09_expense_reports.csv
Normal file
@@ -0,0 +1,15 @@
|
|||||||
|
expense_id,employee,expense_date,vendor,amount,description,submitted_at
|
||||||
|
EXP-001,Tom R,2026-03-15,Marriott,234.56,Hotel client visit,2026-03-18 09:00
|
||||||
|
EXP-019,Tom R,2026-03-15,Marriott Hotels,234.56,Hotel - client mtg,2026-04-02 14:30
|
||||||
|
EXP-005,Tom R,2026-03-16,Steakhouse,187.45,Client dinner with team,2026-03-19 10:15
|
||||||
|
EXP-006,Sara K,2026-03-16,Steakhouse,187.45,Client dinner,2026-03-19 11:00
|
||||||
|
EXP-010,Mike P,2026-03-20,Local Cafe,12.50,Lunch with vendor,2026-03-21 09:00
|
||||||
|
EXP-011,Mike P,2026-03-20,Local Cafe,8.75,Coffee meeting,2026-03-21 09:01
|
||||||
|
EXP-022,Anna L,2026-03-25,Uber,45.20,Airport transfer,2026-03-26 08:00
|
||||||
|
EXP-024,Anna L,2026-03-25,Uber,45.20,Airport transfer to hotel,2026-03-26 08:15
|
||||||
|
EXP-027,Anna L,2026-03-25,UBER,45.20,airport->hotel,2026-03-26 09:30
|
||||||
|
EXP-033,Ben T,2026-03-28,Office Depot,89.99,Printer ink,2026-03-30 13:00
|
||||||
|
EXP-040,Tom R,2026-04-01,Hyatt,234.56,Hotel different city,2026-04-03 10:00
|
||||||
|
EXP-050,Sara K,2026-04-05,American Airlines,412.00,Flight to Chicago,2026-04-06 09:00
|
||||||
|
EXP-051,Mike P,2026-04-05,Hertz,189.50,Rental car,2026-04-06 10:00
|
||||||
|
EXP-052,Anna L,2026-04-06,Subway,11.25,Lunch,2026-04-07 12:00
|
||||||
|
17
test-cases/uc10_client_data_dump_messy.csv
Normal file
17
test-cases/uc10_client_data_dump_messy.csv
Normal file
@@ -0,0 +1,17 @@
|
|||||||
|
id,full_name,email,phone,company,notes
|
||||||
|
1,Alice Johnson , alice@ex.com ,(555) 100-1000,Ex Corp ,VIP client
|
||||||
|
2,alice johnson,alice@ex.com,5551001000,Ex Corp,
|
||||||
|
3,Bob O’Brien,bob.obrien@example.com,(555) 200-2000,OBrien LLC,intro by Alice
|
||||||
|
4,Bob O'Brien,BOB.OBRIEN@example.com,555-200-2000,O'Brien LLC,Intro by Alice
|
||||||
|
5,Carol Wu,carol@example.com,(555) 300-3000,WuCo,follow up
|
||||||
|
6,Carol Wu,carol@example.com,(555) 300-3000,WuCo,follow up Q2
|
||||||
|
7,Derek Hall,dhall@example.com,(555) 400-4000,Hall Industries,
|
||||||
|
8,Derek Hall,dhall@example.com,(555) 400-4099,Hall Inds.,
|
||||||
|
9,Eve Martin,eve@example.com,(555) 500-5000,Martin Co,lead
|
||||||
|
10,Eve Martin,eve@example.com,(555) 500-5000,Martin Co,lead
|
||||||
|
11,Frank Ng,fng@example.com,(555) 600-6000,Ng Group,
|
||||||
|
12,Grace Park,gpark@example.com,(555) 700-7000,Park & Co,
|
||||||
|
13,Henry Lo,hlo@example.com,(555) 800-8000,Lo Holdings Inc.,renewal
|
||||||
|
14,Henry Lo,hlo@example.com,(555) 800-8000,"Lo Holdings, Inc",renewal Q2
|
||||||
|
15,Ivan Kim,ikim@example.com,(555) 900-9000,Kim Solutions,
|
||||||
|
16,Ivan Kim,ikim@example.com,(555) 900-9000,Kim Solutions,
|
||||||
|
15
test-cases/uc11_survey_responses.csv
Normal file
15
test-cases/uc11_survey_responses.csv
Normal file
@@ -0,0 +1,15 @@
|
|||||||
|
response_id,submitted_at,ip_address,respondent_email,q1_satisfaction,q2_recommend,q3_comments
|
||||||
|
R0001,2026-04-20 09:15:22,192.168.1.50,user1@example.com,5,Yes,Great service
|
||||||
|
R0002,2026-04-20 09:17:45,10.0.0.12,user1@example.com,5,Yes,Great service
|
||||||
|
R0010,2026-04-20 10:30:00,192.168.1.51,user2@example.com,4,Yes,Good but could improve
|
||||||
|
R0011,2026-04-20 10:35:12,192.168.1.52,user2@example.com,5,Yes,Good but could improve
|
||||||
|
R0020,2026-04-20 11:00:00,192.168.1.55,User3@Example.com,3,Maybe,OK
|
||||||
|
R0021,2026-04-20 11:02:30,192.168.1.55,user3@example.com,3,Maybe,OK
|
||||||
|
R0030,2026-04-20 14:00:00,73.55.10.100,spouse1@example.com,5,Yes,Loved it
|
||||||
|
R0031,2026-04-20 14:30:00,73.55.10.100,spouse2@example.com,4,Yes,Liked it
|
||||||
|
R0040,2026-04-05 08:00:00,192.168.2.10,user4@example.com,3,Maybe,Mixed
|
||||||
|
R0041,2026-04-19 15:00:00,192.168.2.11,user4@example.com,5,Yes,Got better
|
||||||
|
R0050,2026-04-20 08:00:00,100.10.10.10,unique1@example.com,5,Yes,Excellent
|
||||||
|
R0051,2026-04-20 09:00:00,100.10.10.11,unique2@example.com,4,Yes,Solid
|
||||||
|
R0052,2026-04-20 10:00:00,100.10.10.12,unique3@example.com,2,No,Disappointing
|
||||||
|
R0053,2026-04-20 11:00:00,100.10.10.13,unique4@example.com,5,Yes,Recommended
|
||||||
|
15
test-cases/uc12_lead_list_handoff.csv
Normal file
15
test-cases/uc12_lead_list_handoff.csv
Normal file
@@ -0,0 +1,15 @@
|
|||||||
|
lead_id,first_name,last_name,email,phone,company,lead_source,captured_date
|
||||||
|
L001,Patricia,Anders,p.anders@bigco.com,(555) 100-2222,Big Co,trade_show,2026-02-15
|
||||||
|
L045,Pat,Anders,p.anders@bigco.com,5551002222,Big Co Inc,linkedin,2026-03-01
|
||||||
|
L091,Patty,Anders,panders@bigco.com,(555) 100-2222,BigCo,webinar,2026-03-22
|
||||||
|
L010,Mohammed,Khan,m.khan@example.com,(555) 200-3333,Khan Industries,trade_show,2026-02-15
|
||||||
|
L032,Muhammad,Khan,mkhan@example.com,(555) 200-3333,Khan Industries,cold_outreach,2026-02-28
|
||||||
|
L067,Mohammad,Khan,m.khan@example.com,(555) 200-3333,Khan Industries Inc,referral,2026-03-10
|
||||||
|
L020,Sarah,"Lee, VP Marketing",slee@target.com,(555) 300-4444,Target Corp,linkedin,2026-02-20
|
||||||
|
L058,Sarah,Lee,slee@target.com,(555) 300-4444,Target Corp,webinar,2026-03-05
|
||||||
|
L080,John,Adams,jadams@enterprise.com,(555) 400-5555,Enterprise Inc,trade_show,2026-03-15
|
||||||
|
L081,John,Adams II,jadams2@enterprise.com,(555) 400-5556,Enterprise Inc,trade_show,2026-03-15
|
||||||
|
L100,Wei,Chen,wchen@startup.io,(555) 500-6666,Startup IO,referral,2026-03-20
|
||||||
|
L101,Olivia,Mendez,omendez@agency.com,(555) 600-7777,Agency Co,linkedin,2026-03-21
|
||||||
|
L102,Raj,Patel,rpatel@firm.com,(555) 700-8888,Firm LLC,cold_outreach,2026-03-22
|
||||||
|
L103,Emma,Wright,ewright@consulting.com,(555) 800-9999,Wright Consulting,webinar,2026-03-23
|
||||||
|
16
test-cases/uc13_combined_lead_sources.csv
Normal file
16
test-cases/uc13_combined_lead_sources.csv
Normal file
@@ -0,0 +1,16 @@
|
|||||||
|
email,captured_date,source_channel,campaign,consent
|
||||||
|
lead1@example.com,2025-09-15,facebook_ad,back_to_school_2025,yes
|
||||||
|
lead1@example.com,2025-11-20,google_ad,black_friday_2025,yes
|
||||||
|
lead1@example.com,2026-01-08,organic_form,homepage_popup,yes
|
||||||
|
lead1@example.com,2026-03-12,facebook_ad,spring_2026,yes
|
||||||
|
lead2@example.com,2025-10-01,facebook_ad,halloween_2025,yes
|
||||||
|
lead2@example.com,2025-12-15,google_ad,holiday_2025,no
|
||||||
|
lead2@example.com,2026-02-01,organic_form,valentines_2026,yes
|
||||||
|
lead.three@gmail.com,2025-08-01,facebook_ad,summer_2025,yes
|
||||||
|
leadthree@gmail.com,2025-10-15,google_ad,fall_2025,yes
|
||||||
|
l.e.a.d.three@gmail.com,2026-02-20,organic_form,winter_2026,yes
|
||||||
|
solo1@example.com,2025-09-10,facebook_ad,back_to_school_2025,yes
|
||||||
|
solo2@example.com,2025-11-05,google_ad,black_friday_2025,yes
|
||||||
|
solo3@example.com,2026-01-20,organic_form,homepage_popup,yes
|
||||||
|
solo4@example.com,2026-03-08,facebook_ad,spring_2026,yes
|
||||||
|
solo5@example.com,2026-04-15,referral,ambassador_program,yes
|
||||||
|
15
test-cases/uc14_audience_cross_platform.csv
Normal file
15
test-cases/uc14_audience_cross_platform.csv
Normal file
@@ -0,0 +1,15 @@
|
|||||||
|
email,fb_id,google_click_id,platform,first_seen,last_seen
|
||||||
|
audience1@example.com,fb_77881122,,facebook,2026-01-05,2026-04-20
|
||||||
|
audience1@example.com,,gclid_AAA111,google_ads,2026-02-10,2026-04-18
|
||||||
|
audience1@example.com,,,organic_form,2026-03-15,2026-03-15
|
||||||
|
audience2@example.com,fb_99887766,,facebook,2026-02-01,2026-04-15
|
||||||
|
audience2@example.com,,gclid_BBB222,google_ads,2026-02-20,2026-04-22
|
||||||
|
old.email@example.com,fb_55443322,,facebook,2025-06-10,2025-12-01
|
||||||
|
new.email@example.com,fb_55443322,,facebook,2026-01-15,2026-04-20
|
||||||
|
,fb_11223344,,facebook,2026-03-01,2026-04-01
|
||||||
|
,,gclid_CCC333,google_ads,2026-03-10,2026-04-10
|
||||||
|
fbonly@example.com,fb_44556677,,facebook,2026-03-20,2026-04-20
|
||||||
|
googleonly@example.com,,gclid_DDD444,google_ads,2026-03-25,2026-04-22
|
||||||
|
organiconly@example.com,,,organic_form,2026-04-01,2026-04-01
|
||||||
|
combined@example.com,fb_88990011,gclid_EEE555,manual_merge,2026-04-10,2026-04-25
|
||||||
|
,,,sms_list,2026-04-05,2026-04-05
|
||||||
|
18
test-cases/uc15_suppression_combined.csv
Normal file
18
test-cases/uc15_suppression_combined.csv
Normal file
@@ -0,0 +1,18 @@
|
|||||||
|
email,suppression_reason,suppressed_date,source_list
|
||||||
|
supp1@example.com,unsubscribe,2025-08-15,list_main
|
||||||
|
supp1@example.com,hard_bounce,2025-09-20,list_promo
|
||||||
|
supp1@example.com,spam_complaint,2025-10-01,list_main
|
||||||
|
supp2@example.com,unsubscribe,2026-01-10,list_main
|
||||||
|
supp2@example.com,unsubscribe,2026-01-10,list_promo
|
||||||
|
supp2@example.com,unsubscribe,2026-01-10,list_newsletter
|
||||||
|
Supp3@Example.com,unsubscribe,2025-12-05,list_main
|
||||||
|
supp3@example.com,hard_bounce,2026-02-15,list_promo
|
||||||
|
bounced@gmail.com,hard_bounce,2026-03-01,list_main
|
||||||
|
b.o.u.n.c.e.d@gmail.com,hard_bounce,2026-03-01,list_promo
|
||||||
|
complainer@example.com,spam_complaint,2026-02-20,list_main
|
||||||
|
soft@example.com,soft_bounce_x3,2026-03-15,list_main
|
||||||
|
solo_unsub@example.com,unsubscribe,2026-04-01,list_main
|
||||||
|
solo_bounce@example.com,hard_bounce,2026-04-05,list_promo
|
||||||
|
solo_complaint@example.com,spam_complaint,2026-04-10,list_main
|
||||||
|
padded@example.com ,unsubscribe,2026-04-12,list_main
|
||||||
|
padded@example.com,unsubscribe,2026-04-12,list_promo
|
||||||
|
0
tests/__init__.py
Normal file
0
tests/__init__.py
Normal file
47
tests/conftest.py
Normal file
47
tests/conftest.py
Normal file
@@ -0,0 +1,47 @@
|
|||||||
|
"""Shared test fixtures."""
|
||||||
|
|
||||||
|
import pandas as pd
|
||||||
|
import pytest
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
SAMPLES_DIR = Path(__file__).parent.parent / "samples"
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.fixture
|
||||||
|
def sample_csv_path():
|
||||||
|
return SAMPLES_DIR / "messy_sales.csv"
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.fixture
|
||||||
|
def sample_df(sample_csv_path):
|
||||||
|
return pd.read_csv(sample_csv_path, dtype=str, keep_default_na=False)
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.fixture
|
||||||
|
def simple_df():
|
||||||
|
"""Small DataFrame with obvious duplicates for unit testing."""
|
||||||
|
return pd.DataFrame({
|
||||||
|
"name": ["Alice", "alice", "Bob", "Charlie", "ALICE"],
|
||||||
|
"email": ["alice@test.com", "alice@test.com", "bob@test.com",
|
||||||
|
"charlie@test.com", "alice@test.com"],
|
||||||
|
"phone": ["555-1234", "555-1234", "555-5678", "555-9012", "555-1234"],
|
||||||
|
})
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.fixture
|
||||||
|
def merge_df():
|
||||||
|
"""DataFrame with partial records that benefit from merge."""
|
||||||
|
return pd.DataFrame({
|
||||||
|
"name": ["John Doe", "John Doe", "Jane Smith"],
|
||||||
|
"email": ["john@test.com", "john@test.com", "jane@test.com"],
|
||||||
|
"phone": ["555-1111", "", "555-3333"],
|
||||||
|
"address": ["", "123 Main St", "456 Oak Ave"],
|
||||||
|
})
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.fixture
|
||||||
|
def tmp_csv(tmp_path, simple_df):
|
||||||
|
"""Write simple_df to a temp CSV and return the path."""
|
||||||
|
path = tmp_path / "test_input.csv"
|
||||||
|
simple_df.to_csv(path, index=False)
|
||||||
|
return path
|
||||||
147
tests/test_cli.py
Normal file
147
tests/test_cli.py
Normal file
@@ -0,0 +1,147 @@
|
|||||||
|
"""Integration tests for the CLI via Typer's CliRunner."""
|
||||||
|
|
||||||
|
import pytest
|
||||||
|
from pathlib import Path
|
||||||
|
from typer.testing import CliRunner
|
||||||
|
|
||||||
|
from src.cli import app
|
||||||
|
|
||||||
|
runner = CliRunner()
|
||||||
|
|
||||||
|
|
||||||
|
class TestCliPreview:
|
||||||
|
def test_preview_default(self, tmp_csv):
|
||||||
|
result = runner.invoke(app, [str(tmp_csv)])
|
||||||
|
assert result.exit_code == 0
|
||||||
|
assert "preview" in result.output.lower() or "Rows in" in result.output
|
||||||
|
|
||||||
|
def test_preview_shows_row_counts(self, tmp_csv):
|
||||||
|
result = runner.invoke(app, [str(tmp_csv)])
|
||||||
|
assert result.exit_code == 0
|
||||||
|
assert "Rows in" in result.output
|
||||||
|
assert "Rows out" in result.output
|
||||||
|
|
||||||
|
def test_file_not_found(self):
|
||||||
|
result = runner.invoke(app, ["/tmp/nonexistent_xyz_abc.csv"])
|
||||||
|
assert result.exit_code != 0
|
||||||
|
assert "not found" in result.output.lower()
|
||||||
|
|
||||||
|
|
||||||
|
class TestCliApply:
|
||||||
|
def test_apply_writes_output(self, tmp_csv, tmp_path):
|
||||||
|
out = tmp_path / "output.csv"
|
||||||
|
result = runner.invoke(app, [str(tmp_csv), "--apply", "-o", str(out)])
|
||||||
|
assert result.exit_code == 0
|
||||||
|
assert out.exists()
|
||||||
|
|
||||||
|
def test_apply_default_output_name(self, tmp_csv):
|
||||||
|
result = runner.invoke(app, [str(tmp_csv), "--apply"])
|
||||||
|
assert result.exit_code == 0
|
||||||
|
expected = tmp_csv.parent / f"{tmp_csv.stem}_deduplicated.csv"
|
||||||
|
assert expected.exists()
|
||||||
|
|
||||||
|
def test_apply_creates_removed_file(self, tmp_csv):
|
||||||
|
result = runner.invoke(app, [str(tmp_csv), "--apply"])
|
||||||
|
assert result.exit_code == 0
|
||||||
|
removed = tmp_csv.parent / f"{tmp_csv.stem}_removed.csv"
|
||||||
|
# May or may not exist depending on whether duplicates were found
|
||||||
|
# with default auto-detect on simple_df
|
||||||
|
|
||||||
|
|
||||||
|
class TestCliFuzzy:
|
||||||
|
def test_fuzzy_flag(self, tmp_csv):
|
||||||
|
result = runner.invoke(app, [
|
||||||
|
str(tmp_csv), "--fuzzy", "name", "--threshold", "80",
|
||||||
|
])
|
||||||
|
assert result.exit_code == 0
|
||||||
|
|
||||||
|
def test_subset_flag(self, tmp_csv):
|
||||||
|
result = runner.invoke(app, [
|
||||||
|
str(tmp_csv), "--subset", "email",
|
||||||
|
])
|
||||||
|
assert result.exit_code == 0
|
||||||
|
|
||||||
|
def test_bad_column_error(self, tmp_csv):
|
||||||
|
result = runner.invoke(app, [
|
||||||
|
str(tmp_csv), "--subset", "nonexistent_column",
|
||||||
|
])
|
||||||
|
assert result.exit_code != 0
|
||||||
|
assert "not found" in result.output.lower()
|
||||||
|
|
||||||
|
|
||||||
|
class TestCliConfig:
|
||||||
|
def test_save_and_load_config(self, tmp_csv, tmp_path):
|
||||||
|
cfg_path = tmp_path / "my_config.json"
|
||||||
|
# Save
|
||||||
|
result = runner.invoke(app, [
|
||||||
|
str(tmp_csv), "--subset", "email", "--save-config", str(cfg_path),
|
||||||
|
])
|
||||||
|
assert result.exit_code == 0
|
||||||
|
assert cfg_path.exists()
|
||||||
|
|
||||||
|
# Load and apply
|
||||||
|
result = runner.invoke(app, [
|
||||||
|
str(tmp_csv), "--config", str(cfg_path), "--apply",
|
||||||
|
])
|
||||||
|
assert result.exit_code == 0
|
||||||
|
|
||||||
|
|
||||||
|
class TestCliSurvivor:
|
||||||
|
def test_survivor_last(self, tmp_csv):
|
||||||
|
result = runner.invoke(app, [str(tmp_csv), "--survivor", "last"])
|
||||||
|
assert result.exit_code == 0
|
||||||
|
|
||||||
|
def test_survivor_most_complete(self, tmp_csv):
|
||||||
|
result = runner.invoke(app, [str(tmp_csv), "--survivor", "most-complete"])
|
||||||
|
assert result.exit_code == 0
|
||||||
|
|
||||||
|
def test_invalid_survivor(self, tmp_csv):
|
||||||
|
result = runner.invoke(app, [str(tmp_csv), "--survivor", "bogus"])
|
||||||
|
assert result.exit_code != 0
|
||||||
|
|
||||||
|
|
||||||
|
class TestCliMerge:
|
||||||
|
def test_merge_flag(self, tmp_csv):
|
||||||
|
result = runner.invoke(app, [str(tmp_csv), "--merge", "--apply"])
|
||||||
|
assert result.exit_code == 0
|
||||||
|
|
||||||
|
|
||||||
|
class TestCliSampleData:
|
||||||
|
def test_sample_preview(self, sample_csv_path):
|
||||||
|
result = runner.invoke(app, [str(sample_csv_path)])
|
||||||
|
assert result.exit_code == 0
|
||||||
|
assert "Rows in: 50" in result.output
|
||||||
|
# Should find duplicates
|
||||||
|
assert "Removed:" in result.output
|
||||||
|
|
||||||
|
def test_sample_apply(self, sample_csv_path, tmp_path):
|
||||||
|
out = tmp_path / "deduped.csv"
|
||||||
|
result = runner.invoke(app, [
|
||||||
|
str(sample_csv_path), "--apply", "-o", str(out),
|
||||||
|
])
|
||||||
|
assert result.exit_code == 0
|
||||||
|
assert out.exists()
|
||||||
|
import pandas as pd
|
||||||
|
df = pd.read_csv(out, encoding="utf-8-sig")
|
||||||
|
# Should have fewer than 50 rows
|
||||||
|
assert len(df) < 50
|
||||||
|
|
||||||
|
def test_sample_fuzzy_with_merge(self, sample_csv_path, tmp_path):
|
||||||
|
out = tmp_path / "fuzzy_merged.csv"
|
||||||
|
result = runner.invoke(app, [
|
||||||
|
str(sample_csv_path),
|
||||||
|
"--fuzzy", "customer_name",
|
||||||
|
"--threshold", "80",
|
||||||
|
"--merge",
|
||||||
|
"--apply",
|
||||||
|
"-o", str(out),
|
||||||
|
])
|
||||||
|
assert result.exit_code == 0
|
||||||
|
assert out.exists()
|
||||||
|
|
||||||
|
|
||||||
|
class TestCliHelp:
|
||||||
|
def test_help(self):
|
||||||
|
result = runner.invoke(app, ["--help"])
|
||||||
|
assert result.exit_code == 0
|
||||||
|
assert "--apply" in result.output
|
||||||
102
tests/test_config.py
Normal file
102
tests/test_config.py
Normal file
@@ -0,0 +1,102 @@
|
|||||||
|
"""Tests for src.core.config — save/load configuration profiles."""
|
||||||
|
|
||||||
|
import json
|
||||||
|
import pytest
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
from src.core.config import (
|
||||||
|
DeduplicationConfig,
|
||||||
|
StrategyConfig,
|
||||||
|
ColumnStrategyConfig,
|
||||||
|
)
|
||||||
|
from src.core.dedup import Algorithm, SurvivorRule
|
||||||
|
from src.core.normalizers import NormalizerType
|
||||||
|
|
||||||
|
|
||||||
|
class TestDeduplicationConfig:
|
||||||
|
def test_default(self):
|
||||||
|
cfg = DeduplicationConfig.default()
|
||||||
|
assert cfg.survivor_rule == "first"
|
||||||
|
assert cfg.merge is False
|
||||||
|
assert cfg.strategies == []
|
||||||
|
|
||||||
|
def test_to_dict_roundtrip(self):
|
||||||
|
cfg = DeduplicationConfig(
|
||||||
|
strategies=[
|
||||||
|
StrategyConfig(columns=[
|
||||||
|
ColumnStrategyConfig(
|
||||||
|
column="email",
|
||||||
|
algorithm="exact",
|
||||||
|
threshold=100.0,
|
||||||
|
normalizer="email",
|
||||||
|
),
|
||||||
|
]),
|
||||||
|
],
|
||||||
|
survivor_rule="most_complete",
|
||||||
|
merge=True,
|
||||||
|
)
|
||||||
|
d = cfg.to_dict()
|
||||||
|
cfg2 = DeduplicationConfig.from_dict(d)
|
||||||
|
assert cfg2.survivor_rule == "most_complete"
|
||||||
|
assert cfg2.merge is True
|
||||||
|
assert len(cfg2.strategies) == 1
|
||||||
|
assert cfg2.strategies[0].columns[0].column == "email"
|
||||||
|
|
||||||
|
def test_to_file_from_file(self, tmp_path):
|
||||||
|
cfg = DeduplicationConfig(
|
||||||
|
strategies=[
|
||||||
|
StrategyConfig(columns=[
|
||||||
|
ColumnStrategyConfig(column="name", algorithm="jaro_winkler",
|
||||||
|
threshold=85.0, normalizer="name"),
|
||||||
|
]),
|
||||||
|
],
|
||||||
|
survivor_rule="last",
|
||||||
|
)
|
||||||
|
path = tmp_path / "test_config.json"
|
||||||
|
cfg.to_file(path)
|
||||||
|
|
||||||
|
loaded = DeduplicationConfig.from_file(path)
|
||||||
|
assert loaded.survivor_rule == "last"
|
||||||
|
assert len(loaded.strategies) == 1
|
||||||
|
assert loaded.strategies[0].columns[0].algorithm == "jaro_winkler"
|
||||||
|
|
||||||
|
def test_to_strategies(self):
|
||||||
|
cfg = DeduplicationConfig(
|
||||||
|
strategies=[
|
||||||
|
StrategyConfig(columns=[
|
||||||
|
ColumnStrategyConfig(column="email", algorithm="exact",
|
||||||
|
threshold=100.0, normalizer="email"),
|
||||||
|
ColumnStrategyConfig(column="phone", algorithm="exact",
|
||||||
|
threshold=100.0, normalizer="phone"),
|
||||||
|
]),
|
||||||
|
],
|
||||||
|
)
|
||||||
|
strats = cfg.to_strategies()
|
||||||
|
assert strats is not None
|
||||||
|
assert len(strats) == 1
|
||||||
|
assert len(strats[0].column_strategies) == 2
|
||||||
|
assert strats[0].column_strategies[0].algorithm == Algorithm.EXACT
|
||||||
|
assert strats[0].column_strategies[0].normalizer == NormalizerType.EMAIL
|
||||||
|
|
||||||
|
def test_to_strategies_empty(self):
|
||||||
|
cfg = DeduplicationConfig.default()
|
||||||
|
assert cfg.to_strategies() is None
|
||||||
|
|
||||||
|
def test_to_survivor_rule(self):
|
||||||
|
cfg = DeduplicationConfig(survivor_rule="most_complete")
|
||||||
|
assert cfg.to_survivor_rule() == SurvivorRule.KEEP_MOST_COMPLETE
|
||||||
|
|
||||||
|
def test_json_is_valid(self, tmp_path):
|
||||||
|
cfg = DeduplicationConfig(
|
||||||
|
strategies=[
|
||||||
|
StrategyConfig(columns=[
|
||||||
|
ColumnStrategyConfig(column="x", algorithm="exact"),
|
||||||
|
]),
|
||||||
|
],
|
||||||
|
normalize_map={"email": "email"},
|
||||||
|
)
|
||||||
|
path = tmp_path / "valid.json"
|
||||||
|
cfg.to_file(path)
|
||||||
|
data = json.loads(path.read_text())
|
||||||
|
assert "strategies" in data
|
||||||
|
assert "normalize_map" in data
|
||||||
258
tests/test_dedup.py
Normal file
258
tests/test_dedup.py
Normal file
@@ -0,0 +1,258 @@
|
|||||||
|
"""Tests for src.core.dedup — matching engine."""
|
||||||
|
|
||||||
|
import pandas as pd
|
||||||
|
import pytest
|
||||||
|
|
||||||
|
from src.core.dedup import (
|
||||||
|
Algorithm,
|
||||||
|
ColumnMatchStrategy,
|
||||||
|
MatchStrategy,
|
||||||
|
SurvivorRule,
|
||||||
|
_compute_similarity,
|
||||||
|
_compare_pair,
|
||||||
|
_UnionFind,
|
||||||
|
build_default_strategies,
|
||||||
|
deduplicate,
|
||||||
|
)
|
||||||
|
from src.core.normalizers import NormalizerType
|
||||||
|
|
||||||
|
|
||||||
|
class TestComputeSimilarity:
|
||||||
|
def test_exact_match(self):
|
||||||
|
assert _compute_similarity("hello", "hello", Algorithm.EXACT) == 100.0
|
||||||
|
|
||||||
|
def test_exact_mismatch(self):
|
||||||
|
assert _compute_similarity("hello", "world", Algorithm.EXACT) == 0.0
|
||||||
|
|
||||||
|
def test_levenshtein_similar(self):
|
||||||
|
score = _compute_similarity("kitten", "sitting", Algorithm.LEVENSHTEIN)
|
||||||
|
assert 50 < score < 80
|
||||||
|
|
||||||
|
def test_jaro_winkler_similar(self):
|
||||||
|
score = _compute_similarity("john", "jon", Algorithm.JARO_WINKLER)
|
||||||
|
assert score > 80
|
||||||
|
|
||||||
|
def test_token_set_ratio(self):
|
||||||
|
score = _compute_similarity(
|
||||||
|
"123 main street apt 4",
|
||||||
|
"apt 4 123 main street",
|
||||||
|
Algorithm.TOKEN_SET_RATIO,
|
||||||
|
)
|
||||||
|
assert score == 100.0
|
||||||
|
|
||||||
|
|
||||||
|
class TestUnionFind:
|
||||||
|
def test_basic_union(self):
|
||||||
|
uf = _UnionFind(5)
|
||||||
|
uf.union(0, 1)
|
||||||
|
uf.union(1, 2)
|
||||||
|
assert uf.find(0) == uf.find(2) # transitive
|
||||||
|
|
||||||
|
def test_separate_groups(self):
|
||||||
|
uf = _UnionFind(5)
|
||||||
|
uf.union(0, 1)
|
||||||
|
uf.union(3, 4)
|
||||||
|
assert uf.find(0) != uf.find(3)
|
||||||
|
|
||||||
|
def test_groups(self):
|
||||||
|
uf = _UnionFind(5)
|
||||||
|
uf.union(0, 1)
|
||||||
|
uf.union(1, 2)
|
||||||
|
uf.union(3, 4)
|
||||||
|
groups = uf.groups()
|
||||||
|
assert len(groups) == 2
|
||||||
|
sizes = sorted(len(v) for v in groups.values())
|
||||||
|
assert sizes == [2, 3]
|
||||||
|
|
||||||
|
|
||||||
|
class TestComparePair:
|
||||||
|
def test_exact_match(self):
|
||||||
|
strategy = MatchStrategy(column_strategies=[
|
||||||
|
ColumnMatchStrategy(column="email", algorithm=Algorithm.EXACT),
|
||||||
|
])
|
||||||
|
row_a = pd.Series({"email": "test@example.com"})
|
||||||
|
row_b = pd.Series({"email": "test@example.com"})
|
||||||
|
is_match, conf, cols = _compare_pair(row_a, row_b, strategy)
|
||||||
|
assert is_match
|
||||||
|
assert conf == 100.0
|
||||||
|
assert cols == ["email"]
|
||||||
|
|
||||||
|
def test_exact_mismatch(self):
|
||||||
|
strategy = MatchStrategy(column_strategies=[
|
||||||
|
ColumnMatchStrategy(column="email", algorithm=Algorithm.EXACT),
|
||||||
|
])
|
||||||
|
row_a = pd.Series({"email": "a@test.com"})
|
||||||
|
row_b = pd.Series({"email": "b@test.com"})
|
||||||
|
is_match, conf, cols = _compare_pair(row_a, row_b, strategy)
|
||||||
|
assert not is_match
|
||||||
|
|
||||||
|
def test_fuzzy_match(self):
|
||||||
|
strategy = MatchStrategy(column_strategies=[
|
||||||
|
ColumnMatchStrategy(column="name", algorithm=Algorithm.JARO_WINKLER, threshold=80),
|
||||||
|
])
|
||||||
|
row_a = pd.Series({"name": "john smith"})
|
||||||
|
row_b = pd.Series({"name": "jon smith"})
|
||||||
|
is_match, conf, cols = _compare_pair(row_a, row_b, strategy)
|
||||||
|
assert is_match
|
||||||
|
assert conf > 80
|
||||||
|
|
||||||
|
def test_and_logic_both_must_match(self):
|
||||||
|
strategy = MatchStrategy(column_strategies=[
|
||||||
|
ColumnMatchStrategy(column="name", algorithm=Algorithm.EXACT),
|
||||||
|
ColumnMatchStrategy(column="email", algorithm=Algorithm.EXACT),
|
||||||
|
])
|
||||||
|
# name matches, email doesn't
|
||||||
|
row_a = pd.Series({"name": "alice", "email": "a@test.com"})
|
||||||
|
row_b = pd.Series({"name": "alice", "email": "b@test.com"})
|
||||||
|
is_match, conf, cols = _compare_pair(row_a, row_b, strategy)
|
||||||
|
assert not is_match
|
||||||
|
|
||||||
|
|
||||||
|
class TestBuildDefaultStrategies:
|
||||||
|
def test_detects_email(self):
|
||||||
|
df = pd.DataFrame({"email": ["a@b.com"], "name": ["Alice"]})
|
||||||
|
strats = build_default_strategies(df)
|
||||||
|
# email (strong, standalone) + name AND email (weak paired with strong) = 2
|
||||||
|
assert len(strats) == 2
|
||||||
|
found_email = any(
|
||||||
|
cs.column == "email" and cs.normalizer == NormalizerType.EMAIL
|
||||||
|
for s in strats for cs in s.column_strategies
|
||||||
|
)
|
||||||
|
assert found_email
|
||||||
|
# Name should only appear paired with email, not standalone
|
||||||
|
name_strats = [s for s in strats
|
||||||
|
if any(cs.column == "name" for cs in s.column_strategies)]
|
||||||
|
for s in name_strats:
|
||||||
|
assert len(s.column_strategies) >= 2, "Name should be paired with a strong key"
|
||||||
|
|
||||||
|
def test_fallback_all_columns(self):
|
||||||
|
df = pd.DataFrame({"x": [1], "y": [2], "z": [3]})
|
||||||
|
strats = build_default_strategies(df)
|
||||||
|
assert len(strats) == 1
|
||||||
|
assert len(strats[0].column_strategies) == 3
|
||||||
|
|
||||||
|
|
||||||
|
class TestDeduplicate:
|
||||||
|
def test_exact_duplicates(self, simple_df):
|
||||||
|
# Alice appears 3 times with same email
|
||||||
|
strategy = MatchStrategy(column_strategies=[
|
||||||
|
ColumnMatchStrategy(column="email", algorithm=Algorithm.EXACT),
|
||||||
|
])
|
||||||
|
result = deduplicate(simple_df, strategies=[strategy])
|
||||||
|
# 3 Alices -> 1, Bob stays, Charlie stays = 3 rows
|
||||||
|
assert len(result.deduplicated_df) == 3
|
||||||
|
assert result.original_row_count == 5
|
||||||
|
assert len(result.match_groups) == 1
|
||||||
|
|
||||||
|
def test_fuzzy_name_match(self):
|
||||||
|
df = pd.DataFrame({
|
||||||
|
"name": ["John Smith", "Jon Smith", "Jane Doe"],
|
||||||
|
"email": ["a@test.com", "b@test.com", "c@test.com"],
|
||||||
|
})
|
||||||
|
strategy = MatchStrategy(column_strategies=[
|
||||||
|
ColumnMatchStrategy(
|
||||||
|
column="name",
|
||||||
|
algorithm=Algorithm.JARO_WINKLER,
|
||||||
|
threshold=85,
|
||||||
|
normalizer=NormalizerType.NAME,
|
||||||
|
),
|
||||||
|
])
|
||||||
|
result = deduplicate(df, strategies=[strategy])
|
||||||
|
assert len(result.deduplicated_df) == 2
|
||||||
|
assert len(result.match_groups) == 1
|
||||||
|
|
||||||
|
def test_survivor_keep_last(self, simple_df):
|
||||||
|
strategy = MatchStrategy(column_strategies=[
|
||||||
|
ColumnMatchStrategy(column="email", algorithm=Algorithm.EXACT),
|
||||||
|
])
|
||||||
|
result = deduplicate(simple_df, strategies=[strategy],
|
||||||
|
survivor_rule=SurvivorRule.KEEP_LAST)
|
||||||
|
# The last Alice (index 4) should survive
|
||||||
|
assert len(result.match_groups) == 1
|
||||||
|
assert result.match_groups[0].survivor_index == 4
|
||||||
|
|
||||||
|
def test_survivor_most_complete(self, merge_df):
|
||||||
|
strategy = MatchStrategy(column_strategies=[
|
||||||
|
ColumnMatchStrategy(column="email", algorithm=Algorithm.EXACT),
|
||||||
|
])
|
||||||
|
result = deduplicate(merge_df, strategies=[strategy],
|
||||||
|
survivor_rule=SurvivorRule.KEEP_MOST_COMPLETE)
|
||||||
|
# Row 0 has phone but no address (1 empty)
|
||||||
|
# Row 1 has address but no phone (1 empty)
|
||||||
|
# Both have 1 empty, so keep_first among ties
|
||||||
|
assert len(result.deduplicated_df) == 2
|
||||||
|
|
||||||
|
def test_merge_mode(self, merge_df):
|
||||||
|
strategy = MatchStrategy(column_strategies=[
|
||||||
|
ColumnMatchStrategy(column="email", algorithm=Algorithm.EXACT),
|
||||||
|
])
|
||||||
|
result = deduplicate(merge_df, strategies=[strategy], merge=True)
|
||||||
|
# Survivor should have both phone and address filled
|
||||||
|
john_row = result.deduplicated_df[
|
||||||
|
result.deduplicated_df["name"] == "John Doe"
|
||||||
|
].iloc[0]
|
||||||
|
assert john_row["phone"] == "555-1111"
|
||||||
|
assert john_row["address"] == "123 Main St"
|
||||||
|
|
||||||
|
def test_multi_strategy_or(self):
|
||||||
|
df = pd.DataFrame({
|
||||||
|
"name": ["Alice", "Bob", "Alice B."],
|
||||||
|
"email": ["a@test.com", "a@test.com", "c@test.com"],
|
||||||
|
})
|
||||||
|
# Strategy 1: match on email
|
||||||
|
# Strategy 2: match on name (fuzzy)
|
||||||
|
strat1 = MatchStrategy(column_strategies=[
|
||||||
|
ColumnMatchStrategy(column="email", algorithm=Algorithm.EXACT),
|
||||||
|
])
|
||||||
|
strat2 = MatchStrategy(column_strategies=[
|
||||||
|
ColumnMatchStrategy(column="name", algorithm=Algorithm.JARO_WINKLER, threshold=70),
|
||||||
|
])
|
||||||
|
result = deduplicate(df, strategies=[strat1, strat2])
|
||||||
|
# All three should end up in one group via transitive closure:
|
||||||
|
# Alice~Bob (email), Alice~Alice B. (name)
|
||||||
|
assert len(result.deduplicated_df) == 1
|
||||||
|
|
||||||
|
def test_confidence_score(self, simple_df):
|
||||||
|
strategy = MatchStrategy(column_strategies=[
|
||||||
|
ColumnMatchStrategy(column="email", algorithm=Algorithm.EXACT),
|
||||||
|
])
|
||||||
|
result = deduplicate(simple_df, strategies=[strategy])
|
||||||
|
for group in result.match_groups:
|
||||||
|
assert 0 <= group.confidence <= 100
|
||||||
|
|
||||||
|
def test_preview_flag(self, simple_df):
|
||||||
|
result = deduplicate(simple_df, preview=True)
|
||||||
|
assert result.is_preview is True
|
||||||
|
result2 = deduplicate(simple_df, preview=False)
|
||||||
|
assert result2.is_preview is False
|
||||||
|
|
||||||
|
def test_auto_detect_strategies(self, sample_df):
|
||||||
|
result = deduplicate(sample_df)
|
||||||
|
# Should find duplicates in the sample data
|
||||||
|
assert len(result.match_groups) > 0
|
||||||
|
assert len(result.deduplicated_df) < len(sample_df)
|
||||||
|
|
||||||
|
def test_idempotent(self, sample_df):
|
||||||
|
"""Running dedup twice with same config produces same output."""
|
||||||
|
result1 = deduplicate(sample_df)
|
||||||
|
result2 = deduplicate(result1.deduplicated_df)
|
||||||
|
# Second pass should find no new duplicates
|
||||||
|
assert len(result2.match_groups) == 0
|
||||||
|
assert len(result2.deduplicated_df) == len(result1.deduplicated_df)
|
||||||
|
|
||||||
|
def test_review_callback(self):
|
||||||
|
df = pd.DataFrame({
|
||||||
|
"email": ["a@test.com", "a@test.com", "b@test.com"],
|
||||||
|
})
|
||||||
|
strategy = MatchStrategy(column_strategies=[
|
||||||
|
ColumnMatchStrategy(column="email", algorithm=Algorithm.EXACT),
|
||||||
|
])
|
||||||
|
# Reject all matches
|
||||||
|
result = deduplicate(df, strategies=[strategy],
|
||||||
|
review_callback=lambda g, d: False)
|
||||||
|
assert len(result.deduplicated_df) == 3 # nothing removed
|
||||||
|
|
||||||
|
# Accept all matches
|
||||||
|
result = deduplicate(df, strategies=[strategy],
|
||||||
|
review_callback=lambda g, d: True)
|
||||||
|
assert len(result.deduplicated_df) == 2
|
||||||
130
tests/test_io.py
Normal file
130
tests/test_io.py
Normal file
@@ -0,0 +1,130 @@
|
|||||||
|
"""Tests for src.core.io — file reading, encoding/delimiter detection."""
|
||||||
|
|
||||||
|
import pandas as pd
|
||||||
|
import pytest
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
from src.core.io import (
|
||||||
|
detect_encoding,
|
||||||
|
detect_delimiter,
|
||||||
|
detect_header_row,
|
||||||
|
read_file,
|
||||||
|
write_file,
|
||||||
|
list_sheets,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
class TestDetectEncoding:
|
||||||
|
def test_utf8_file(self, sample_csv_path):
|
||||||
|
enc = detect_encoding(sample_csv_path)
|
||||||
|
assert enc.lower().replace("-", "") in ("utf8", "ascii", "utf8sig")
|
||||||
|
|
||||||
|
def test_empty_file(self, tmp_path):
|
||||||
|
f = tmp_path / "empty.csv"
|
||||||
|
f.write_bytes(b"")
|
||||||
|
assert detect_encoding(f) == "utf-8"
|
||||||
|
|
||||||
|
def test_bom_file(self, tmp_path):
|
||||||
|
f = tmp_path / "bom.csv"
|
||||||
|
f.write_bytes(b"\xef\xbb\xbfname,email\nAlice,a@b.com\n")
|
||||||
|
assert detect_encoding(f) == "utf-8-sig"
|
||||||
|
|
||||||
|
def test_latin1_file(self, tmp_path):
|
||||||
|
f = tmp_path / "latin.csv"
|
||||||
|
content = "name,city\nJosé,São Paulo\n".encode("latin-1")
|
||||||
|
f.write_bytes(content)
|
||||||
|
enc = detect_encoding(f)
|
||||||
|
# Should detect something compatible with latin-1 family
|
||||||
|
assert enc in ("iso-8859-1", "latin-1", "windows-1252", "cp1252",
|
||||||
|
"iso-8859-9", "cp1250", "iso-8859-15", "utf-8")
|
||||||
|
|
||||||
|
|
||||||
|
class TestDetectDelimiter:
|
||||||
|
def test_comma(self, sample_csv_path):
|
||||||
|
assert detect_delimiter(sample_csv_path) == ","
|
||||||
|
|
||||||
|
def test_tab(self, tmp_path):
|
||||||
|
f = tmp_path / "tabs.tsv"
|
||||||
|
f.write_text("name\temail\nAlice\ta@b.com\n")
|
||||||
|
assert detect_delimiter(f) == "\t"
|
||||||
|
|
||||||
|
def test_semicolon(self, tmp_path):
|
||||||
|
f = tmp_path / "semi.csv"
|
||||||
|
f.write_text("name;email;phone\nAlice;a@b.com;555\n")
|
||||||
|
assert detect_delimiter(f) == ";"
|
||||||
|
|
||||||
|
def test_pipe(self, tmp_path):
|
||||||
|
f = tmp_path / "pipe.csv"
|
||||||
|
f.write_text("name|email|phone\nAlice|a@b.com|555\n")
|
||||||
|
assert detect_delimiter(f) == "|"
|
||||||
|
|
||||||
|
|
||||||
|
class TestDetectHeaderRow:
|
||||||
|
def test_standard_csv(self, sample_csv_path):
|
||||||
|
assert detect_header_row(sample_csv_path) == 0
|
||||||
|
|
||||||
|
def test_with_junk_rows(self, tmp_path):
|
||||||
|
f = tmp_path / "junk.csv"
|
||||||
|
f.write_text("Report generated 2024-01-01\n\nname,email,phone\nAlice,a@b.com,555\n")
|
||||||
|
# Row 0 has "Report generated..." which is a single non-numeric string
|
||||||
|
# Row 2 has "name,email,phone" which looks like headers
|
||||||
|
# The heuristic checks all cells, so row 0 may match if it's a single cell
|
||||||
|
hdr = detect_header_row(f)
|
||||||
|
assert hdr in (0, 2) # depends on delimiter detection
|
||||||
|
|
||||||
|
|
||||||
|
class TestReadFile:
|
||||||
|
def test_read_csv(self, sample_csv_path):
|
||||||
|
df = read_file(sample_csv_path)
|
||||||
|
assert isinstance(df, pd.DataFrame)
|
||||||
|
assert len(df) == 50
|
||||||
|
assert "customer_name" in df.columns
|
||||||
|
|
||||||
|
def test_read_nonexistent(self):
|
||||||
|
with pytest.raises(FileNotFoundError):
|
||||||
|
read_file("/tmp/nonexistent_file_xyz.csv")
|
||||||
|
|
||||||
|
def test_read_with_encoding_override(self, sample_csv_path):
|
||||||
|
df = read_file(sample_csv_path, encoding="utf-8")
|
||||||
|
assert len(df) == 50
|
||||||
|
|
||||||
|
def test_chunked_reading(self, sample_csv_path):
|
||||||
|
chunks = read_file(sample_csv_path, chunk_size=10)
|
||||||
|
# Should be a generator
|
||||||
|
all_chunks = list(chunks)
|
||||||
|
assert len(all_chunks) == 5
|
||||||
|
total_rows = sum(len(c) for c in all_chunks)
|
||||||
|
assert total_rows == 50
|
||||||
|
|
||||||
|
|
||||||
|
class TestWriteFile:
|
||||||
|
def test_write_csv(self, tmp_path, simple_df):
|
||||||
|
out = tmp_path / "output.csv"
|
||||||
|
write_file(simple_df, out)
|
||||||
|
assert out.exists()
|
||||||
|
# Read back
|
||||||
|
df = pd.read_csv(out, encoding="utf-8-sig")
|
||||||
|
assert len(df) == len(simple_df)
|
||||||
|
|
||||||
|
def test_write_xlsx(self, tmp_path, simple_df):
|
||||||
|
out = tmp_path / "output.xlsx"
|
||||||
|
write_file(simple_df, out)
|
||||||
|
assert out.exists()
|
||||||
|
df = pd.read_excel(out)
|
||||||
|
assert len(df) == len(simple_df)
|
||||||
|
|
||||||
|
def test_utf8_bom_default(self, tmp_path, simple_df):
|
||||||
|
out = tmp_path / "bom.csv"
|
||||||
|
write_file(simple_df, out)
|
||||||
|
raw = out.read_bytes()
|
||||||
|
assert raw[:3] == b"\xef\xbb\xbf"
|
||||||
|
|
||||||
|
|
||||||
|
class TestListSheets:
|
||||||
|
def test_list_sheets(self, tmp_path, simple_df):
|
||||||
|
path = tmp_path / "multi.xlsx"
|
||||||
|
with pd.ExcelWriter(path, engine="openpyxl") as writer:
|
||||||
|
simple_df.to_excel(writer, sheet_name="Sheet1", index=False)
|
||||||
|
simple_df.to_excel(writer, sheet_name="Sheet2", index=False)
|
||||||
|
sheets = list_sheets(path)
|
||||||
|
assert sheets == ["Sheet1", "Sheet2"]
|
||||||
158
tests/test_normalizers.py
Normal file
158
tests/test_normalizers.py
Normal file
@@ -0,0 +1,158 @@
|
|||||||
|
"""Tests for src.core.normalizers."""
|
||||||
|
|
||||||
|
import pytest
|
||||||
|
from src.core.normalizers import (
|
||||||
|
NormalizerType,
|
||||||
|
get_normalizer,
|
||||||
|
normalize_email,
|
||||||
|
normalize_phone,
|
||||||
|
normalize_name,
|
||||||
|
normalize_address,
|
||||||
|
normalize_string,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
class TestNormalizeEmail:
|
||||||
|
def test_basic_lowercase(self):
|
||||||
|
assert normalize_email("John@Example.COM") == "john@example.com"
|
||||||
|
|
||||||
|
def test_strip_whitespace(self):
|
||||||
|
assert normalize_email(" alice@test.com ") == "alice@test.com"
|
||||||
|
|
||||||
|
def test_strip_gmail_dots(self):
|
||||||
|
assert normalize_email("j.o.h.n@gmail.com") == "john@gmail.com"
|
||||||
|
|
||||||
|
def test_strip_plus_tag(self):
|
||||||
|
assert normalize_email("alice+promo@test.com") == "alice@test.com"
|
||||||
|
|
||||||
|
def test_gmail_dots_and_plus(self):
|
||||||
|
assert normalize_email("j.smith+tag@gmail.com") == "jsmith@gmail.com"
|
||||||
|
|
||||||
|
def test_non_gmail_keeps_dots(self):
|
||||||
|
assert normalize_email("j.smith@company.com") == "j.smith@company.com"
|
||||||
|
|
||||||
|
def test_empty(self):
|
||||||
|
assert normalize_email("") == ""
|
||||||
|
assert normalize_email(None) == ""
|
||||||
|
|
||||||
|
def test_no_at_sign(self):
|
||||||
|
assert normalize_email("not-an-email") == "not-an-email"
|
||||||
|
|
||||||
|
def test_idempotent(self):
|
||||||
|
result = normalize_email("J.Smith+tag@Gmail.com")
|
||||||
|
assert normalize_email(result) == result
|
||||||
|
|
||||||
|
|
||||||
|
class TestNormalizePhone:
|
||||||
|
def test_us_formatted(self):
|
||||||
|
assert normalize_phone("(555) 123-4567") == "+15551234567"
|
||||||
|
|
||||||
|
def test_dashes(self):
|
||||||
|
assert normalize_phone("555-123-4567") == "+15551234567"
|
||||||
|
|
||||||
|
def test_dots(self):
|
||||||
|
assert normalize_phone("555.123.4567") == "+15551234567"
|
||||||
|
|
||||||
|
def test_with_country_code(self):
|
||||||
|
assert normalize_phone("+1 555-123-4567") == "+15551234567"
|
||||||
|
|
||||||
|
def test_digits_only_input(self):
|
||||||
|
assert normalize_phone("5551234567") == "+15551234567"
|
||||||
|
|
||||||
|
def test_empty(self):
|
||||||
|
assert normalize_phone("") == ""
|
||||||
|
assert normalize_phone(None) == ""
|
||||||
|
|
||||||
|
def test_invalid_fallback_digits(self):
|
||||||
|
# Very short number that phonenumbers rejects
|
||||||
|
result = normalize_phone("123")
|
||||||
|
assert result == "123"
|
||||||
|
|
||||||
|
def test_idempotent(self):
|
||||||
|
result = normalize_phone("(555) 123-4567")
|
||||||
|
assert normalize_phone(result) == result
|
||||||
|
|
||||||
|
|
||||||
|
class TestNormalizeName:
|
||||||
|
def test_strip_mr(self):
|
||||||
|
assert normalize_name("Mr. John Smith") == "john smith"
|
||||||
|
|
||||||
|
def test_strip_dr(self):
|
||||||
|
assert normalize_name("Dr. Jane Doe") == "jane doe"
|
||||||
|
|
||||||
|
def test_strip_suffix(self):
|
||||||
|
assert normalize_name("Robert Brown Jr.") == "robert brown"
|
||||||
|
|
||||||
|
def test_strip_numeral_suffix(self):
|
||||||
|
assert normalize_name("James Wilson III") == "james wilson"
|
||||||
|
|
||||||
|
def test_title_and_suffix(self):
|
||||||
|
assert normalize_name("Dr. Michael Williams III") == "michael williams"
|
||||||
|
|
||||||
|
def test_collapse_whitespace(self):
|
||||||
|
assert normalize_name(" John Smith ") == "john smith"
|
||||||
|
|
||||||
|
def test_case_fold(self):
|
||||||
|
assert normalize_name("JOHN SMITH") == "john smith"
|
||||||
|
|
||||||
|
def test_empty(self):
|
||||||
|
assert normalize_name("") == ""
|
||||||
|
assert normalize_name(None) == ""
|
||||||
|
|
||||||
|
def test_idempotent(self):
|
||||||
|
result = normalize_name("Mr. John Smith Jr.")
|
||||||
|
assert normalize_name(result) == result
|
||||||
|
|
||||||
|
|
||||||
|
class TestNormalizeAddress:
|
||||||
|
def test_street_abbreviation(self):
|
||||||
|
assert normalize_address("123 Main Street") == "123 main st"
|
||||||
|
|
||||||
|
def test_avenue_abbreviation(self):
|
||||||
|
assert normalize_address("456 Oak Avenue") == "456 oak ave"
|
||||||
|
|
||||||
|
def test_boulevard_abbreviation(self):
|
||||||
|
assert normalize_address("789 Pine Boulevard") == "789 pine blvd"
|
||||||
|
|
||||||
|
def test_apartment(self):
|
||||||
|
assert normalize_address("123 Main St Apartment 4") == "123 main st apt 4"
|
||||||
|
|
||||||
|
def test_direction(self):
|
||||||
|
assert normalize_address("111 First Street North") == "111 first st n"
|
||||||
|
|
||||||
|
def test_collapse_whitespace(self):
|
||||||
|
assert normalize_address(" 123 Main Street ") == "123 main st"
|
||||||
|
|
||||||
|
def test_empty(self):
|
||||||
|
assert normalize_address("") == ""
|
||||||
|
assert normalize_address(None) == ""
|
||||||
|
|
||||||
|
def test_idempotent(self):
|
||||||
|
result = normalize_address("123 Main Street Apartment 4")
|
||||||
|
assert normalize_address(result) == result
|
||||||
|
|
||||||
|
|
||||||
|
class TestNormalizeString:
|
||||||
|
def test_trim_and_casefold(self):
|
||||||
|
assert normalize_string(" Hello World ") == "hello world"
|
||||||
|
|
||||||
|
def test_collapse_whitespace(self):
|
||||||
|
assert normalize_string("a b c") == "a b c"
|
||||||
|
|
||||||
|
def test_empty(self):
|
||||||
|
assert normalize_string("") == ""
|
||||||
|
assert normalize_string(None) == ""
|
||||||
|
|
||||||
|
|
||||||
|
class TestGetNormalizer:
|
||||||
|
def test_get_by_enum(self):
|
||||||
|
fn = get_normalizer(NormalizerType.EMAIL)
|
||||||
|
assert fn("TEST@Gmail.com") == "test@gmail.com"
|
||||||
|
|
||||||
|
def test_get_by_string(self):
|
||||||
|
fn = get_normalizer("phone")
|
||||||
|
assert fn("(555) 123-4567") == "+15551234567"
|
||||||
|
|
||||||
|
def test_unknown_raises(self):
|
||||||
|
with pytest.raises(ValueError):
|
||||||
|
get_normalizer("unknown_type")
|
||||||
Reference in New Issue
Block a user