feat: add documentation, Streamlit GUI, and full source tree

- Rewrite README.md with project overview, quick-start, and CLI summary
- Add docs/CLI-REFERENCE.md with full flag reference and 8 recipe sections
- Add docs/DEVELOPER.md with architecture, data flow, and extension guides
- Rewrite src/core/__init__.py with public API exports and module docstring
- Add Streamlit GUI (src/gui/) with file upload, advanced options, interactive
  match group review with side-by-side diff, and download buttons
- Add .gitignore, requirements.txt, all source code, tests, and sample data
- Add streamlit to requirements.txt

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
2026-04-28 23:06:39 +00:00
parent 0613dc420c
commit b871ab24fc
47 changed files with 4413 additions and 2 deletions

9
.gitignore vendored Normal file
View File

@@ -0,0 +1,9 @@
.venv/
__pycache__/
*.pyc
*.pyo
logs/
*.egg-info/
dist/
build/
.pytest_cache/

124
README.md
View File

@@ -1,3 +1,123 @@
# datatools-dev
# DataTools Deduplicator
Data tools development
Find and remove duplicate rows in CSV and Excel files — with fuzzy matching, smart normalization, and interactive review.
## Features
- **Zero-config start** — auto-detects encoding, delimiters, headers, and match columns
- **Fuzzy matching** — Jaro-Winkler, Levenshtein, and token set ratio algorithms
- **5 built-in normalizers** — email (Gmail dot/plus), phone (E.164), name (titles/suffixes), address (USPS), string (whitespace/case)
- **Merge mode** — fill missing fields in the surviving row from removed duplicates
- **4 survivor rules** — keep first, last, most complete, or most recent row per group
- **Interactive review** — inspect each match group and decide: merge, keep both, or skip
- **Config profiles** — save and reload your settings as JSON for repeatable runs
- **Dual interface** — full CLI for automation, Streamlit GUI for visual review
- **Dry-run by default** — preview what would change before writing anything
- **Audit trail** — every run produces a match groups report and timestamped log
## Quick Start
### Install
```bash
pip install -r requirements.txt
```
### CLI
```bash
# Preview duplicates (dry run — no files written)
python -m src.cli customers.csv
# Remove duplicates and save the result
python -m src.cli customers.csv --apply
# Fuzzy-match names at 80% similarity, merge missing fields
python -m src.cli customers.csv --fuzzy name --threshold 80 --merge --apply
# Interactively review each match group
python -m src.cli customers.csv --review --apply
```
### GUI
```bash
streamlit run src/gui/app.py
```
Upload a file, click **Find Duplicates**, review match groups side-by-side, then download the cleaned result.
## CLI Usage Summary
```
python -m src.cli INPUT_FILE [OPTIONS]
Options:
--apply Write output files (default: preview only)
--output, -o PATH Output file path
--subset, -s COLS Columns to match on (comma-separated)
--key, -k COLS Strong-key columns for exact matching
--fuzzy COLS Columns to fuzzy-match
--algorithm, -a ALG levenshtein | jaro_winkler | token_set_ratio
--threshold, -t N Similarity threshold 0-100 (default: 85)
--normalize COL:TYPE Per-column normalizers (e.g., email:email,phone:phone)
--survivor RULE first | last | most-complete | most-recent
--merge Fill missing fields from removed duplicates
--review Interactively review each match group
--config PATH Load settings from a JSON config file
--save-config PATH Save current settings to JSON
--sheet NAME Excel sheet name or 0-based index
--encoding ENC Override auto-detected encoding
--header-row N 0-based header row index
--help Show full help
```
## Sample Output
```
$ python -m src.cli samples/messy_sales.csv
Reading messy_sales.csv...
50 rows, 8 columns
Finding duplicates...
──────────────────────────────────────────────────
File: messy_sales.csv
Rows in: 50
Rows out: 28
Removed: 22
Groups: 22
──────────────────────────────────────────────────
Match groups:
Group 1: rows [1, 2] → keep row 1 (confidence: 100.0%, matched on: email)
Group 2: rows [3, 4] → keep row 3 (confidence: 92.3%, matched on: name, phone)
...
This was a preview. Add --apply to write the output files.
```
## Output Files
When `--apply` is used, three files are produced:
| File | Contents |
|------|----------|
| `{input}_deduplicated.csv` | Cleaned data with duplicates removed |
| `{input}_removed.csv` | Rows that were removed |
| `{input}_match_groups.csv` | Audit trail: group ID, confidence, matched columns, survivor flag |
## Documentation
- [CLI Reference](docs/CLI-REFERENCE.md) — every flag with examples and recipe sections
- [Developer Guide](docs/DEVELOPER.md) — architecture, data flow, how to extend
- [User Guide](docs/USER-GUIDE.md) — installation and usage for end users
## Requirements
- Python 3.10+
- Dependencies: pandas, openpyxl, rapidfuzz, typer, phonenumbers, loguru, tqdm, charset-normalizer
## License
Proprietary. All rights reserved.

284
docs/CLI-REFERENCE.md Normal file
View File

@@ -0,0 +1,284 @@
# CLI Reference
Complete command-line reference for the DataTools Deduplicator.
```
python -m src.cli INPUT_FILE [OPTIONS]
```
## Arguments
| Argument | Required | Description |
|----------|----------|-------------|
| `INPUT_FILE` | Yes | Path to the CSV or Excel file to deduplicate |
## Options
### Core
| Flag | Short | Default | Description |
|------|-------|---------|-------------|
| `--apply` | | `false` | Write output files. Without this flag, only a preview is shown. |
| `--output` | `-o` | `{input}_deduplicated.csv` | Output file path. |
### Column Selection
| Flag | Short | Default | Description |
|------|-------|---------|-------------|
| `--subset` | `-s` | auto-detect | Comma-separated columns to match on. When omitted, columns are auto-detected by name pattern (email, phone, name, address). |
| `--key` | `-k` | none | Comma-separated strong-key columns. Each becomes an independent exact-match strategy. Use for identifiers like `fb_id`, `ein`, `sku`. |
### Fuzzy Matching
| Flag | Short | Default | Description |
|------|-------|---------|-------------|
| `--fuzzy` | | none | Comma-separated columns to fuzzy-match. Other columns in the strategy use exact matching. |
| `--algorithm` | `-a` | `jaro_winkler` | Fuzzy algorithm: `levenshtein`, `jaro_winkler`, or `token_set_ratio`. |
| `--threshold` | `-t` | `85` | Similarity threshold 0-100. Lower values find more matches but increase false positives. |
### Normalization
| Flag | Short | Default | Description |
|------|-------|---------|-------------|
| `--normalize` | | auto-detect | Column normalizers as `col:type` pairs, comma-separated. Types: `email`, `phone`, `name`, `address`, `string`. |
**Normalizer details:**
| Type | What it does | Example |
|------|-------------|---------|
| `email` | Lowercase, strip Gmail dots, strip `+tag` suffixes | `John.Doe+tag@gmail.com``johndoe@gmail.com` |
| `phone` | Parse to E.164 format; fallback: digits only | `(555) 123-4567``+15551234567` |
| `name` | Strip titles (Dr., Mr.) and suffixes (Jr., PhD), case-fold | `Dr. John Smith Jr.``john smith` |
| `address` | USPS abbreviations (Street→St, Avenue→Ave), case-fold | `123 Main Street, Suite 4``123 main st ste 4` |
| `string` | Trim, collapse whitespace, case-fold | ` HELLO WORLD ``hello world` |
### Survivor Selection
| Flag | Short | Default | Description |
|------|-------|---------|-------------|
| `--survivor` | | `first` | Which row to keep per duplicate group. |
| `--date-column` | | none | Date column for the `most-recent` rule. |
| `--merge` | | `false` | Fill missing fields in the surviving row from removed duplicates. |
**Survivor rules:**
| Rule | Behavior |
|------|----------|
| `first` | Keep the first row encountered (lowest row number) |
| `last` | Keep the last row encountered (highest row number) |
| `most-complete` | Keep the row with the fewest blank/empty cells |
| `most-recent` | Keep the row with the latest date (requires `--date-column`) |
### Interactive Review
| Flag | Short | Default | Description |
|------|-------|---------|-------------|
| `--review` | | `false` | Interactively review each match group. For each group, choose: merge (y), keep both (n), or skip remaining (s). |
### Configuration
| Flag | Short | Default | Description |
|------|-------|---------|-------------|
| `--config` | | none | Load all settings from a saved JSON config file. |
| `--save-config` | | none | Save current settings to a JSON config file for reuse. |
### File Handling
| Flag | Short | Default | Description |
|------|-------|---------|-------------|
| `--sheet` | | first sheet | Excel sheet name or 0-based index. Ignored for CSV files. |
| `--encoding` | | auto-detect | Override auto-detected file encoding (e.g., `utf-8`, `windows-1252`). |
| `--header-row` | | auto-detect | 0-based row index for the header row. |
---
## Recipes
### 1. Basic Dedup (Auto-Detect)
Let the engine detect email, phone, name, and address columns automatically.
```bash
# Preview
python -m src.cli customers.csv
# Apply
python -m src.cli customers.csv --apply
```
The engine scans column names for patterns like `email`, `phone`, `name`, `address` and builds strategies automatically. Strong keys (email, phone) become standalone strategies; weak keys (name, address) are paired with strong keys.
### 2. Fuzzy Name Matching
Match rows where names are similar but not identical — catches typos, nickname variations, and formatting differences.
```bash
# Fuzzy-match on the "name" column at 80% similarity
python -m src.cli customers.csv --fuzzy name --threshold 80 --apply
# Fuzzy-match on multiple columns
python -m src.cli customers.csv --fuzzy name,address --threshold 85 --apply
# Use Levenshtein distance instead of Jaro-Winkler
python -m src.cli customers.csv --fuzzy name --algorithm levenshtein --threshold 80 --apply
```
**Algorithm comparison:**
- `jaro_winkler` (default) — best for short strings like names; weights early characters more heavily
- `levenshtein` — edit-distance ratio; works well for typos and transpositions
- `token_set_ratio` — best for addresses and long strings; ignores word order
### 3. Custom Strong Keys
Use specific identifier columns to find exact duplicates.
```bash
# Deduplicate by Facebook ID
python -m src.cli donors.csv --key fb_id --apply
# Multiple strong keys (each is independent — matched with OR)
python -m src.cli donors.csv --key fb_id,ein --apply
```
Strong keys are OR'd: a match on `fb_id` alone OR `ein` alone marks rows as duplicates.
### 4. Merge Mode
Keep the most complete row and fill any remaining blanks from the duplicates.
```bash
# Most complete row + merge missing fields
python -m src.cli contacts.csv --survivor most-complete --merge --apply
# Keep most recent row and merge
python -m src.cli contacts.csv --survivor most-recent --date-column updated_at --merge --apply
```
**How merge works:** The survivor row keeps all its non-empty fields. For any blank/null fields, the engine fills from the removed rows (in row order). The result is a single row with maximum data retention.
### 5. Multi-Column Subset
Match on a specific combination of columns rather than auto-detecting.
```bash
# Exact match on email + phone only
python -m src.cli customers.csv --subset email,phone --apply
# Mix exact and fuzzy within a subset
python -m src.cli customers.csv --subset email,name --fuzzy name --threshold 85 --apply
```
When using `--subset`, all listed columns must match (AND logic) for a pair to be considered duplicates.
### 6. Save and Load Config Profiles
Save your settings for repeatable runs on similar files.
```bash
# Save settings to a file
python -m src.cli customers.csv --fuzzy name --threshold 80 --merge \
--survivor most-complete --save-config customer_dedup.json
# Load saved settings
python -m src.cli new_customers.csv --config customer_dedup.json --apply
```
Config files are JSON. Example:
```json
{
"strategies": [],
"survivor_rule": "most_complete",
"merge": true,
"default_algorithm": "jaro_winkler",
"default_threshold": 80.0,
"fuzzy_columns": ["name"]
}
```
### 7. Interactive Review
Step through each match group and decide whether to merge.
```bash
python -m src.cli customers.csv --review --apply
```
For each group, the CLI displays both rows side-by-side and prompts:
```
============================================================
Match Group 1 — Confidence: 92.3%
Matched on: name, phone
============================================================
Row 1:
name: John Smith
email: john@example.com
phone: (555) 123-4567
Row 2:
name: Jon Smith
email:
phone: 555-123-4567
[y] Merge [n] Keep both [s] Skip remaining:
```
- **y** — accept the match; merge/remove duplicate
- **n** — reject the match; keep both rows
- **s** — skip all remaining groups (keep both for all)
### 8. Excel Files and Multi-Sheet
Work with Excel files directly — no CSV conversion needed.
```bash
# Deduplicate first sheet (default)
python -m src.cli data.xlsx --apply
# Specify sheet by name
python -m src.cli data.xlsx --sheet "Sales Data" --apply
# Specify sheet by index (0-based)
python -m src.cli data.xlsx --sheet 1 --apply
```
Output is always CSV by default. To write Excel output, use `-o`:
```bash
python -m src.cli data.xlsx -o cleaned.xlsx --apply
```
---
## Auto-Detection Details
When no `--subset` or `--fuzzy` flags are provided, the engine scans column names and builds strategies:
| Column pattern | Detection regex | Algorithm | Threshold | Normalizer | Key type |
|---------------|----------------|-----------|-----------|------------|----------|
| Email | `e[-_]?mail` | exact | 100% | email | strong |
| Phone | `phone\|telephone\|mobile\|cell` | exact | 100% | phone | strong |
| Name | `^(name\|full_name\|customer_name\|...)$` | jaro_winkler | 85% | name | weak |
| Address | `address\|street\|addr` | token_set_ratio | 80% | address | weak |
**Strategy building rules:**
- Strong keys → standalone OR strategies (email match alone is enough)
- Weak keys → paired with each strong key via AND (name match requires email or phone match too)
- No strong keys found → weak keys promoted to standalone
- No patterns matched → exact match on all columns (equivalent to `drop_duplicates`)
## Output Files
When `--apply` is set, three files are written:
| File | Description |
|------|-------------|
| `{stem}_deduplicated.csv` | Cleaned DataFrame with duplicates removed |
| `{stem}_removed.csv` | Rows that were removed |
| `{stem}_match_groups.csv` | Audit trail with `_group_id`, `_is_survivor`, `_confidence`, `_matched_on`, `_original_row`, plus all original columns |
## Logging
Every run writes a timestamped log to `logs/dedup_YYYYMMDD_HHMMSS.log` with full debug-level details: strategies used, pair comparisons, survivor decisions, and merge actions.

282
docs/DEVELOPER.md Normal file
View File

@@ -0,0 +1,282 @@
# Developer Guide
Architecture, data flow, and extension guide for the DataTools Deduplicator.
## Architecture
```
CLI (src/cli.py) GUI (src/gui/app.py)
│ │
│ flags → strategies │ widgets → strategies
│ _interactive_review() │ match_group_card()
│ tqdm progress bar │ st.progress()
│ │
└──────────┐ ┌────────────────┘
│ │
▼ ▼
┌─────────────────┐
│ core.dedup │
│ deduplicate() │
└────────┬────────┘
┌────────────┼────────────┐
▼ ▼ ▼
core.io core.normalizers core.config
read/write normalize_*() save/load JSON
```
**Key principle:** All business logic lives in `src/core/`. The CLI and GUI are thin wrappers that translate user input into `deduplicate()` arguments and display the `DeduplicationResult`.
## File-by-File Reference
### src/core/dedup.py — Deduplication Engine
The central module. Contains:
- **Enums:** `Algorithm` (4 fuzzy algorithms), `SurvivorRule` (4 selection rules)
- **Data classes:** `ColumnMatchStrategy`, `MatchStrategy`, `MatchResult`, `DeduplicationResult`
- **`deduplicate()`** — main entry point. Takes a DataFrame + optional strategies/rules, returns a `DeduplicationResult` with deduplicated DataFrame, removed rows, match groups, and log entries.
- **`build_default_strategies()`** — scans column names with regex patterns to auto-detect email, phone, name, and address columns. Builds strong/weak key strategies with appropriate algorithms and normalizers.
- **`_UnionFind`** — disjoint-set data structure for transitive closure. If A matches B and B matches C, all three end up in one group.
- **`_find_match_groups()`** — O(n^2) pairwise comparison. For each pair, tries all strategies (OR semantics). Feeds matches into union-find. Returns match groups with confidence scores.
- **`_select_survivor()`** — picks the row to keep based on the survivor rule.
- **`_merge_group()`** — fills blank fields in the survivor from loser rows.
### src/core/normalizers.py — Text Normalization
Five normalizer functions, each `str → str`, idempotent, None-safe:
- **`normalize_email()`** — lowercase, strip Gmail dots, strip `+tag` suffixes
- **`normalize_phone()`** — parse with `phonenumbers` to E.164; fallback to digits-only
- **`normalize_name()`** — strip title prefixes (Dr., Mr.) and suffixes (Jr., PhD), case-fold
- **`normalize_address()`** — USPS abbreviations (Street→St, Avenue→Ave), case-fold
- **`normalize_string()`** — trim, collapse whitespace, case-fold
The `get_normalizer()` registry function maps `NormalizerType` enum values to functions.
### src/core/io.py — File I/O
Auto-detection stack:
1. **`detect_encoding()`** — checks BOM, then uses `charset-normalizer` heuristics
2. **`detect_delimiter()`** — uses `csv.Sniffer` on first 20 lines
3. **`detect_header_row()`** — finds first row where all cells look like column names
Main functions:
- **`read_file()`** — reads CSV/TSV/Excel with full auto-detection. Returns a DataFrame.
- **`write_file()`** — writes DataFrame to CSV or Excel. Uses `utf-8-sig` by default for Windows Excel compatibility.
- **`list_sheets()`** — returns sheet names from an Excel workbook.
### src/core/config.py — Configuration Profiles
Save/load deduplication settings as JSON:
- **`DeduplicationConfig`** — flat dataclass with all settings: strategies, survivor rule, merge flag, algorithm, threshold, normalizer map.
- **`.to_file()` / `.from_file()`** — JSON serialization
- **`.to_strategies()`** — converts config back to `MatchStrategy` objects for the engine
- **`.to_survivor_rule()`** — converts string to `SurvivorRule` enum
### src/cli.py — Command-Line Interface
Typer-based CLI with 17 options. Key responsibilities:
- Parse flags into strategies, survivor rule, and other config
- Set up logging (timestamped log files in `logs/`)
- Column name validation with fuzzy suggestions on typos
- `_interactive_review()` — side-by-side row display with y/n/s prompts
- Progress bar via `tqdm` for files > 10,000 rows
- Output formatting and file writing
### src/gui/app.py — Streamlit GUI
Single-page layout:
- File upload with instant preview
- Advanced options expander (column selection, fuzzy, normalizers, survivor rule, merge, config profiles)
- Find Duplicates button → runs `deduplicate()` with `progress_callback`
- Interactive review: expandable match group cards with merge/keep/skip buttons
- Download buttons for deduplicated CSV, removed rows, and match groups report
### src/gui/components.py — Reusable GUI Widgets
- **`match_group_card()`** — expandable card showing side-by-side row comparison with diff highlighting
- **`config_panel()`** — the advanced options expander, returns a `DeduplicationConfig`
- **`results_summary()`** — summary stats and download buttons
## Data Flow
```
Input File
read_file() ← auto-detect encoding, delimiter, header
DataFrame
build_default_strategies() ← (if no explicit strategies)
│ scan column names → regex patterns
│ strong keys: email, phone (standalone OR)
│ weak keys: name, address (AND with strong)
_apply_normalizations() ← add _norm_* shadow columns
│ normalize_email(), normalize_phone(), etc.
_find_match_groups() ← O(n²) pairwise comparison
│ for each pair: try all strategies (OR)
│ _compute_similarity() per column
│ union-find for transitive closure
[review_callback()] ← optional: interactive review per group
│ True=accept, False=reject, None=skip
_select_survivor() ← per group: first/last/most-complete/most-recent
[_merge_group()] ← optional: fill blanks from losers
DeduplicationResult
├── deduplicated_df ← cleaned DataFrame (shadow cols dropped)
├── removed_df ← rows that were removed
├── match_groups ← list of MatchResult with confidence, columns
└── log_entries ← human-readable audit log
```
## How to Add a Normalizer
1. **Add the function** in `src/core/normalizers.py`:
```python
def normalize_company(value: Optional[str]) -> str:
"""Strip legal suffixes (Inc, LLC, Corp), case-fold."""
if not value or not isinstance(value, str):
return ""
name = value.strip().casefold()
# Strip common suffixes
for suffix in ("inc", "llc", "corp", "ltd", "co"):
name = re.sub(rf"\b{suffix}\.?\s*$", "", name).strip()
return name
```
2. **Register it** in the same file:
```python
class NormalizerType(str, Enum):
# ... existing types ...
COMPANY = "company" # ← add enum value
_NORMALIZER_MAP: dict[NormalizerType, Callable[[str], str]] = {
# ... existing entries ...
NormalizerType.COMPANY: normalize_company, # ← add mapping
}
```
3. **Add auto-detection pattern** in `src/core/dedup.py` (optional):
```python
_COLUMN_TYPE_PATTERNS = [
# ... existing patterns ...
(re.compile(r"company|organization|org_name", re.I),
NormalizerType.COMPANY, Algorithm.TOKEN_SET_RATIO, 85.0, False),
]
```
## How to Add a Matching Algorithm
1. **Add the enum value** in `src/core/dedup.py`:
```python
class Algorithm(str, Enum):
# ... existing values ...
SOUNDEX = "soundex"
```
2. **Add the computation** in `_compute_similarity()`:
```python
def _compute_similarity(val_a: str, val_b: str, algorithm: Algorithm) -> float:
# ... existing cases ...
if algorithm == Algorithm.SOUNDEX:
return 100.0 if _soundex(val_a) == _soundex(val_b) else 0.0
```
3. **Add the CLI flag value** in `src/cli.py` help text for `--algorithm`.
## How to Add a Survivor Strategy
1. **Add the enum value** in `src/core/dedup.py`:
```python
class SurvivorRule(str, Enum):
# ... existing values ...
KEEP_LONGEST = "longest"
```
2. **Add the logic** in `_select_survivor()`:
```python
if rule == SurvivorRule.KEEP_LONGEST:
return max(indices, key=lambda i: len(str(df.iloc[i].to_dict())))
```
3. **Add to the CLI** survivor map in `src/cli.py`.
## Testing
### Run Tests
```bash
# All tests
pytest tests/ -q
# Specific module
pytest tests/test_dedup.py -q
pytest tests/test_normalizers.py -q
pytest tests/test_io.py -q
pytest tests/test_config.py -q
pytest tests/test_cli.py -q
# Verbose with output
pytest tests/ -v
# Stop on first failure
pytest tests/ -x
```
### Test Structure
```
tests/
├── conftest.py # Shared fixtures
│ ├── sample_csv_path # Path to samples/messy_sales.csv
│ ├── sample_df # Loaded sample CSV as DataFrame
│ ├── simple_df # Small 5-row DataFrame with obvious duplicates
│ ├── merge_df # DataFrame with partial records
│ └── tmp_csv # Temporary CSV from simple_df
├── test_dedup.py # Engine tests: similarity, union-find, pairs, integration
├── test_normalizers.py # Normalizer tests: all 5 types with edge cases
├── test_io.py # I/O tests: encoding, delimiter, header, read/write
├── test_config.py # Config tests: serialization round-trip
└── test_cli.py # CLI tests: argument parsing, file handling
```
### Writing Tests
Follow existing patterns. Tests use pytest fixtures from `conftest.py`:
```python
def test_my_feature(simple_df):
"""Test description."""
result = deduplicate(simple_df, ...)
assert len(result.match_groups) == expected
assert result.deduplicated_df.shape[0] == expected_rows
```
## Known Limitations
- **O(n^2) pairwise comparison** — no blocking or indexing. Works well up to ~50,000 rows. Beyond that, performance degrades quadratically. Future optimization: add blocking (partition by first letter, zip code prefix, etc.) to reduce comparison space.
- **No multi-sheet dedup** — each Excel sheet is processed independently. Cross-sheet deduplication is not supported.
- **Phone normalization requires valid-length numbers** — the `phonenumbers` library rejects numbers that are too short or too long for the detected region. Fallback is digits-only, which may produce false negatives for international numbers without country codes.
- **Single-threaded** — no parallel comparison. Could benefit from `multiprocessing` for large files.
- **Memory-bound** — entire file is loaded into a pandas DataFrame. Files larger than available RAM will fail. Chunked reading exists but is not integrated with the dedup engine.

2
requirements-dev.txt Normal file
View File

@@ -0,0 +1,2 @@
pytest>=8.0,<9
pytest-cov>=5.0,<6

10
requirements.txt Normal file
View File

@@ -0,0 +1,10 @@
pandas>=2.2,<3
openpyxl>=3.1,<4
numpy>=1.26,<3
rapidfuzz>=3.6,<4
charset-normalizer>=3.3,<4
loguru>=0.7,<1
tqdm>=4.66,<5
typer>=0.12,<1
phonenumbers>=8.13,<9
streamlit>=1.35,<2

51
samples/messy_sales.csv Normal file
View File

@@ -0,0 +1,51 @@
order_id,customer_name,email,phone,address,product,amount,date
1001,John Smith,john.smith@gmail.com,(555) 123-4567,123 Main Street Apt 4,Widget Pro,49.99,2024-01-15
1002,John Smith,johnsmith@gmail.com,555-123-4567,123 Main St Apt 4,Widget Pro,49.99,2024-01-15
1003,Mr. John Smith,john.smith+promo@gmail.com,+1 555-123-4567,123 Main St. Apartment 4,Widget Pro,,2024-01-15
1004,Sarah Johnson,sarah.j@example.com,(555) 234-5678,456 Oak Avenue,Gadget X,89.99,2024-02-01
1005,Sara Johnson,sarah.j@example.com,555.234.5678,456 Oak Ave,Gadget X,89.99,2024-02-03
1006,Sarah M. Johnson,sarahj@example.com,5552345678,456 Oak Ave Suite 2,,89.99,2024-02-05
1007,Michael Williams,mike.w@company.org,(555) 345-6789,789 Pine Boulevard,Super Tool,129.99,2024-02-10
1008,Mike Williams,mike.w@company.org,555-345-6789,789 Pine Blvd,Super Tool,129.99,2024-02-10
1009,Dr. Michael Williams III,michael.williams@company.org,+15553456789,789 Pine Blvd,,129.99,2024-02-12
1010,Emily Davis,emily.davis@mail.com,(555) 456-7890,321 Elm Drive,Basic Set,29.99,2024-03-01
1011,Emily Davis,emily.davis@mail.com,(555) 456-7890,321 Elm Drive,Basic Set,29.99,2024-03-01
1012,Robert Brown,r.brown@email.net,(555) 567-8901,654 Cedar Lane,Premium Kit,199.99,2024-03-15
1013,Bob Brown,r.brown@email.net,555-567-8901,654 Cedar Ln,Premium Kit,199.99,2024-03-15
1014,Robert J. Brown Jr.,rj.brown@email.net,5555678901,654 Cedar Lane,Premium Kit,,2024-03-17
1015,Jennifer Wilson,jen.wilson@web.io,(555) 678-9012,987 Birch Court,Starter Pack,19.99,2024-04-01
1016,Jennifer Wilson,jen.wilson@web.io,(555) 678-9012,987 Birch Ct,Starter Pack,19.99,2024-04-01
1017,David Martinez,david.m@service.com,(555) 789-0123,246 Maple Road,Deluxe Widget,79.99,2024-04-15
1018,David Martinez,d.martinez@service.com,555-789-0123,246 Maple Rd,Deluxe Widget,79.99,2024-04-15
1019,Lisa Anderson,lisa.a@domain.com,(555) 890-1234,135 Walnut Way,Mini Tool,14.99,2024-05-01
1020,Lisa M Anderson,lisa.a@domain.com,555-890-1234,135 Walnut Way,Mini Tool,14.99,2024-05-01
1021,Thomas Taylor,tom.taylor@inbox.com,(555) 901-2345,864 Spruce Trail,Combo Pack,59.99,2024-05-15
1022,Tom Taylor,tom.taylor@inbox.com,555-901-2345,864 Spruce Trl,Combo Pack,59.99,2024-05-15
1023,Prof. Thomas R. Taylor,thomas.taylor@inbox.com,+1-555-901-2345,864 Spruce Trail Unit 3,,59.99,2024-05-18
1024,Karen White,karen.w@test.com,(555) 012-3456,579 Ash Circle,Widget Pro,49.99,2024-06-01
1025,Karen White,karen.w+newsletter@test.com,555-012-3456,579 Ash Cir,Widget Pro,49.99,2024-06-01
1026,James Clark,j.clark@biz.co,(555) 111-2222,111 First Street North,Enterprise License,499.99,2024-06-15
1027,James Clark,j.clark@biz.co,555-111-2222,111 1st St N,Enterprise License,499.99,2024-06-15
1028,Patricia Lee,pat.lee@mail.com,(555) 222-3333,222 Second Avenue South,Basic Set,29.99,2024-07-01
1029,Pat Lee,pat.lee@mail.com,555-222-3333,222 2nd Ave S,Basic Set,29.99,2024-07-01
1030,Christopher Hall,chris.hall@email.org,(555) 333-4444,333 Third Boulevard East,Gadget X,89.99,2024-07-15
1031,Chris Hall,chris.hall@email.org,555-333-4444,333 3rd Blvd E,Gadget X,89.99,2024-07-15
1032,Amanda Young,amanda.y@web.net,(555) 444-5555,444 Oak Parkway,Super Tool,129.99,2024-08-01
1033,Amanda Young,amanda.y@web.net,(555) 444-5555,444 Oak Pkwy,Super Tool,129.99,2024-08-01
1034,Daniel King,dan.king@corp.io,(555) 555-6666,555 Elm Square,Premium Kit,199.99,2024-08-15
1035,Dan King,dan.king@corp.io,555.555.6666,555 Elm Sq,Premium Kit,199.99,2024-08-15
1036,Michelle Scott,m.scott@place.com,(555) 666-7777,666 Pine Highway,Deluxe Widget,79.99,2024-09-01
1037,Michelle Scott,m.scott@place.com,(555) 666-7777,666 Pine Hwy,Deluxe Widget,79.99,2024-09-01
1038,Kevin Adams,kevin.a@shop.biz,(555) 777-8888,777 Cedar Terrace,Starter Pack,19.99,2024-09-15
1039,Kevin Adams,kevin.a+deals@shop.biz,555-777-8888,777 Cedar Ter,Starter Pack,19.99,2024-09-15
1040,Nancy Wright,nancy.w@home.net,(555) 888-9999,888 Maple Place,Mini Tool,14.99,2024-10-01
1041,Nancy Wright,nancy.w@home.net,(555) 888-9999,888 Maple Pl,Mini Tool,14.99,2024-10-01
1042,George Lopez,g.lopez@firma.com,(555) 999-0000,999 Walnut Expressway,Combo Pack,59.99,2024-10-15
1043,George Lopez,g.lopez@firma.com,555-999-0000,999 Walnut Expy,Combo Pack,59.99,2024-10-15
1044,Sandra Hill,s.hill@provider.co,(555) 101-2020,1010 Spruce Crossing,Widget Pro,49.99,2024-11-01
1045,Sandra Hill,s.hill@provider.co,555-101-2020,1010 Spruce Xing,Widget Pro,49.99,2024-11-01
1046,Mark Robinson,mark.r@office.com,(555) 202-3030,2020 Ash Heights,Enterprise License,499.99,2024-11-15
1047,Mark Robinson,mark.r@office.com,555-202-3030,2020 Ash Hts,Enterprise License,499.99,2024-11-15
1048,Unique Customer One,unique1@solo.com,(555) 000-0001,1 Unique Road,Widget Pro,49.99,2024-12-01
1049,Unique Customer Two,unique2@solo.com,(555) 000-0002,2 Unique Road,Gadget X,89.99,2024-12-02
1050,Unique Customer Three,unique3@solo.com,(555) 000-0003,3 Unique Road,Super Tool,129.99,2024-12-03
1 order_id customer_name email phone address product amount date
2 1001 John Smith john.smith@gmail.com (555) 123-4567 123 Main Street Apt 4 Widget Pro 49.99 2024-01-15
3 1002 John Smith johnsmith@gmail.com 555-123-4567 123 Main St Apt 4 Widget Pro 49.99 2024-01-15
4 1003 Mr. John Smith john.smith+promo@gmail.com +1 555-123-4567 123 Main St. Apartment 4 Widget Pro 2024-01-15
5 1004 Sarah Johnson sarah.j@example.com (555) 234-5678 456 Oak Avenue Gadget X 89.99 2024-02-01
6 1005 Sara Johnson sarah.j@example.com 555.234.5678 456 Oak Ave Gadget X 89.99 2024-02-03
7 1006 Sarah M. Johnson sarahj@example.com 5552345678 456 Oak Ave Suite 2 89.99 2024-02-05
8 1007 Michael Williams mike.w@company.org (555) 345-6789 789 Pine Boulevard Super Tool 129.99 2024-02-10
9 1008 Mike Williams mike.w@company.org 555-345-6789 789 Pine Blvd Super Tool 129.99 2024-02-10
10 1009 Dr. Michael Williams III michael.williams@company.org +15553456789 789 Pine Blvd 129.99 2024-02-12
11 1010 Emily Davis emily.davis@mail.com (555) 456-7890 321 Elm Drive Basic Set 29.99 2024-03-01
12 1011 Emily Davis emily.davis@mail.com (555) 456-7890 321 Elm Drive Basic Set 29.99 2024-03-01
13 1012 Robert Brown r.brown@email.net (555) 567-8901 654 Cedar Lane Premium Kit 199.99 2024-03-15
14 1013 Bob Brown r.brown@email.net 555-567-8901 654 Cedar Ln Premium Kit 199.99 2024-03-15
15 1014 Robert J. Brown Jr. rj.brown@email.net 5555678901 654 Cedar Lane Premium Kit 2024-03-17
16 1015 Jennifer Wilson jen.wilson@web.io (555) 678-9012 987 Birch Court Starter Pack 19.99 2024-04-01
17 1016 Jennifer Wilson jen.wilson@web.io (555) 678-9012 987 Birch Ct Starter Pack 19.99 2024-04-01
18 1017 David Martinez david.m@service.com (555) 789-0123 246 Maple Road Deluxe Widget 79.99 2024-04-15
19 1018 David Martinez d.martinez@service.com 555-789-0123 246 Maple Rd Deluxe Widget 79.99 2024-04-15
20 1019 Lisa Anderson lisa.a@domain.com (555) 890-1234 135 Walnut Way Mini Tool 14.99 2024-05-01
21 1020 Lisa M Anderson lisa.a@domain.com 555-890-1234 135 Walnut Way Mini Tool 14.99 2024-05-01
22 1021 Thomas Taylor tom.taylor@inbox.com (555) 901-2345 864 Spruce Trail Combo Pack 59.99 2024-05-15
23 1022 Tom Taylor tom.taylor@inbox.com 555-901-2345 864 Spruce Trl Combo Pack 59.99 2024-05-15
24 1023 Prof. Thomas R. Taylor thomas.taylor@inbox.com +1-555-901-2345 864 Spruce Trail Unit 3 59.99 2024-05-18
25 1024 Karen White karen.w@test.com (555) 012-3456 579 Ash Circle Widget Pro 49.99 2024-06-01
26 1025 Karen White karen.w+newsletter@test.com 555-012-3456 579 Ash Cir Widget Pro 49.99 2024-06-01
27 1026 James Clark j.clark@biz.co (555) 111-2222 111 First Street North Enterprise License 499.99 2024-06-15
28 1027 James Clark j.clark@biz.co 555-111-2222 111 1st St N Enterprise License 499.99 2024-06-15
29 1028 Patricia Lee pat.lee@mail.com (555) 222-3333 222 Second Avenue South Basic Set 29.99 2024-07-01
30 1029 Pat Lee pat.lee@mail.com 555-222-3333 222 2nd Ave S Basic Set 29.99 2024-07-01
31 1030 Christopher Hall chris.hall@email.org (555) 333-4444 333 Third Boulevard East Gadget X 89.99 2024-07-15
32 1031 Chris Hall chris.hall@email.org 555-333-4444 333 3rd Blvd E Gadget X 89.99 2024-07-15
33 1032 Amanda Young amanda.y@web.net (555) 444-5555 444 Oak Parkway Super Tool 129.99 2024-08-01
34 1033 Amanda Young amanda.y@web.net (555) 444-5555 444 Oak Pkwy Super Tool 129.99 2024-08-01
35 1034 Daniel King dan.king@corp.io (555) 555-6666 555 Elm Square Premium Kit 199.99 2024-08-15
36 1035 Dan King dan.king@corp.io 555.555.6666 555 Elm Sq Premium Kit 199.99 2024-08-15
37 1036 Michelle Scott m.scott@place.com (555) 666-7777 666 Pine Highway Deluxe Widget 79.99 2024-09-01
38 1037 Michelle Scott m.scott@place.com (555) 666-7777 666 Pine Hwy Deluxe Widget 79.99 2024-09-01
39 1038 Kevin Adams kevin.a@shop.biz (555) 777-8888 777 Cedar Terrace Starter Pack 19.99 2024-09-15
40 1039 Kevin Adams kevin.a+deals@shop.biz 555-777-8888 777 Cedar Ter Starter Pack 19.99 2024-09-15
41 1040 Nancy Wright nancy.w@home.net (555) 888-9999 888 Maple Place Mini Tool 14.99 2024-10-01
42 1041 Nancy Wright nancy.w@home.net (555) 888-9999 888 Maple Pl Mini Tool 14.99 2024-10-01
43 1042 George Lopez g.lopez@firma.com (555) 999-0000 999 Walnut Expressway Combo Pack 59.99 2024-10-15
44 1043 George Lopez g.lopez@firma.com 555-999-0000 999 Walnut Expy Combo Pack 59.99 2024-10-15
45 1044 Sandra Hill s.hill@provider.co (555) 101-2020 1010 Spruce Crossing Widget Pro 49.99 2024-11-01
46 1045 Sandra Hill s.hill@provider.co 555-101-2020 1010 Spruce Xing Widget Pro 49.99 2024-11-01
47 1046 Mark Robinson mark.r@office.com (555) 202-3030 2020 Ash Heights Enterprise License 499.99 2024-11-15
48 1047 Mark Robinson mark.r@office.com 555-202-3030 2020 Ash Hts Enterprise License 499.99 2024-11-15
49 1048 Unique Customer One unique1@solo.com (555) 000-0001 1 Unique Road Widget Pro 49.99 2024-12-01
50 1049 Unique Customer Two unique2@solo.com (555) 000-0002 2 Unique Road Gadget X 89.99 2024-12-02
51 1050 Unique Customer Three unique3@solo.com (555) 000-0003 3 Unique Road Super Tool 129.99 2024-12-03

View File

@@ -0,0 +1,50 @@
_group_id,_is_survivor,_confidence,_matched_on,_original_row,order_id,customer_name,email,phone,address,product,amount,date
1,True,83.81,customer_name,1,1001,John Smith,john.smith@gmail.com,(555) 123-4567,123 Main Street Apt 4,Widget Pro,49.99,2024-01-15
1,False,83.81,customer_name,2,1002,John Smith,johnsmith@gmail.com,555-123-4567,123 Main St Apt 4,Widget Pro,49.99,2024-01-15
1,False,83.81,customer_name,3,1003,Mr. John Smith,john.smith+promo@gmail.com,+1 555-123-4567,123 Main St. Apartment 4,Widget Pro,,2024-01-15
2,True,91.67,customer_name,4,1004,Sarah Johnson,sarah.j@example.com,(555) 234-5678,456 Oak Avenue,Gadget X,89.99,2024-02-01
2,False,91.67,customer_name,5,1005,Sara Johnson,sarah.j@example.com,555.234.5678,456 Oak Ave,Gadget X,89.99,2024-02-03
2,False,91.67,customer_name,6,1006,Sarah M. Johnson,sarahj@example.com,5552345678,456 Oak Ave Suite 2,,89.99,2024-02-05
3,True,80.56,customer_name,7,1007,Michael Williams,mike.w@company.org,(555) 345-6789,789 Pine Boulevard,Super Tool,129.99,2024-02-10
3,False,80.56,customer_name,8,1008,Mike Williams,mike.w@company.org,555-345-6789,789 Pine Blvd,Super Tool,129.99,2024-02-10
3,False,80.56,customer_name,9,1009,Dr. Michael Williams III,michael.williams@company.org,+15553456789,789 Pine Blvd,,129.99,2024-02-12
4,True,100.0,customer_name,10,1010,Emily Davis,emily.davis@mail.com,(555) 456-7890,321 Elm Drive,Basic Set,29.99,2024-03-01
4,False,100.0,customer_name,11,1011,Emily Davis,emily.davis@mail.com,(555) 456-7890,321 Elm Drive,Basic Set,29.99,2024-03-01
5,True,81.02,customer_name,12,1012,Robert Brown,r.brown@email.net,(555) 567-8901,654 Cedar Lane,Premium Kit,199.99,2024-03-15
5,False,81.02,customer_name,13,1013,Bob Brown,r.brown@email.net,555-567-8901,654 Cedar Ln,Premium Kit,199.99,2024-03-15
5,False,81.02,customer_name,14,1014,Robert J. Brown Jr.,rj.brown@email.net,5555678901,654 Cedar Lane,Premium Kit,,2024-03-17
6,True,100.0,customer_name,15,1015,Jennifer Wilson,jen.wilson@web.io,(555) 678-9012,987 Birch Court,Starter Pack,19.99,2024-04-01
6,False,100.0,customer_name,16,1016,Jennifer Wilson,jen.wilson@web.io,(555) 678-9012,987 Birch Ct,Starter Pack,19.99,2024-04-01
7,True,100.0,customer_name,17,1017,David Martinez,david.m@service.com,(555) 789-0123,246 Maple Road,Deluxe Widget,79.99,2024-04-15
7,False,100.0,customer_name,18,1018,David Martinez,d.martinez@service.com,555-789-0123,246 Maple Rd,Deluxe Widget,79.99,2024-04-15
8,True,97.33,customer_name,19,1019,Lisa Anderson,lisa.a@domain.com,(555) 890-1234,135 Walnut Way,Mini Tool,14.99,2024-05-01
8,False,97.33,customer_name,20,1020,Lisa M Anderson,lisa.a@domain.com,555-890-1234,135 Walnut Way,Mini Tool,14.99,2024-05-01
9,True,90.08,customer_name,21,1021,Thomas Taylor,tom.taylor@inbox.com,(555) 901-2345,864 Spruce Trail,Combo Pack,59.99,2024-05-15
9,False,90.08,customer_name,22,1022,Tom Taylor,tom.taylor@inbox.com,555-901-2345,864 Spruce Trl,Combo Pack,59.99,2024-05-15
10,True,100.0,customer_name,24,1024,Karen White,karen.w@test.com,(555) 012-3456,579 Ash Circle,Widget Pro,49.99,2024-06-01
10,False,100.0,customer_name,25,1025,Karen White,karen.w+newsletter@test.com,555-012-3456,579 Ash Cir,Widget Pro,49.99,2024-06-01
11,True,100.0,customer_name,26,1026,James Clark,j.clark@biz.co,(555) 111-2222,111 First Street North,Enterprise License,499.99,2024-06-15
11,False,100.0,customer_name,27,1027,James Clark,j.clark@biz.co,555-111-2222,111 1st St N,Enterprise License,499.99,2024-06-15
12,True,90.28,customer_name,28,1028,Patricia Lee,pat.lee@mail.com,(555) 222-3333,222 Second Avenue South,Basic Set,29.99,2024-07-01
12,False,90.28,customer_name,29,1029,Pat Lee,pat.lee@mail.com,555-222-3333,222 2nd Ave S,Basic Set,29.99,2024-07-01
13,True,92.5,customer_name,30,1030,Christopher Hall,chris.hall@email.org,(555) 333-4444,333 Third Boulevard East,Gadget X,89.99,2024-07-15
13,False,92.5,customer_name,31,1031,Chris Hall,chris.hall@email.org,555-333-4444,333 3rd Blvd E,Gadget X,89.99,2024-07-15
14,True,100.0,customer_name,32,1032,Amanda Young,amanda.y@web.net,(555) 444-5555,444 Oak Parkway,Super Tool,129.99,2024-08-01
14,False,100.0,customer_name,33,1033,Amanda Young,amanda.y@web.net,(555) 444-5555,444 Oak Pkwy,Super Tool,129.99,2024-08-01
15,True,90.72,customer_name,34,1034,Daniel King,dan.king@corp.io,(555) 555-6666,555 Elm Square,Premium Kit,199.99,2024-08-15
15,False,90.72,customer_name,35,1035,Dan King,dan.king@corp.io,555.555.6666,555 Elm Sq,Premium Kit,199.99,2024-08-15
16,True,100.0,customer_name,36,1036,Michelle Scott,m.scott@place.com,(555) 666-7777,666 Pine Highway,Deluxe Widget,79.99,2024-09-01
16,False,100.0,customer_name,37,1037,Michelle Scott,m.scott@place.com,(555) 666-7777,666 Pine Hwy,Deluxe Widget,79.99,2024-09-01
17,True,100.0,customer_name,38,1038,Kevin Adams,kevin.a@shop.biz,(555) 777-8888,777 Cedar Terrace,Starter Pack,19.99,2024-09-15
17,False,100.0,customer_name,39,1039,Kevin Adams,kevin.a+deals@shop.biz,555-777-8888,777 Cedar Ter,Starter Pack,19.99,2024-09-15
18,True,100.0,customer_name,40,1040,Nancy Wright,nancy.w@home.net,(555) 888-9999,888 Maple Place,Mini Tool,14.99,2024-10-01
18,False,100.0,customer_name,41,1041,Nancy Wright,nancy.w@home.net,(555) 888-9999,888 Maple Pl,Mini Tool,14.99,2024-10-01
19,True,100.0,customer_name,42,1042,George Lopez,g.lopez@firma.com,(555) 999-0000,999 Walnut Expressway,Combo Pack,59.99,2024-10-15
19,False,100.0,customer_name,43,1043,George Lopez,g.lopez@firma.com,555-999-0000,999 Walnut Expy,Combo Pack,59.99,2024-10-15
20,True,100.0,customer_name,44,1044,Sandra Hill,s.hill@provider.co,(555) 101-2020,1010 Spruce Crossing,Widget Pro,49.99,2024-11-01
20,False,100.0,customer_name,45,1045,Sandra Hill,s.hill@provider.co,555-101-2020,1010 Spruce Xing,Widget Pro,49.99,2024-11-01
21,True,100.0,customer_name,46,1046,Mark Robinson,mark.r@office.com,(555) 202-3030,2020 Ash Heights,Enterprise License,499.99,2024-11-15
21,False,100.0,customer_name,47,1047,Mark Robinson,mark.r@office.com,555-202-3030,2020 Ash Hts,Enterprise License,499.99,2024-11-15
22,True,93.68,customer_name,48,1048,Unique Customer One,unique1@solo.com,(555) 000-0001,1 Unique Road,Widget Pro,49.99,2024-12-01
22,False,93.68,customer_name,49,1049,Unique Customer Two,unique2@solo.com,(555) 000-0002,2 Unique Road,Gadget X,89.99,2024-12-02
22,False,93.68,customer_name,50,1050,Unique Customer Three,unique3@solo.com,(555) 000-0003,3 Unique Road,Super Tool,129.99,2024-12-03
1 _group_id _is_survivor _confidence _matched_on _original_row order_id customer_name email phone address product amount date
2 1 True 83.81 customer_name 1 1001 John Smith john.smith@gmail.com (555) 123-4567 123 Main Street Apt 4 Widget Pro 49.99 2024-01-15
3 1 False 83.81 customer_name 2 1002 John Smith johnsmith@gmail.com 555-123-4567 123 Main St Apt 4 Widget Pro 49.99 2024-01-15
4 1 False 83.81 customer_name 3 1003 Mr. John Smith john.smith+promo@gmail.com +1 555-123-4567 123 Main St. Apartment 4 Widget Pro 2024-01-15
5 2 True 91.67 customer_name 4 1004 Sarah Johnson sarah.j@example.com (555) 234-5678 456 Oak Avenue Gadget X 89.99 2024-02-01
6 2 False 91.67 customer_name 5 1005 Sara Johnson sarah.j@example.com 555.234.5678 456 Oak Ave Gadget X 89.99 2024-02-03
7 2 False 91.67 customer_name 6 1006 Sarah M. Johnson sarahj@example.com 5552345678 456 Oak Ave Suite 2 89.99 2024-02-05
8 3 True 80.56 customer_name 7 1007 Michael Williams mike.w@company.org (555) 345-6789 789 Pine Boulevard Super Tool 129.99 2024-02-10
9 3 False 80.56 customer_name 8 1008 Mike Williams mike.w@company.org 555-345-6789 789 Pine Blvd Super Tool 129.99 2024-02-10
10 3 False 80.56 customer_name 9 1009 Dr. Michael Williams III michael.williams@company.org +15553456789 789 Pine Blvd 129.99 2024-02-12
11 4 True 100.0 customer_name 10 1010 Emily Davis emily.davis@mail.com (555) 456-7890 321 Elm Drive Basic Set 29.99 2024-03-01
12 4 False 100.0 customer_name 11 1011 Emily Davis emily.davis@mail.com (555) 456-7890 321 Elm Drive Basic Set 29.99 2024-03-01
13 5 True 81.02 customer_name 12 1012 Robert Brown r.brown@email.net (555) 567-8901 654 Cedar Lane Premium Kit 199.99 2024-03-15
14 5 False 81.02 customer_name 13 1013 Bob Brown r.brown@email.net 555-567-8901 654 Cedar Ln Premium Kit 199.99 2024-03-15
15 5 False 81.02 customer_name 14 1014 Robert J. Brown Jr. rj.brown@email.net 5555678901 654 Cedar Lane Premium Kit 2024-03-17
16 6 True 100.0 customer_name 15 1015 Jennifer Wilson jen.wilson@web.io (555) 678-9012 987 Birch Court Starter Pack 19.99 2024-04-01
17 6 False 100.0 customer_name 16 1016 Jennifer Wilson jen.wilson@web.io (555) 678-9012 987 Birch Ct Starter Pack 19.99 2024-04-01
18 7 True 100.0 customer_name 17 1017 David Martinez david.m@service.com (555) 789-0123 246 Maple Road Deluxe Widget 79.99 2024-04-15
19 7 False 100.0 customer_name 18 1018 David Martinez d.martinez@service.com 555-789-0123 246 Maple Rd Deluxe Widget 79.99 2024-04-15
20 8 True 97.33 customer_name 19 1019 Lisa Anderson lisa.a@domain.com (555) 890-1234 135 Walnut Way Mini Tool 14.99 2024-05-01
21 8 False 97.33 customer_name 20 1020 Lisa M Anderson lisa.a@domain.com 555-890-1234 135 Walnut Way Mini Tool 14.99 2024-05-01
22 9 True 90.08 customer_name 21 1021 Thomas Taylor tom.taylor@inbox.com (555) 901-2345 864 Spruce Trail Combo Pack 59.99 2024-05-15
23 9 False 90.08 customer_name 22 1022 Tom Taylor tom.taylor@inbox.com 555-901-2345 864 Spruce Trl Combo Pack 59.99 2024-05-15
24 10 True 100.0 customer_name 24 1024 Karen White karen.w@test.com (555) 012-3456 579 Ash Circle Widget Pro 49.99 2024-06-01
25 10 False 100.0 customer_name 25 1025 Karen White karen.w+newsletter@test.com 555-012-3456 579 Ash Cir Widget Pro 49.99 2024-06-01
26 11 True 100.0 customer_name 26 1026 James Clark j.clark@biz.co (555) 111-2222 111 First Street North Enterprise License 499.99 2024-06-15
27 11 False 100.0 customer_name 27 1027 James Clark j.clark@biz.co 555-111-2222 111 1st St N Enterprise License 499.99 2024-06-15
28 12 True 90.28 customer_name 28 1028 Patricia Lee pat.lee@mail.com (555) 222-3333 222 Second Avenue South Basic Set 29.99 2024-07-01
29 12 False 90.28 customer_name 29 1029 Pat Lee pat.lee@mail.com 555-222-3333 222 2nd Ave S Basic Set 29.99 2024-07-01
30 13 True 92.5 customer_name 30 1030 Christopher Hall chris.hall@email.org (555) 333-4444 333 Third Boulevard East Gadget X 89.99 2024-07-15
31 13 False 92.5 customer_name 31 1031 Chris Hall chris.hall@email.org 555-333-4444 333 3rd Blvd E Gadget X 89.99 2024-07-15
32 14 True 100.0 customer_name 32 1032 Amanda Young amanda.y@web.net (555) 444-5555 444 Oak Parkway Super Tool 129.99 2024-08-01
33 14 False 100.0 customer_name 33 1033 Amanda Young amanda.y@web.net (555) 444-5555 444 Oak Pkwy Super Tool 129.99 2024-08-01
34 15 True 90.72 customer_name 34 1034 Daniel King dan.king@corp.io (555) 555-6666 555 Elm Square Premium Kit 199.99 2024-08-15
35 15 False 90.72 customer_name 35 1035 Dan King dan.king@corp.io 555.555.6666 555 Elm Sq Premium Kit 199.99 2024-08-15
36 16 True 100.0 customer_name 36 1036 Michelle Scott m.scott@place.com (555) 666-7777 666 Pine Highway Deluxe Widget 79.99 2024-09-01
37 16 False 100.0 customer_name 37 1037 Michelle Scott m.scott@place.com (555) 666-7777 666 Pine Hwy Deluxe Widget 79.99 2024-09-01
38 17 True 100.0 customer_name 38 1038 Kevin Adams kevin.a@shop.biz (555) 777-8888 777 Cedar Terrace Starter Pack 19.99 2024-09-15
39 17 False 100.0 customer_name 39 1039 Kevin Adams kevin.a+deals@shop.biz 555-777-8888 777 Cedar Ter Starter Pack 19.99 2024-09-15
40 18 True 100.0 customer_name 40 1040 Nancy Wright nancy.w@home.net (555) 888-9999 888 Maple Place Mini Tool 14.99 2024-10-01
41 18 False 100.0 customer_name 41 1041 Nancy Wright nancy.w@home.net (555) 888-9999 888 Maple Pl Mini Tool 14.99 2024-10-01
42 19 True 100.0 customer_name 42 1042 George Lopez g.lopez@firma.com (555) 999-0000 999 Walnut Expressway Combo Pack 59.99 2024-10-15
43 19 False 100.0 customer_name 43 1043 George Lopez g.lopez@firma.com 555-999-0000 999 Walnut Expy Combo Pack 59.99 2024-10-15
44 20 True 100.0 customer_name 44 1044 Sandra Hill s.hill@provider.co (555) 101-2020 1010 Spruce Crossing Widget Pro 49.99 2024-11-01
45 20 False 100.0 customer_name 45 1045 Sandra Hill s.hill@provider.co 555-101-2020 1010 Spruce Xing Widget Pro 49.99 2024-11-01
46 21 True 100.0 customer_name 46 1046 Mark Robinson mark.r@office.com (555) 202-3030 2020 Ash Heights Enterprise License 499.99 2024-11-15
47 21 False 100.0 customer_name 47 1047 Mark Robinson mark.r@office.com 555-202-3030 2020 Ash Hts Enterprise License 499.99 2024-11-15
48 22 True 93.68 customer_name 48 1048 Unique Customer One unique1@solo.com (555) 000-0001 1 Unique Road Widget Pro 49.99 2024-12-01
49 22 False 93.68 customer_name 49 1049 Unique Customer Two unique2@solo.com (555) 000-0002 2 Unique Road Gadget X 89.99 2024-12-02
50 22 False 93.68 customer_name 50 1050 Unique Customer Three unique3@solo.com (555) 000-0003 3 Unique Road Super Tool 129.99 2024-12-03

View File

@@ -0,0 +1,28 @@
order_id,customer_name,email,phone,address,product,amount,date
1002,John Smith,johnsmith@gmail.com,555-123-4567,123 Main St Apt 4,Widget Pro,49.99,2024-01-15
1003,Mr. John Smith,john.smith+promo@gmail.com,+1 555-123-4567,123 Main St. Apartment 4,Widget Pro,,2024-01-15
1005,Sara Johnson,sarah.j@example.com,555.234.5678,456 Oak Ave,Gadget X,89.99,2024-02-03
1006,Sarah M. Johnson,sarahj@example.com,5552345678,456 Oak Ave Suite 2,,89.99,2024-02-05
1008,Mike Williams,mike.w@company.org,555-345-6789,789 Pine Blvd,Super Tool,129.99,2024-02-10
1009,Dr. Michael Williams III,michael.williams@company.org,+15553456789,789 Pine Blvd,,129.99,2024-02-12
1011,Emily Davis,emily.davis@mail.com,(555) 456-7890,321 Elm Drive,Basic Set,29.99,2024-03-01
1013,Bob Brown,r.brown@email.net,555-567-8901,654 Cedar Ln,Premium Kit,199.99,2024-03-15
1014,Robert J. Brown Jr.,rj.brown@email.net,5555678901,654 Cedar Lane,Premium Kit,,2024-03-17
1016,Jennifer Wilson,jen.wilson@web.io,(555) 678-9012,987 Birch Ct,Starter Pack,19.99,2024-04-01
1018,David Martinez,d.martinez@service.com,555-789-0123,246 Maple Rd,Deluxe Widget,79.99,2024-04-15
1020,Lisa M Anderson,lisa.a@domain.com,555-890-1234,135 Walnut Way,Mini Tool,14.99,2024-05-01
1022,Tom Taylor,tom.taylor@inbox.com,555-901-2345,864 Spruce Trl,Combo Pack,59.99,2024-05-15
1025,Karen White,karen.w+newsletter@test.com,555-012-3456,579 Ash Cir,Widget Pro,49.99,2024-06-01
1027,James Clark,j.clark@biz.co,555-111-2222,111 1st St N,Enterprise License,499.99,2024-06-15
1029,Pat Lee,pat.lee@mail.com,555-222-3333,222 2nd Ave S,Basic Set,29.99,2024-07-01
1031,Chris Hall,chris.hall@email.org,555-333-4444,333 3rd Blvd E,Gadget X,89.99,2024-07-15
1033,Amanda Young,amanda.y@web.net,(555) 444-5555,444 Oak Pkwy,Super Tool,129.99,2024-08-01
1035,Dan King,dan.king@corp.io,555.555.6666,555 Elm Sq,Premium Kit,199.99,2024-08-15
1037,Michelle Scott,m.scott@place.com,(555) 666-7777,666 Pine Hwy,Deluxe Widget,79.99,2024-09-01
1039,Kevin Adams,kevin.a+deals@shop.biz,555-777-8888,777 Cedar Ter,Starter Pack,19.99,2024-09-15
1041,Nancy Wright,nancy.w@home.net,(555) 888-9999,888 Maple Pl,Mini Tool,14.99,2024-10-01
1043,George Lopez,g.lopez@firma.com,555-999-0000,999 Walnut Expy,Combo Pack,59.99,2024-10-15
1045,Sandra Hill,s.hill@provider.co,555-101-2020,1010 Spruce Xing,Widget Pro,49.99,2024-11-01
1047,Mark Robinson,mark.r@office.com,555-202-3030,2020 Ash Hts,Enterprise License,499.99,2024-11-15
1049,Unique Customer Two,unique2@solo.com,(555) 000-0002,2 Unique Road,Gadget X,89.99,2024-12-02
1050,Unique Customer Three,unique3@solo.com,(555) 000-0003,3 Unique Road,Super Tool,129.99,2024-12-03
1 order_id customer_name email phone address product amount date
2 1002 John Smith johnsmith@gmail.com 555-123-4567 123 Main St Apt 4 Widget Pro 49.99 2024-01-15
3 1003 Mr. John Smith john.smith+promo@gmail.com +1 555-123-4567 123 Main St. Apartment 4 Widget Pro 2024-01-15
4 1005 Sara Johnson sarah.j@example.com 555.234.5678 456 Oak Ave Gadget X 89.99 2024-02-03
5 1006 Sarah M. Johnson sarahj@example.com 5552345678 456 Oak Ave Suite 2 89.99 2024-02-05
6 1008 Mike Williams mike.w@company.org 555-345-6789 789 Pine Blvd Super Tool 129.99 2024-02-10
7 1009 Dr. Michael Williams III michael.williams@company.org +15553456789 789 Pine Blvd 129.99 2024-02-12
8 1011 Emily Davis emily.davis@mail.com (555) 456-7890 321 Elm Drive Basic Set 29.99 2024-03-01
9 1013 Bob Brown r.brown@email.net 555-567-8901 654 Cedar Ln Premium Kit 199.99 2024-03-15
10 1014 Robert J. Brown Jr. rj.brown@email.net 5555678901 654 Cedar Lane Premium Kit 2024-03-17
11 1016 Jennifer Wilson jen.wilson@web.io (555) 678-9012 987 Birch Ct Starter Pack 19.99 2024-04-01
12 1018 David Martinez d.martinez@service.com 555-789-0123 246 Maple Rd Deluxe Widget 79.99 2024-04-15
13 1020 Lisa M Anderson lisa.a@domain.com 555-890-1234 135 Walnut Way Mini Tool 14.99 2024-05-01
14 1022 Tom Taylor tom.taylor@inbox.com 555-901-2345 864 Spruce Trl Combo Pack 59.99 2024-05-15
15 1025 Karen White karen.w+newsletter@test.com 555-012-3456 579 Ash Cir Widget Pro 49.99 2024-06-01
16 1027 James Clark j.clark@biz.co 555-111-2222 111 1st St N Enterprise License 499.99 2024-06-15
17 1029 Pat Lee pat.lee@mail.com 555-222-3333 222 2nd Ave S Basic Set 29.99 2024-07-01
18 1031 Chris Hall chris.hall@email.org 555-333-4444 333 3rd Blvd E Gadget X 89.99 2024-07-15
19 1033 Amanda Young amanda.y@web.net (555) 444-5555 444 Oak Pkwy Super Tool 129.99 2024-08-01
20 1035 Dan King dan.king@corp.io 555.555.6666 555 Elm Sq Premium Kit 199.99 2024-08-15
21 1037 Michelle Scott m.scott@place.com (555) 666-7777 666 Pine Hwy Deluxe Widget 79.99 2024-09-01
22 1039 Kevin Adams kevin.a+deals@shop.biz 555-777-8888 777 Cedar Ter Starter Pack 19.99 2024-09-15
23 1041 Nancy Wright nancy.w@home.net (555) 888-9999 888 Maple Pl Mini Tool 14.99 2024-10-01
24 1043 George Lopez g.lopez@firma.com 555-999-0000 999 Walnut Expy Combo Pack 59.99 2024-10-15
25 1045 Sandra Hill s.hill@provider.co 555-101-2020 1010 Spruce Xing Widget Pro 49.99 2024-11-01
26 1047 Mark Robinson mark.r@office.com 555-202-3030 2020 Ash Hts Enterprise License 499.99 2024-11-15
27 1049 Unique Customer Two unique2@solo.com (555) 000-0002 2 Unique Road Gadget X 89.99 2024-12-02
28 1050 Unique Customer Three unique3@solo.com (555) 000-0003 3 Unique Road Super Tool 129.99 2024-12-03

0
src/__init__.py Normal file
View File

4
src/__main__.py Normal file
View File

@@ -0,0 +1,4 @@
"""Allow running as ``python -m src``."""
from src.cli import main
main()

502
src/cli.py Normal file
View File

@@ -0,0 +1,502 @@
"""CLI for the DataTools deduplicator.
Usage:
python -m src.cli input.csv # dry-run preview
python -m src.cli input.csv --apply # write deduplicated output
python -m src.cli input.csv --fuzzy name --merge # fuzzy match + merge
python -m src.cli --help # full help
"""
from __future__ import annotations
import sys
from datetime import datetime
from pathlib import Path
from typing import Optional
import typer
from loguru import logger
from rapidfuzz import process as rf_process
app = typer.Typer(
name="dedup",
help=(
"Find and remove duplicate rows in CSV and Excel files.\n\n"
"By default, runs in preview mode — shows what would change without "
"modifying anything. Add --apply to write the output.\n\n"
"Examples:\n\n"
" # Preview duplicates in a CSV file\n"
" python -m src.cli customers.csv\n\n"
" # Remove duplicates and save the result\n"
" python -m src.cli customers.csv --apply\n\n"
" # Fuzzy-match on the 'name' column with 80% threshold\n"
" python -m src.cli customers.csv --fuzzy name --threshold 80 --apply\n\n"
" # Match on specific columns only\n"
" python -m src.cli customers.csv --subset email,phone --apply\n\n"
" # Keep the most complete row and merge missing fields\n"
" python -m src.cli customers.csv --survivor most-complete --merge --apply\n"
),
add_completion=False,
no_args_is_help=True,
)
# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------
def _setup_logging(log_dir: Path) -> Path:
"""Configure loguru to write a timestamped log file. Returns the log path."""
log_dir.mkdir(parents=True, exist_ok=True)
ts = datetime.now().strftime("%Y%m%d_%H%M%S")
log_path = log_dir / f"dedup_{ts}.log"
logger.remove() # remove default stderr handler
logger.add(sys.stderr, level="WARNING", format="{message}")
logger.add(str(log_path), level="DEBUG",
format="{time:YYYY-MM-DD HH:mm:ss} | {level:<8} | {message}")
return log_path
def _suggest_column(name: str, available: list[str]) -> str:
"""Return a helpful error message when a column is not found."""
cols_str = ", ".join(available)
matches = rf_process.extract(name, available, limit=1, score_cutoff=50)
if matches:
suggestion = matches[0][0]
return (
f"Column '{name}' not found. "
f"Available columns: {cols_str}. "
f"Did you mean '{suggestion}'?"
)
return f"Column '{name}' not found. Available columns: {cols_str}."
def _validate_columns(requested: list[str], available: list[str]) -> None:
"""Raise typer.BadParameter if any requested column doesn't exist."""
for col in requested:
if col not in available:
raise typer.BadParameter(_suggest_column(col, available))
def _parse_normalize_map(raw: Optional[str]) -> dict[str, str]:
"""Parse 'col:type,col:type' into a dict."""
if not raw:
return {}
result = {}
for pair in raw.split(","):
pair = pair.strip()
if ":" not in pair:
raise typer.BadParameter(
f"Invalid normalize format: '{pair}'. "
f"Expected 'column:type' (e.g., 'email:email,phone:phone')."
)
col, ntype = pair.split(":", 1)
result[col.strip()] = ntype.strip()
return result
def _interactive_review(group, df) -> Optional[bool]:
"""Side-by-side CLI review for a match group. Returns True/False/None."""
from src.core.dedup import MatchResult
group: MatchResult
print(f"\n{'='*60}")
print(f"Match Group {group.group_id + 1} — Confidence: {group.confidence:.1f}%")
print(f"Matched on: {', '.join(group.matched_on)}")
print(f"{'='*60}")
display_cols = [c for c in df.columns if not str(c).startswith("_norm_")]
for idx in group.row_indices:
print(f"\n Row {idx + 1}:")
for col in display_cols:
val = df.iloc[idx].get(col, "")
if str(val).strip():
print(f" {col}: {val}")
while True:
choice = input("\n [y] Merge [n] Keep both [s] Skip remaining: ").strip().lower()
if choice == "y":
return True
if choice == "n":
return False
if choice == "s":
return None
print(" Please enter y, n, or s.")
# ---------------------------------------------------------------------------
# Main command
# ---------------------------------------------------------------------------
@app.command()
def dedup(
input_file: str = typer.Argument(
...,
help="Path to the CSV or Excel file to deduplicate.",
),
output: Optional[str] = typer.Option(
None, "--output", "-o",
help="Output file path. Default: {input}_deduplicated.csv",
),
apply: bool = typer.Option(
False, "--apply",
help="Write the output file. Without this flag, only a preview is shown.",
),
key: Optional[str] = typer.Option(
None, "--key", "-k",
help="Comma-separated strong-key columns (e.g., 'fb_id,ein'). Each is an independent exact-match dedup key.",
),
subset: Optional[str] = typer.Option(
None, "--subset", "-s",
help="Comma-separated columns to match on (default: auto-detect).",
),
fuzzy: Optional[str] = typer.Option(
None, "--fuzzy",
help="Comma-separated columns to fuzzy-match (others use exact match).",
),
algorithm: str = typer.Option(
"jaro_winkler", "--algorithm", "-a",
help="Fuzzy algorithm: levenshtein, jaro_winkler, or token_set_ratio.",
),
threshold: int = typer.Option(
85, "--threshold", "-t",
help="Similarity threshold 0-100 for fuzzy matching.",
),
normalize: Optional[str] = typer.Option(
None, "--normalize",
help="Column normalizers as 'col:type' pairs (e.g., 'email:email,phone:phone').",
),
survivor: str = typer.Option(
"first", "--survivor",
help="Survivor rule: first, last, most-complete, or most-recent.",
),
date_column: Optional[str] = typer.Option(
None, "--date-column",
help="Date column for most-recent survivor rule.",
),
merge: bool = typer.Option(
False, "--merge",
help="Fill missing fields in the surviving row from removed duplicates.",
),
review: bool = typer.Option(
False, "--review",
help="Interactively review each match group before merging.",
),
config: Optional[str] = typer.Option(
None, "--config",
help="Load settings from a saved JSON config file.",
),
save_config: Optional[str] = typer.Option(
None, "--save-config",
help="Save current settings to a JSON config file.",
),
sheet: Optional[str] = typer.Option(
None, "--sheet",
help="Excel sheet name or index (default: first sheet).",
),
encoding_override: Optional[str] = typer.Option(
None, "--encoding",
help="Override auto-detected file encoding.",
),
header_row: Optional[int] = typer.Option(
None, "--header-row",
help="0-based row index for the header (default: auto-detect).",
),
):
"""Find and remove duplicate rows in CSV and Excel files."""
from src.core.io import read_file, write_file, list_sheets
from src.core.dedup import (
Algorithm, ColumnMatchStrategy, MatchStrategy, SurvivorRule,
build_default_strategies, deduplicate,
)
from src.core.normalizers import NormalizerType
from src.core.config import DeduplicationConfig
# Setup
input_path = Path(input_file)
if not input_path.exists():
typer.echo(f"Error: File not found: {input_path}", err=True)
raise typer.Exit(1)
log_path = _setup_logging(Path("logs"))
# Load config if provided
cfg: Optional[DeduplicationConfig] = None
if config:
config_path = Path(config)
if not config_path.exists():
typer.echo(f"Error: Config file not found: {config_path}", err=True)
raise typer.Exit(1)
cfg = DeduplicationConfig.from_file(config_path)
logger.info("Loaded config from {}", config_path)
# Read input
typer.echo(f"Reading {input_path.name}...")
try:
sheet_arg: str | int | None = None
if sheet is not None:
try:
sheet_arg = int(sheet)
except ValueError:
sheet_arg = sheet
df = read_file(
input_path,
encoding=encoding_override,
header_row=header_row,
sheet_name=sheet_arg if sheet_arg is not None else 0,
)
if not isinstance(df, __import__("pandas").DataFrame):
# chunked reading returns generator — materialise for v1
import pandas as pd
df = pd.concat(list(df), ignore_index=True)
except Exception as e:
typer.echo(f"Error reading file: {e}", err=True)
raise typer.Exit(1)
typer.echo(f" {len(df)} rows, {len(df.columns)} columns")
available_columns = list(df.columns)
# Build strategies
strategies: Optional[list[MatchStrategy]] = None
if cfg and cfg.strategies:
strategies = cfg.to_strategies()
elif subset or fuzzy:
# Build from CLI flags
normalize_map = _parse_normalize_map(normalize)
strategies = []
fuzzy_cols = set(c.strip() for c in fuzzy.split(",")) if fuzzy else set()
if subset:
subset_cols = [c.strip() for c in subset.split(",")]
elif fuzzy_cols:
# When only --fuzzy is given, match on just those columns
subset_cols = list(fuzzy_cols)
else:
subset_cols = available_columns
_validate_columns(subset_cols, available_columns)
if fuzzy_cols:
_validate_columns(list(fuzzy_cols), available_columns)
col_strats: list[ColumnMatchStrategy] = []
for col in subset_cols:
norm = None
if col in normalize_map:
norm = NormalizerType(normalize_map[col])
if col in fuzzy_cols:
algo = Algorithm(algorithm)
thresh = float(threshold)
else:
algo = Algorithm.EXACT
thresh = 100.0
col_strats.append(ColumnMatchStrategy(
column=col, algorithm=algo, threshold=thresh, normalizer=norm,
))
strategies = [MatchStrategy(column_strategies=col_strats)]
# Apply normalizer overrides even with auto-detect
if normalize and strategies is None:
normalize_map = _parse_normalize_map(normalize)
auto_strats = build_default_strategies(df)
# Inject normalize_map into auto strategies
for strat in auto_strats:
for cs in strat.column_strategies:
if cs.column in normalize_map:
cs.normalizer = NormalizerType(normalize_map[cs.column])
strategies = auto_strats
# --key: add user-declared strong keys as standalone exact-match strategies
if key:
key_cols = [c.strip() for c in key.split(",")]
_validate_columns(key_cols, available_columns)
key_strats = [
MatchStrategy(column_strategies=[
ColumnMatchStrategy(column=col, algorithm=Algorithm.EXACT, threshold=100.0)
])
for col in key_cols
]
if strategies is None:
# Combine with auto-detect so user gets both
strategies = build_default_strategies(df) + key_strats
else:
strategies.extend(key_strats)
# Survivor rule
survivor_map = {
"first": SurvivorRule.KEEP_FIRST,
"last": SurvivorRule.KEEP_LAST,
"most-complete": SurvivorRule.KEEP_MOST_COMPLETE,
"most_complete": SurvivorRule.KEEP_MOST_COMPLETE,
"most-recent": SurvivorRule.KEEP_MOST_RECENT,
"most_recent": SurvivorRule.KEEP_MOST_RECENT,
}
if cfg:
surv_rule = cfg.to_survivor_rule()
do_merge = cfg.merge
dc = cfg.date_column
else:
surv_key = survivor.lower().replace("-", "_")
if surv_key not in {r.value for r in SurvivorRule} and surv_key not in survivor_map:
typer.echo(
f"Error: Unknown survivor rule '{survivor}'. "
f"Choose from: first, last, most-complete, most-recent.",
err=True,
)
raise typer.Exit(1)
surv_rule = survivor_map.get(survivor.lower(), SurvivorRule(surv_key))
do_merge = merge
dc = date_column
# Save config if requested
if save_config:
from src.core.config import DeduplicationConfig, StrategyConfig, ColumnStrategyConfig
save_cfg = DeduplicationConfig(
survivor_rule=surv_rule.value,
date_column=dc,
merge=do_merge,
subset_columns=[c.strip() for c in subset.split(",")] if subset else None,
fuzzy_columns=[c.strip() for c in fuzzy.split(",")] if fuzzy else None,
default_algorithm=algorithm,
default_threshold=float(threshold),
normalize_map=_parse_normalize_map(normalize),
)
if strategies:
save_cfg.strategies = [
StrategyConfig(columns=[
ColumnStrategyConfig(
column=cs.column,
algorithm=cs.algorithm.value,
threshold=cs.threshold,
normalizer=cs.normalizer.value if cs.normalizer else None,
)
for cs in s.column_strategies
])
for s in strategies
]
saved = save_cfg.to_file(save_config)
typer.echo(f"Config saved to {saved}")
# Progress bar
progress_cb = None
if len(df) > 10_000:
from tqdm import tqdm
pbar = tqdm(total=len(df) * (len(df) - 1) // 2, desc="Comparing rows",
unit="pairs", leave=False)
def _progress(current: int, total: int):
pbar.update(current - pbar.n)
if current >= total:
pbar.close()
progress_cb = _progress
# Review callback
review_cb = _interactive_review if review else None
# Run dedup
typer.echo("Finding duplicates...")
result = deduplicate(
df,
strategies=strategies,
survivor_rule=surv_rule,
date_column=dc,
merge=do_merge,
preview=not apply,
review_callback=review_cb,
progress_callback=progress_cb,
)
# Print results
_print_results(result, input_path)
# Write output files
if apply:
stem = input_path.stem
suffix = input_path.suffix
out_path = Path(output) if output else input_path.parent / f"{stem}_deduplicated.csv"
write_file(result.deduplicated_df, out_path)
typer.echo(f"\nDeduplicated file: {out_path}")
if not result.removed_df.empty:
removed_path = input_path.parent / f"{stem}_removed.csv"
write_file(result.removed_df, removed_path)
typer.echo(f"Removed rows: {removed_path}")
if result.match_groups:
groups_path = input_path.parent / f"{stem}_match_groups.csv"
_write_match_groups(result, df, groups_path)
typer.echo(f"Match groups: {groups_path}")
else:
typer.echo("\nThis was a preview. Add --apply to write the output files.")
typer.echo(f"Log: {log_path}")
# ---------------------------------------------------------------------------
# Output formatting
# ---------------------------------------------------------------------------
def _print_results(result, input_path: Path) -> None:
"""Print a human-readable summary."""
removed = result.original_row_count - len(result.deduplicated_df)
typer.echo(f"\n{''*50}")
typer.echo(f" File: {input_path.name}")
typer.echo(f" Rows in: {result.original_row_count}")
typer.echo(f" Rows out: {len(result.deduplicated_df)}")
typer.echo(f" Removed: {removed}")
typer.echo(f" Groups: {len(result.match_groups)}")
typer.echo(f"{''*50}")
if result.match_groups:
typer.echo("\nMatch groups:")
for g in result.match_groups[:20]: # cap display
rows_str = ", ".join(str(i + 1) for i in g.row_indices)
surv = g.survivor_index + 1
typer.echo(
f" Group {g.group_id + 1}: rows [{rows_str}] "
f"→ keep row {surv} "
f"(confidence: {g.confidence:.1f}%, "
f"matched on: {', '.join(g.matched_on)})"
)
if len(result.match_groups) > 20:
typer.echo(f" ... and {len(result.match_groups) - 20} more groups")
def _write_match_groups(result, original_df, path: Path) -> None:
"""Write match groups to a CSV for audit."""
import pandas as pd
from src.core.io import write_file
rows = []
for g in result.match_groups:
for idx in g.row_indices:
row_data = {"_group_id": g.group_id + 1}
row_data["_is_survivor"] = idx == g.survivor_index
row_data["_confidence"] = g.confidence
row_data["_matched_on"] = ", ".join(g.matched_on)
row_data["_original_row"] = idx + 1
# Include original data
for col in original_df.columns:
row_data[col] = original_df.iloc[idx].get(col, "")
rows.append(row_data)
groups_df = pd.DataFrame(rows)
write_file(groups_df, path)
# ---------------------------------------------------------------------------
# __main__ support
# ---------------------------------------------------------------------------
def main():
app()
if __name__ == "__main__":
main()

93
src/core/__init__.py Normal file
View File

@@ -0,0 +1,93 @@
"""DataTools deduplication engine.
Public API
----------
Core:
deduplicate(df, ...) -> DeduplicationResult
build_default_strategies(df) -> list[MatchStrategy]
Types:
Algorithm, SurvivorRule, ColumnMatchStrategy, MatchStrategy
MatchResult, DeduplicationResult
Normalizers:
get_normalizer(type) -> Callable
NormalizerType
normalize_email, normalize_phone, normalize_name,
normalize_address, normalize_string
I/O:
read_file(path, ...) -> DataFrame
write_file(df, path, ...)
list_sheets(path) -> list[str]
detect_encoding, detect_delimiter, detect_header_row
Configuration:
DeduplicationConfig.from_file(path) -> DeduplicationConfig
DeduplicationConfig.to_file(path)
"""
from .dedup import (
Algorithm,
ColumnMatchStrategy,
DeduplicationResult,
MatchResult,
MatchStrategy,
SurvivorRule,
build_default_strategies,
deduplicate,
)
from .normalizers import (
NormalizerType,
get_normalizer,
normalize_address,
normalize_email,
normalize_name,
normalize_phone,
normalize_string,
)
from .io import (
detect_delimiter,
detect_encoding,
detect_header_row,
list_sheets,
read_file,
write_file,
)
from .config import (
ColumnStrategyConfig,
DeduplicationConfig,
StrategyConfig,
)
__all__ = [
# Core
"deduplicate",
"build_default_strategies",
# Types
"Algorithm",
"SurvivorRule",
"ColumnMatchStrategy",
"MatchStrategy",
"MatchResult",
"DeduplicationResult",
# Normalizers
"NormalizerType",
"get_normalizer",
"normalize_email",
"normalize_phone",
"normalize_name",
"normalize_address",
"normalize_string",
# I/O
"read_file",
"write_file",
"list_sheets",
"detect_encoding",
"detect_delimiter",
"detect_header_row",
# Config
"DeduplicationConfig",
"StrategyConfig",
"ColumnStrategyConfig",
]

117
src/core/config.py Normal file
View File

@@ -0,0 +1,117 @@
"""Configuration profiles: save/load deduplication settings as JSON."""
from __future__ import annotations
import json
from dataclasses import dataclass, field, asdict
from pathlib import Path
from typing import Optional
from .dedup import (
Algorithm,
ColumnMatchStrategy,
MatchStrategy,
NormalizerType,
SurvivorRule,
)
@dataclass
class ColumnStrategyConfig:
"""JSON-serializable mirror of ColumnMatchStrategy."""
column: str
algorithm: str = "exact"
threshold: float = 100.0
normalizer: Optional[str] = None
@dataclass
class StrategyConfig:
"""JSON-serializable mirror of MatchStrategy."""
columns: list[ColumnStrategyConfig] = field(default_factory=list)
@dataclass
class DeduplicationConfig:
"""All deduplication settings as a flat JSON-serializable structure."""
strategies: list[StrategyConfig] = field(default_factory=list)
survivor_rule: str = "first"
date_column: Optional[str] = None
merge: bool = False
subset_columns: Optional[list[str]] = None
fuzzy_columns: Optional[list[str]] = None
default_algorithm: str = "jaro_winkler"
default_threshold: float = 85.0
normalize_map: Optional[dict[str, str]] = None # column -> normalizer type
# -----------------------------------------------------------------------
# Serialisation
# -----------------------------------------------------------------------
def to_dict(self) -> dict:
return asdict(self)
def to_file(self, path: str | Path) -> Path:
"""Save configuration to a JSON file."""
out = Path(path)
out.write_text(json.dumps(self.to_dict(), indent=2))
return out
@classmethod
def from_dict(cls, data: dict) -> DeduplicationConfig:
strategies = []
for s in data.get("strategies", []):
cols = [ColumnStrategyConfig(**c) for c in s.get("columns", [])]
strategies.append(StrategyConfig(columns=cols))
return cls(
strategies=strategies,
survivor_rule=data.get("survivor_rule", "first"),
date_column=data.get("date_column"),
merge=data.get("merge", False),
subset_columns=data.get("subset_columns"),
fuzzy_columns=data.get("fuzzy_columns"),
default_algorithm=data.get("default_algorithm", "jaro_winkler"),
default_threshold=data.get("default_threshold", 85.0),
normalize_map=data.get("normalize_map"),
)
@classmethod
def from_file(cls, path: str | Path) -> DeduplicationConfig:
"""Load configuration from a JSON file."""
data = json.loads(Path(path).read_text())
return cls.from_dict(data)
@classmethod
def default(cls) -> DeduplicationConfig:
"""Return sensible defaults (auto-detect strategies at runtime)."""
return cls()
# -----------------------------------------------------------------------
# Convert to engine objects
# -----------------------------------------------------------------------
def to_strategies(self) -> Optional[list[MatchStrategy]]:
"""Convert the config back to MatchStrategy objects.
Returns None if no explicit strategies are configured
(the engine will auto-detect).
"""
if not self.strategies:
return None
result: list[MatchStrategy] = []
for sc in self.strategies:
col_strats = []
for cc in sc.columns:
col_strats.append(ColumnMatchStrategy(
column=cc.column,
algorithm=Algorithm(cc.algorithm),
threshold=cc.threshold,
normalizer=NormalizerType(cc.normalizer) if cc.normalizer else None,
))
result.append(MatchStrategy(column_strategies=col_strats))
return result
def to_survivor_rule(self) -> SurvivorRule:
return SurvivorRule(self.survivor_rule)

568
src/core/dedup.py Normal file
View File

@@ -0,0 +1,568 @@
"""Deduplication engine: matching, survivor selection, and merge.
Core algorithm:
1. Normalise columns → shadow ``_norm_*`` columns (computed once).
2. Pairwise comparison within each strategy → candidate pairs.
3. Union-find for transitive closure (A~B, B~C ⇒ one group).
4. Multi-strategy OR: feed all pairs from all strategies into the same union-find.
5. Survivor selection per group + optional field merge.
"""
from __future__ import annotations
import re
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Optional
import pandas as pd
from loguru import logger
from rapidfuzz import fuzz as rf_fuzz
from rapidfuzz import distance as rf_distance
from .normalizers import NormalizerType, get_normalizer
# ---------------------------------------------------------------------------
# Enums & data structures
# ---------------------------------------------------------------------------
class Algorithm(str, Enum):
EXACT = "exact"
LEVENSHTEIN = "levenshtein"
JARO_WINKLER = "jaro_winkler"
TOKEN_SET_RATIO = "token_set_ratio"
class SurvivorRule(str, Enum):
KEEP_FIRST = "first"
KEEP_LAST = "last"
KEEP_MOST_COMPLETE = "most_complete"
KEEP_MOST_RECENT = "most_recent"
@dataclass
class ColumnMatchStrategy:
"""How to match on a single column."""
column: str
algorithm: Algorithm = Algorithm.EXACT
threshold: float = 100.0 # 0-100 scale
normalizer: Optional[NormalizerType] = None
@dataclass
class MatchStrategy:
"""A set of column strategies combined with AND.
Multiple ``MatchStrategy`` instances are combined with OR at the top level.
"""
column_strategies: list[ColumnMatchStrategy]
@dataclass
class MatchResult:
"""One group of duplicate rows."""
group_id: int
row_indices: list[int]
confidence: float # min confidence across pairs in the group
matched_on: list[str] # column names that contributed to the match
survivor_index: int # index of the row to keep
@dataclass
class DeduplicationResult:
"""Full result of a deduplication run."""
original_row_count: int
deduplicated_df: pd.DataFrame
removed_df: pd.DataFrame
match_groups: list[MatchResult]
log_entries: list[str] = field(default_factory=list)
is_preview: bool = True
# ---------------------------------------------------------------------------
# Union-Find
# ---------------------------------------------------------------------------
class _UnionFind:
"""Disjoint-set / union-find for transitive closure of match pairs."""
def __init__(self, n: int):
self._parent = list(range(n))
self._rank = [0] * n
def find(self, x: int) -> int:
while self._parent[x] != x:
self._parent[x] = self._parent[self._parent[x]] # path halving
x = self._parent[x]
return x
def union(self, a: int, b: int) -> None:
ra, rb = self.find(a), self.find(b)
if ra == rb:
return
if self._rank[ra] < self._rank[rb]:
ra, rb = rb, ra
self._parent[rb] = ra
if self._rank[ra] == self._rank[rb]:
self._rank[ra] += 1
def groups(self) -> dict[int, list[int]]:
"""Return {root: [members]} for all non-singleton groups."""
from collections import defaultdict
g: dict[int, list[int]] = defaultdict(list)
for i in range(len(self._parent)):
g[self.find(i)].append(i)
return {root: members for root, members in g.items() if len(members) > 1}
# ---------------------------------------------------------------------------
# Similarity computation
# ---------------------------------------------------------------------------
def _compute_similarity(val_a: str, val_b: str, algorithm: Algorithm) -> float:
"""Return similarity score on a 0-100 scale."""
if algorithm == Algorithm.EXACT:
return 100.0 if val_a == val_b else 0.0
if algorithm == Algorithm.LEVENSHTEIN:
return rf_fuzz.ratio(val_a, val_b)
if algorithm == Algorithm.JARO_WINKLER:
# rapidfuzz jaro_winkler_similarity returns 0-100
return rf_distance.JaroWinkler.similarity(val_a, val_b) * 100
if algorithm == Algorithm.TOKEN_SET_RATIO:
return rf_fuzz.token_set_ratio(val_a, val_b)
raise ValueError(f"Unknown algorithm: {algorithm}")
# ---------------------------------------------------------------------------
# Pair comparison
# ---------------------------------------------------------------------------
def _compare_pair(
row_a: pd.Series,
row_b: pd.Series,
strategy: MatchStrategy,
norm_prefix: str = "_norm_",
) -> tuple[bool, float, list[str]]:
"""Compare two rows using a single MatchStrategy (AND of column strategies).
Returns ``(is_match, confidence, matched_columns)``.
"""
min_score = 100.0
matched_cols: list[str] = []
for cs in strategy.column_strategies:
col = f"{norm_prefix}{cs.column}" if cs.normalizer else cs.column
va = str(row_a.get(col, ""))
vb = str(row_b.get(col, ""))
# Skip if both empty
if not va and not vb:
continue
# If one empty and one not — no match for this column
if not va or not vb:
return False, 0.0, []
score = _compute_similarity(va, vb, cs.algorithm)
if score < cs.threshold:
return False, 0.0, []
min_score = min(min_score, score)
matched_cols.append(cs.column)
if not matched_cols:
return False, 0.0, []
return True, min_score, matched_cols
# ---------------------------------------------------------------------------
# Match-group finding
# ---------------------------------------------------------------------------
def _find_match_groups(
df: pd.DataFrame,
strategies: list[MatchStrategy],
*,
progress_callback: Optional[Callable[[int, int], None]] = None,
) -> tuple[list[MatchResult], dict[tuple[int, int], tuple[float, list[str]]]]:
"""Pairwise comparison + union-find for transitive closure.
Returns ``(match_groups, pair_info)`` where *pair_info* maps
``(i, j)`` → ``(confidence, matched_columns)`` for logging.
"""
n = len(df)
uf = _UnionFind(n)
pair_info: dict[tuple[int, int], tuple[float, list[str]]] = {}
total_pairs = n * (n - 1) // 2
checked = 0
for i in range(n):
for j in range(i + 1, n):
for strategy in strategies:
is_match, confidence, cols = _compare_pair(
df.iloc[i], df.iloc[j], strategy
)
if is_match:
uf.union(i, j)
key = (i, j)
# Keep the highest-confidence match for this pair
if key not in pair_info or confidence > pair_info[key][0]:
pair_info[key] = (confidence, cols)
break # OR logic: one strategy match is enough
checked += 1
if progress_callback and checked % 1000 == 0:
progress_callback(checked, total_pairs)
if progress_callback:
progress_callback(total_pairs, total_pairs)
# Build MatchResult objects (survivor not yet selected)
raw_groups = uf.groups()
match_groups: list[MatchResult] = []
for gid, (root, members) in enumerate(sorted(raw_groups.items())):
# Confidence = min across all pairs in the group
group_confidence = 100.0
group_cols: set[str] = set()
for idx_a, m in enumerate(members):
for idx_b in range(idx_a + 1, len(members)):
key = (min(m, members[idx_b]), max(m, members[idx_b]))
if key in pair_info:
conf, cols = pair_info[key]
group_confidence = min(group_confidence, conf)
group_cols.update(cols)
match_groups.append(MatchResult(
group_id=gid,
row_indices=members,
confidence=round(group_confidence, 2),
matched_on=sorted(group_cols),
survivor_index=members[0], # placeholder
))
return match_groups, pair_info
# ---------------------------------------------------------------------------
# Survivor selection
# ---------------------------------------------------------------------------
def _select_survivor(
group: MatchResult,
df: pd.DataFrame,
rule: SurvivorRule,
date_column: Optional[str] = None,
) -> int:
"""Choose the survivor row index within a match group."""
indices = group.row_indices
if rule == SurvivorRule.KEEP_FIRST:
return indices[0]
if rule == SurvivorRule.KEEP_LAST:
return indices[-1]
if rule == SurvivorRule.KEEP_MOST_COMPLETE:
# Fewest empty/blank cells wins
best_idx = indices[0]
best_empty = _count_empty(df.iloc[indices[0]])
for idx in indices[1:]:
empty = _count_empty(df.iloc[idx])
if empty < best_empty:
best_empty = empty
best_idx = idx
return best_idx
if rule == SurvivorRule.KEEP_MOST_RECENT:
if not date_column or date_column not in df.columns:
logger.warning("date_column '{}' not found; falling back to keep_first", date_column)
return indices[0]
best_idx = indices[0]
best_date = _parse_date(df.iloc[indices[0]].get(date_column, ""))
for idx in indices[1:]:
d = _parse_date(df.iloc[idx].get(date_column, ""))
if d is not None and (best_date is None or d > best_date):
best_date = d
best_idx = idx
return best_idx
return indices[0]
def _count_empty(row: pd.Series) -> int:
"""Count empty/blank cells in a row, ignoring internal shadow columns."""
count = 0
for col, val in row.items():
if isinstance(col, str) and col.startswith("_norm_"):
continue
if pd.isna(val) or str(val).strip() == "":
count += 1
return count
def _parse_date(value) -> Optional[pd.Timestamp]:
try:
return pd.to_datetime(value)
except Exception:
return None
# ---------------------------------------------------------------------------
# Merge mode
# ---------------------------------------------------------------------------
def _merge_group(df: pd.DataFrame, survivor_idx: int, loser_indices: list[int]) -> pd.Series:
"""Fill missing fields in survivor from losers (ordered by position)."""
survivor = df.iloc[survivor_idx].copy()
for col in survivor.index:
if isinstance(col, str) and col.startswith("_norm_"):
continue
val = survivor[col]
if pd.isna(val) or str(val).strip() == "":
for loser_idx in loser_indices:
candidate = df.iloc[loser_idx][col]
if not pd.isna(candidate) and str(candidate).strip() != "":
survivor[col] = candidate
break
return survivor
# ---------------------------------------------------------------------------
# Auto-detect strategies
# ---------------------------------------------------------------------------
# (pattern, normalizer, algorithm, threshold, is_strong_key)
# Strong keys (email, phone) can be standalone strategies.
# Weak keys (name, address) must be combined with a strong key via AND.
_COLUMN_TYPE_PATTERNS: list[tuple[re.Pattern, NormalizerType, Algorithm, float, bool]] = [
(re.compile(r"e[-_]?mail", re.I), NormalizerType.EMAIL, Algorithm.EXACT, 100.0, True),
(re.compile(r"phone|telephone|mobile|cell", re.I), NormalizerType.PHONE, Algorithm.EXACT, 100.0, True),
(re.compile(r"^(name|full_name|customer_name|first_name|last_name|contact_name|respondent_name)$", re.I),
NormalizerType.NAME, Algorithm.JARO_WINKLER, 85.0, False),
(re.compile(r"address|street|addr", re.I), NormalizerType.ADDRESS, Algorithm.TOKEN_SET_RATIO, 80.0, False),
]
def build_default_strategies(df: pd.DataFrame) -> list[MatchStrategy]:
"""Auto-detect column types and build match strategies.
Strategy logic:
- Strong keys (email, phone): each gets its own standalone OR strategy.
- Weak keys (name, address): combined with each strong key via AND to
form additional strategies. Weak keys never stand alone (too many
false positives — "John""Jon" at 93 % Jaro-Winkler).
- If only weak keys are found (no strong keys), they're promoted to
standalone strategies as a fallback.
- If no columns match, exact match on all columns (drop_duplicates
equivalent).
"""
strong_cols: list[ColumnMatchStrategy] = []
weak_cols: list[ColumnMatchStrategy] = []
for col in df.columns:
if col.startswith("_norm_"):
continue
for pattern, norm_type, algo, threshold, is_strong in _COLUMN_TYPE_PATTERNS:
if pattern.search(col):
cs = ColumnMatchStrategy(
column=col, algorithm=algo,
threshold=threshold, normalizer=norm_type,
)
if is_strong:
strong_cols.append(cs)
else:
weak_cols.append(cs)
break
strategies: list[MatchStrategy] = []
if strong_cols:
# Each strong key is a standalone strategy (OR)
for sc in strong_cols:
strategies.append(MatchStrategy(column_strategies=[sc]))
# Each weak key is paired with each strong key (AND) for extra recall
for wc in weak_cols:
for sc in strong_cols:
strategies.append(MatchStrategy(column_strategies=[wc, sc]))
elif weak_cols:
# No strong keys — promote weak to standalone (best effort)
for wc in weak_cols:
strategies.append(MatchStrategy(column_strategies=[wc]))
if strategies:
return strategies
# Fallback: exact match on all columns (equivalent to drop_duplicates)
logger.info("No column patterns matched; using exact match on all columns")
all_cols = [
ColumnMatchStrategy(column=c, algorithm=Algorithm.EXACT, threshold=100.0)
for c in df.columns
]
return [MatchStrategy(column_strategies=all_cols)]
# ---------------------------------------------------------------------------
# Normalisation pass
# ---------------------------------------------------------------------------
def _apply_normalizations(df: pd.DataFrame, strategies: list[MatchStrategy]) -> pd.DataFrame:
"""Add ``_norm_*`` shadow columns for every column that has a normalizer."""
df = df.copy()
seen: set[str] = set()
for strategy in strategies:
for cs in strategy.column_strategies:
if cs.normalizer and cs.column not in seen and cs.column in df.columns:
seen.add(cs.column)
norm_fn = get_normalizer(cs.normalizer)
norm_col = f"_norm_{cs.column}"
df[norm_col] = df[cs.column].apply(
lambda v, fn=norm_fn: fn(str(v)) if pd.notna(v) and str(v).strip() else ""
)
return df
# ---------------------------------------------------------------------------
# Main entry point
# ---------------------------------------------------------------------------
def deduplicate(
df: pd.DataFrame,
*,
strategies: Optional[list[MatchStrategy]] = None,
survivor_rule: SurvivorRule = SurvivorRule.KEEP_FIRST,
date_column: Optional[str] = None,
merge: bool = False,
preview: bool = True,
review_callback: Optional[Callable] = None,
progress_callback: Optional[Callable[[int, int], None]] = None,
) -> DeduplicationResult:
"""Run the full deduplication pipeline.
Parameters
----------
df : input DataFrame
strategies : matching strategies (auto-detected if None)
survivor_rule : which row to keep per group
date_column : used with ``KEEP_MOST_RECENT``
merge : fill missing fields in survivor from losers
preview : if True, result is informational only (no writes)
review_callback : ``(group: MatchResult, df: DataFrame) -> bool|None``
Called for each match group. Return True to accept, False to reject,
None to skip (keep both rows). Used for interactive review.
progress_callback : ``(current: int, total: int) -> None``
Called periodically during pairwise comparison.
Returns a ``DeduplicationResult``.
"""
log_entries: list[str] = []
original_count = len(df)
if strategies is None:
strategies = build_default_strategies(df)
log_entries.append(f"Auto-detected {len(strategies)} match strategies")
# Log strategies
for i, s in enumerate(strategies):
cols_desc = ", ".join(
f"{cs.column}({cs.algorithm.value}@{cs.threshold})"
for cs in s.column_strategies
)
log_entries.append(f"Strategy {i}: {cols_desc}")
logger.info("Strategy {}: {}", i, cols_desc)
# Normalise
df_work = _apply_normalizations(df, strategies)
# Find matches
match_groups, pair_info = _find_match_groups(
df_work, strategies, progress_callback=progress_callback
)
log_entries.append(f"Found {len(match_groups)} duplicate groups")
logger.info("Found {} duplicate groups from {} rows", len(match_groups), original_count)
# Interactive review
if review_callback and match_groups:
reviewed_groups: list[MatchResult] = []
for group in match_groups:
decision = review_callback(group, df_work)
if decision is True:
reviewed_groups.append(group)
log_entries.append(f"Group {group.group_id}: accepted by reviewer")
elif decision is False:
log_entries.append(f"Group {group.group_id}: rejected by reviewer")
else:
log_entries.append(f"Group {group.group_id}: skipped by reviewer")
match_groups = reviewed_groups
# Survivor selection
for group in match_groups:
group.survivor_index = _select_survivor(group, df_work, survivor_rule, date_column)
log_entries.append(
f"Group {group.group_id}: survivor=row {group.survivor_index} "
f"(rule={survivor_rule.value}, confidence={group.confidence}%)"
)
# Build result dataframes
remove_indices: set[int] = set()
merged_rows: dict[int, pd.Series] = {}
for group in match_groups:
survivor_idx = group.survivor_index
losers = [i for i in group.row_indices if i != survivor_idx]
remove_indices.update(losers)
if merge and losers:
merged = _merge_group(df_work, survivor_idx, losers)
merged_rows[survivor_idx] = merged
# Log merged fields
original = df_work.iloc[survivor_idx]
for col in original.index:
if isinstance(col, str) and col.startswith("_norm_"):
continue
orig_val = str(original[col]).strip()
new_val = str(merged[col]).strip()
if orig_val != new_val and not orig_val:
log_entries.append(
f"Group {group.group_id}: merged '{col}' "
f"into survivor from losers: '{new_val}'"
)
# Build output DataFrames
keep_indices = [i for i in range(len(df_work)) if i not in remove_indices]
if merged_rows:
rows = []
for i in keep_indices:
if i in merged_rows:
rows.append(merged_rows[i])
else:
rows.append(df_work.iloc[i])
deduplicated_df = pd.DataFrame(rows)
else:
deduplicated_df = df_work.iloc[keep_indices].copy()
removed_df = df_work.iloc[sorted(remove_indices)].copy() if remove_indices else pd.DataFrame()
# Drop shadow columns from output
norm_cols = [c for c in deduplicated_df.columns if str(c).startswith("_norm_")]
deduplicated_df = deduplicated_df.drop(columns=norm_cols, errors="ignore")
if not removed_df.empty:
removed_df = removed_df.drop(columns=norm_cols, errors="ignore")
# Reset index
deduplicated_df = deduplicated_df.reset_index(drop=True)
if not removed_df.empty:
removed_df = removed_df.reset_index(drop=True)
removed_count = original_count - len(deduplicated_df)
log_entries.append(f"Result: {original_count}{len(deduplicated_df)} rows ({removed_count} removed)")
return DeduplicationResult(
original_row_count=original_count,
deduplicated_df=deduplicated_df,
removed_df=removed_df,
match_groups=match_groups,
log_entries=log_entries,
is_preview=preview,
)

247
src/core/io.py Normal file
View File

@@ -0,0 +1,247 @@
"""File I/O: encoding/delimiter detection, CSV/Excel reading, output writing."""
from __future__ import annotations
import csv
import io
from pathlib import Path
from typing import Generator, Optional
import pandas as pd
from charset_normalizer import from_bytes
from loguru import logger
# ---------------------------------------------------------------------------
# Encoding detection
# ---------------------------------------------------------------------------
def detect_encoding(path: Path, sample_bytes: int = 65_536) -> str:
"""Detect file encoding by reading the first *sample_bytes*.
Returns the best-guess encoding name (e.g. ``utf-8``, ``windows-1252``).
Falls back to ``utf-8`` when detection is inconclusive.
"""
raw = Path(path).read_bytes()[:sample_bytes]
if not raw:
return "utf-8"
# Check BOM first
if raw[:3] == b"\xef\xbb\xbf":
return "utf-8-sig"
if raw[:2] in (b"\xff\xfe", b"\xfe\xff"):
return "utf-16"
result = from_bytes(raw).best()
if result is None:
return "utf-8"
enc = result.encoding.lower()
# Normalise common aliases
if enc in ("ascii", "us-ascii"):
enc = "utf-8"
return enc
# ---------------------------------------------------------------------------
# Delimiter detection
# ---------------------------------------------------------------------------
_COMMON_DELIMITERS = [",", "\t", ";", "|"]
def detect_delimiter(path: Path, encoding: str = "utf-8") -> str:
"""Sniff the delimiter from the first 20 lines of a text file.
Falls back to comma if csv.Sniffer cannot decide.
"""
raw_path = Path(path)
lines: list[str] = []
with raw_path.open("r", encoding=encoding, errors="replace") as fh:
for _ in range(20):
line = fh.readline()
if not line:
break
lines.append(line)
if not lines:
return ","
sample = "".join(lines)
try:
dialect = csv.Sniffer().sniff(sample, delimiters="".join(_COMMON_DELIMITERS))
return dialect.delimiter
except csv.Error:
return ","
# ---------------------------------------------------------------------------
# Header-row detection
# ---------------------------------------------------------------------------
def detect_header_row(path: Path, encoding: str = "utf-8", delimiter: str = ",",
max_scan: int = 20) -> int:
"""Return the 0-based index of the likely header row.
Heuristic: the first row where *every* cell looks like a column name
(non-numeric, non-empty string). Falls back to 0.
"""
raw_path = Path(path)
with raw_path.open("r", encoding=encoding, errors="replace") as fh:
reader = csv.reader(fh, delimiter=delimiter)
for idx, row in enumerate(reader):
if idx >= max_scan:
break
if not row:
continue
# All cells must be non-empty, non-numeric strings
if all(_looks_like_header(cell) for cell in row if cell.strip()):
return idx
return 0
def _looks_like_header(value: str) -> bool:
"""True if *value* looks like a column header, not a data value."""
v = value.strip()
if not v:
return False
# Pure numbers are not headers
try:
float(v.replace(",", ""))
return False
except ValueError:
pass
return True
# ---------------------------------------------------------------------------
# Excel helpers
# ---------------------------------------------------------------------------
def list_sheets(path: Path) -> list[str]:
"""Return sheet names from an Excel workbook."""
xl = pd.ExcelFile(path, engine="openpyxl")
return xl.sheet_names
# ---------------------------------------------------------------------------
# Reading
# ---------------------------------------------------------------------------
def read_file(
path: str | Path,
*,
encoding: Optional[str] = None,
delimiter: Optional[str] = None,
header_row: Optional[int] = None,
sheet_name: Optional[str | int] = 0,
chunk_size: Optional[int] = None,
) -> pd.DataFrame | Generator[pd.DataFrame, None, None]:
"""Read a CSV, TSV, or Excel file into a DataFrame.
Parameters
----------
path : file path
encoding : override detected encoding (CSV only)
delimiter : override detected delimiter (CSV only)
header_row : 0-based row index for the header; auto-detected if *None*
sheet_name : Excel sheet (name or 0-based index). Ignored for CSV.
chunk_size : if set, return a generator of DataFrames (CSV only).
Returns a DataFrame (or generator when *chunk_size* is set).
"""
filepath = Path(path)
if not filepath.exists():
raise FileNotFoundError(f"File not found: {filepath}")
suffix = filepath.suffix.lower()
if suffix in (".xlsx", ".xls"):
return _read_excel(filepath, header_row=header_row, sheet_name=sheet_name)
else:
return _read_csv(
filepath,
encoding=encoding,
delimiter=delimiter,
header_row=header_row,
chunk_size=chunk_size,
)
def _read_csv(
path: Path,
*,
encoding: Optional[str] = None,
delimiter: Optional[str] = None,
header_row: Optional[int] = None,
chunk_size: Optional[int] = None,
) -> pd.DataFrame | Generator[pd.DataFrame, None, None]:
enc = encoding or detect_encoding(path)
delim = delimiter or detect_delimiter(path, enc)
hdr = header_row if header_row is not None else detect_header_row(path, enc, delim)
logger.debug("Reading CSV {} (encoding={}, delimiter={!r}, header_row={})",
path.name, enc, delim, hdr)
kwargs: dict = dict(
filepath_or_buffer=path,
encoding=enc,
delimiter=delim,
header=hdr,
dtype=str,
keep_default_na=False,
on_bad_lines="warn",
)
if chunk_size:
return pd.read_csv(**kwargs, chunksize=chunk_size)
return pd.read_csv(**kwargs)
def _read_excel(
path: Path,
*,
header_row: Optional[int] = None,
sheet_name: Optional[str | int] = 0,
) -> pd.DataFrame:
hdr = header_row if header_row is not None else 0
logger.debug("Reading Excel {} (sheet={}, header_row={})", path.name, sheet_name, hdr)
return pd.read_excel(
path,
sheet_name=sheet_name,
header=hdr,
dtype=str,
keep_default_na=False,
engine="openpyxl",
)
# ---------------------------------------------------------------------------
# Writing
# ---------------------------------------------------------------------------
def write_file(
df: pd.DataFrame,
path: str | Path,
*,
file_format: Optional[str] = None,
encoding: str = "utf-8-sig",
) -> Path:
"""Write a DataFrame to CSV or Excel.
Parameters
----------
df : DataFrame to write
path : output file path
file_format : ``"csv"`` or ``"xlsx"``; auto-detected from *path* suffix if *None*
encoding : output encoding (default ``utf-8-sig`` for Windows Excel compat)
Returns the resolved output Path.
"""
out = Path(path)
fmt = file_format or out.suffix.lstrip(".").lower()
if fmt in ("xlsx", "xls"):
df.to_excel(out, index=False, engine="openpyxl")
else:
df.to_csv(out, index=False, encoding=encoding)
logger.info("Wrote {} rows to {}", len(df), out)
return out

224
src/core/normalizers.py Normal file
View File

@@ -0,0 +1,224 @@
"""Per-column normalization functions for deduplication matching.
Every normalizer is ``str -> str``, handles None/empty gracefully, and is
idempotent (applying it twice yields the same result as once).
"""
from __future__ import annotations
import re
from enum import Enum
from typing import Callable, Optional
import phonenumbers
# ---------------------------------------------------------------------------
# Types
# ---------------------------------------------------------------------------
class NormalizerType(str, Enum):
EMAIL = "email"
PHONE = "phone"
NAME = "name"
ADDRESS = "address"
STRING = "string"
# ---------------------------------------------------------------------------
# String normalizer (base)
# ---------------------------------------------------------------------------
def normalize_string(value: Optional[str]) -> str:
"""Trim, collapse internal whitespace, case-fold."""
if not value or not isinstance(value, str):
return ""
return re.sub(r"\s+", " ", value.strip()).casefold()
# ---------------------------------------------------------------------------
# Email normalizer
# ---------------------------------------------------------------------------
_GMAIL_DOMAINS = {"gmail.com", "googlemail.com"}
def normalize_email(value: Optional[str]) -> str:
"""Lowercase, strip whitespace, strip Gmail dots, strip +tag suffixes."""
if not value or not isinstance(value, str):
return ""
email = value.strip().lower()
if "@" not in email:
return email
local, domain = email.rsplit("@", 1)
# Strip +tag suffix
if "+" in local:
local = local.split("+", 1)[0]
# Strip dots for Gmail addresses
if domain in _GMAIL_DOMAINS:
local = local.replace(".", "")
return f"{local}@{domain}"
# ---------------------------------------------------------------------------
# Phone normalizer
# ---------------------------------------------------------------------------
def normalize_phone(value: Optional[str], default_region: str = "US") -> str:
"""Parse with phonenumbers lib, return E.164. Fallback: digits-only."""
if not value or not isinstance(value, str):
return ""
stripped = value.strip()
if not stripped:
return ""
try:
parsed = phonenumbers.parse(stripped, default_region)
if phonenumbers.is_possible_number(parsed):
return phonenumbers.format_number(parsed, phonenumbers.PhoneNumberFormat.E164)
except phonenumbers.NumberParseException:
pass
# Fallback: digits only
digits = re.sub(r"\D", "", stripped)
return digits
# ---------------------------------------------------------------------------
# Name normalizer
# ---------------------------------------------------------------------------
_TITLE_PREFIXES = {
"mr", "mrs", "ms", "miss", "dr", "prof", "professor",
"sir", "madam", "rev", "reverend", "hon", "honorable",
}
_NAME_SUFFIXES = {
"jr", "sr", "ii", "iii", "iv", "v",
"phd", "md", "esq", "dds", "rn",
}
def normalize_name(value: Optional[str]) -> str:
"""Strip titles/suffixes, collapse whitespace, case-fold."""
if not value or not isinstance(value, str):
return ""
name = value.strip()
if not name:
return ""
# Case-fold first for matching
name = name.casefold()
# Remove periods and commas that are part of titles/suffixes
name = name.replace(".", " ").replace(",", " ")
parts = name.split()
# Strip leading titles
while parts and parts[0].rstrip(".") in _TITLE_PREFIXES:
parts.pop(0)
# Strip trailing suffixes
while parts and parts[-1].rstrip(".") in _NAME_SUFFIXES:
parts.pop()
return " ".join(parts)
# ---------------------------------------------------------------------------
# Address normalizer
# ---------------------------------------------------------------------------
_USPS_ABBREVIATIONS: dict[str, str] = {
"street": "st",
"avenue": "ave",
"boulevard": "blvd",
"drive": "dr",
"lane": "ln",
"road": "rd",
"court": "ct",
"place": "pl",
"circle": "cir",
"trail": "trl",
"way": "way",
"terrace": "ter",
"parkway": "pkwy",
"highway": "hwy",
"expressway": "expy",
"freeway": "fwy",
"square": "sq",
"loop": "loop",
"alley": "aly",
"crossing": "xing",
"point": "pt",
"north": "n",
"south": "s",
"east": "e",
"west": "w",
"northeast": "ne",
"northwest": "nw",
"southeast": "se",
"southwest": "sw",
"apartment": "apt",
"suite": "ste",
"building": "bldg",
"floor": "fl",
"room": "rm",
"unit": "unit",
"number": "#",
"saint": "st",
"fort": "ft",
"mount": "mt",
"heights": "hts",
"springs": "spgs",
}
def normalize_address(value: Optional[str]) -> str:
"""USPS abbreviation normalization, collapse whitespace, case-fold."""
if not value or not isinstance(value, str):
return ""
addr = value.strip()
if not addr:
return ""
# Case-fold and clean punctuation (keep #)
addr = addr.casefold()
addr = addr.replace(".", " ").replace(",", " ")
parts = addr.split()
normalized_parts = []
for part in parts:
normalized_parts.append(_USPS_ABBREVIATIONS.get(part, part))
return " ".join(normalized_parts)
# ---------------------------------------------------------------------------
# Registry
# ---------------------------------------------------------------------------
_NORMALIZER_MAP: dict[NormalizerType, Callable[[str], str]] = {
NormalizerType.EMAIL: normalize_email,
NormalizerType.PHONE: normalize_phone,
NormalizerType.NAME: normalize_name,
NormalizerType.ADDRESS: normalize_address,
NormalizerType.STRING: normalize_string,
}
def get_normalizer(normalizer_type: NormalizerType | str) -> Callable[[str], str]:
"""Return the normalizer function for the given type.
Accepts both ``NormalizerType`` enum values and plain strings.
"""
if isinstance(normalizer_type, str):
normalizer_type = NormalizerType(normalizer_type.lower())
func = _NORMALIZER_MAP.get(normalizer_type)
if func is None:
raise ValueError(f"Unknown normalizer type: {normalizer_type}")
return func

1
src/gui/__init__.py Normal file
View File

@@ -0,0 +1 @@
"""Streamlit GUI for the DataTools Deduplicator."""

8
src/gui/__main__.py Normal file
View File

@@ -0,0 +1,8 @@
"""Allow running as ``python -m src.gui``."""
import subprocess
import sys
from pathlib import Path
app_path = Path(__file__).parent / "app.py"
subprocess.run([sys.executable, "-m", "streamlit", "run", str(app_path)])

287
src/gui/app.py Normal file
View File

@@ -0,0 +1,287 @@
"""DataTools Deduplicator — Streamlit GUI.
Launch:
streamlit run src/gui/app.py
"""
from __future__ import annotations
import io
import sys
from pathlib import Path
import pandas as pd
import streamlit as st
# Ensure project root is on sys.path so `src.core` imports work
_project_root = Path(__file__).resolve().parent.parent.parent
if str(_project_root) not in sys.path:
sys.path.insert(0, str(_project_root))
from src.core.dedup import deduplicate, build_default_strategies, DeduplicationResult
from src.core.io import read_file, list_sheets
from src.core.config import DeduplicationConfig
from src.gui.components import config_panel, match_group_card, results_summary
# ---------------------------------------------------------------------------
# Page config
# ---------------------------------------------------------------------------
st.set_page_config(
page_title="DataTools Deduplicator",
page_icon="🔍",
layout="wide",
)
# ---------------------------------------------------------------------------
# Session state defaults
# ---------------------------------------------------------------------------
_DEFAULTS = {
"df": None,
"result": None,
"review_decisions": {},
"config": None,
"file_name": "",
"sheet_names": [],
}
for key, default in _DEFAULTS.items():
if key not in st.session_state:
st.session_state[key] = default
# ---------------------------------------------------------------------------
# Header
# ---------------------------------------------------------------------------
st.title("DataTools Deduplicator")
st.caption("Find and remove duplicate rows in CSV and Excel files.")
# ---------------------------------------------------------------------------
# File upload
# ---------------------------------------------------------------------------
uploaded = st.file_uploader(
"Upload CSV or Excel file",
type=["csv", "tsv", "xlsx", "xls"],
help="Supports CSV, TSV, and Excel files. Encoding and delimiters are auto-detected.",
)
if uploaded is not None:
# Detect if file changed
if uploaded.name != st.session_state["file_name"]:
st.session_state["file_name"] = uploaded.name
st.session_state["result"] = None
st.session_state["review_decisions"] = {}
# Read the file
try:
# Write to a temp file for read_file() which needs a path
import tempfile
suffix = Path(uploaded.name).suffix
with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
tmp.write(uploaded.getvalue())
tmp_path = Path(tmp.name)
# Check for Excel sheets
if suffix.lower() in (".xlsx", ".xls"):
st.session_state["sheet_names"] = list_sheets(tmp_path)
else:
st.session_state["sheet_names"] = []
df = read_file(tmp_path)
if not isinstance(df, pd.DataFrame):
df = pd.concat(list(df), ignore_index=True)
st.session_state["df"] = df
# Clean up temp file
tmp_path.unlink(missing_ok=True)
except Exception as e:
st.error(f"Failed to read file: {e}")
st.session_state["df"] = None
df = st.session_state["df"]
if df is not None:
# Sheet selector for Excel files
if st.session_state["sheet_names"] and len(st.session_state["sheet_names"]) > 1:
sheet = st.selectbox(
"Select sheet",
st.session_state["sheet_names"],
)
if sheet != st.session_state.get("_current_sheet"):
st.session_state["_current_sheet"] = sheet
suffix = Path(uploaded.name).suffix
import tempfile
with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
tmp.write(uploaded.getvalue())
tmp_path = Path(tmp.name)
df = read_file(tmp_path, sheet_name=sheet)
if not isinstance(df, pd.DataFrame):
df = pd.concat(list(df), ignore_index=True)
st.session_state["df"] = df
st.session_state["result"] = None
st.session_state["review_decisions"] = {}
tmp_path.unlink(missing_ok=True)
# Preview
st.subheader(f"Preview: {uploaded.name}")
st.caption(f"{len(df)} rows, {len(df.columns)} columns")
st.dataframe(df.head(10), use_container_width=True)
# Advanced options
settings = config_panel(df)
# Apply loaded config if present
loaded_cfg = st.session_state.get("loaded_config")
if loaded_cfg is not None:
settings["strategies"] = loaded_cfg.to_strategies()
settings["survivor_rule"] = loaded_cfg.to_survivor_rule()
settings["date_column"] = loaded_cfg.date_column
settings["merge"] = loaded_cfg.merge
# Clear so it doesn't override on every rerun
del st.session_state["loaded_config"]
# ---------------------------------------------------------------------------
# Find Duplicates button
# ---------------------------------------------------------------------------
st.divider()
if st.button("Find Duplicates", type="primary", use_container_width=True):
progress_bar = st.progress(0, text="Comparing rows...")
def _gui_progress(current: int, total: int) -> None:
if total > 0:
pct = min(current / total, 1.0)
progress_bar.progress(pct, text=f"Comparing rows... {current:,}/{total:,}")
with st.spinner("Running deduplication..."):
result = deduplicate(
df,
strategies=settings["strategies"],
survivor_rule=settings["survivor_rule"],
date_column=settings["date_column"],
merge=settings["merge"],
preview=False,
progress_callback=_gui_progress,
)
progress_bar.empty()
st.session_state["result"] = result
st.session_state["review_decisions"] = {}
# ---------------------------------------------------------------------------
# Results
# ---------------------------------------------------------------------------
result: DeduplicationResult | None = st.session_state["result"]
if result is not None:
st.divider()
st.subheader("Results")
# Summary + download buttons
results_summary(result, df)
# Match group review
if result.match_groups:
st.divider()
st.subheader("Match Groups")
# Batch actions
action_left, action_mid, action_right = st.columns(3)
with action_left:
if st.button("Accept All"):
for g in result.match_groups:
st.session_state["review_decisions"][g.group_id] = True
st.rerun()
with action_mid:
if st.button("Reject All"):
for g in result.match_groups:
st.session_state["review_decisions"][g.group_id] = False
st.rerun()
with action_right:
if st.button("Clear Decisions"):
st.session_state["review_decisions"] = {}
st.rerun()
# Individual group cards
decisions = st.session_state["review_decisions"]
for i, group in enumerate(result.match_groups):
decision = match_group_card(group, df, group_num=i + 1)
if decision is not None:
decisions[group.group_id] = decision
st.session_state["review_decisions"] = decisions
st.rerun()
# Show decision summary
if decisions:
st.divider()
accepted = sum(1 for v in decisions.values() if v is True)
rejected = sum(1 for v in decisions.values() if v is False)
pending = len(result.match_groups) - len(decisions)
st.caption(
f"Decisions: {accepted} merged, {rejected} kept both, "
f"{pending} pending"
)
# Re-run dedup with review decisions applied
if st.button(
"Apply Review Decisions & Download",
type="primary",
use_container_width=True,
):
def _review_callback(group, _df):
gid = group.group_id
if gid in decisions:
return decisions[gid]
return True # default: accept
reviewed_result = deduplicate(
df,
strategies=settings["strategies"],
survivor_rule=settings["survivor_rule"],
date_column=settings["date_column"],
merge=settings["merge"],
preview=False,
review_callback=_review_callback,
)
# Update result and show downloads
st.session_state["result"] = reviewed_result
csv_bytes = reviewed_result.deduplicated_df.to_csv(
index=False
).encode("utf-8-sig")
st.download_button(
"Download Reviewed & Deduplicated CSV",
data=csv_bytes,
file_name="deduplicated_reviewed.csv",
mime="text/csv",
key="reviewed_download",
)
# Log entries
if result.log_entries:
with st.expander("Processing Log"):
st.code("\n".join(result.log_entries))
else:
# No file uploaded — show placeholder
st.info("Upload a CSV or Excel file to get started.")
# ---------------------------------------------------------------------------
# Footer
# ---------------------------------------------------------------------------
st.divider()
st.caption(
"Runs locally. Your data never leaves this computer. "
"| DataTools Deduplicator v1.0"
)

413
src/gui/components.py Normal file
View File

@@ -0,0 +1,413 @@
"""Reusable Streamlit widgets for the deduplicator GUI."""
from __future__ import annotations
import io
from typing import Optional
import pandas as pd
import streamlit as st
from src.core.dedup import (
Algorithm,
ColumnMatchStrategy,
DeduplicationResult,
MatchResult,
MatchStrategy,
SurvivorRule,
)
from src.core.config import (
ColumnStrategyConfig,
DeduplicationConfig,
StrategyConfig,
)
from src.core.normalizers import NormalizerType
# ---------------------------------------------------------------------------
# Config panel (advanced options)
# ---------------------------------------------------------------------------
def config_panel(df: pd.DataFrame) -> dict:
"""Render the Advanced Options expander. Returns a settings dict.
Keys returned:
strategies: list[MatchStrategy] | None
survivor_rule: SurvivorRule
date_column: str | None
merge: bool
"""
columns = list(df.columns)
with st.expander("Advanced Options"):
col_left, col_right = st.columns(2)
with col_left:
subset_cols = st.multiselect(
"Match on columns",
columns,
default=[],
help="Leave empty to auto-detect based on column names.",
)
key_cols = st.multiselect(
"Strong keys",
columns,
default=[],
help="Columns that uniquely identify records (e.g., EIN, SKU). Each is an independent exact-match strategy.",
)
fuzzy_cols = st.multiselect(
"Fuzzy columns",
columns,
default=[],
help="Columns to fuzzy-match. Others use exact matching.",
)
with col_right:
algorithm = st.selectbox(
"Fuzzy algorithm",
["jaro_winkler", "levenshtein", "token_set_ratio"],
index=0,
help="jaro_winkler: best for names. levenshtein: best for typos. token_set_ratio: best for addresses.",
)
threshold = st.slider(
"Similarity threshold",
min_value=50,
max_value=100,
value=85,
help="Lower = more matches but more false positives.",
)
survivor = st.selectbox(
"Survivor rule",
["first", "last", "most-complete", "most-recent"],
index=0,
help="Which row to keep when duplicates are found.",
)
# Second row of options
col_a, col_b = st.columns(2)
with col_a:
normalize_options = {c: "auto" for c in columns}
normalizer_types = ["auto", "email", "phone", "name", "address", "string", "none"]
normalize_map: dict[str, str] = {}
if fuzzy_cols or subset_cols:
target_cols = fuzzy_cols or subset_cols
st.markdown("**Per-column normalizers**")
for col_name in target_cols:
norm = st.selectbox(
f"Normalizer for '{col_name}'",
normalizer_types,
index=0,
key=f"norm_{col_name}",
)
if norm not in ("auto", "none"):
normalize_map[col_name] = norm
with col_b:
merge = st.checkbox(
"Merge mode",
value=False,
help="Fill missing fields in the surviving row from removed duplicates.",
)
date_column: Optional[str] = None
if survivor == "most-recent":
date_column = st.selectbox(
"Date column",
columns,
help="Required for most-recent survivor rule.",
)
# Config save/load
st.divider()
cfg_left, cfg_right = st.columns(2)
with cfg_left:
config_file = st.file_uploader(
"Load config profile",
type=["json"],
help="Load previously saved settings.",
key="config_upload",
)
if config_file is not None:
import json
try:
data = json.loads(config_file.read())
loaded = DeduplicationConfig.from_dict(data)
st.session_state["loaded_config"] = loaded
st.success("Config loaded.")
except Exception as e:
st.error(f"Failed to load config: {e}")
with cfg_right:
if st.button("Save current settings"):
cfg = _build_config(
subset_cols, key_cols, fuzzy_cols,
algorithm, threshold, normalize_map,
survivor, date_column, merge,
)
cfg_json = cfg.to_dict()
import json
st.download_button(
"Download config JSON",
data=json.dumps(cfg_json, indent=2),
file_name="dedup_config.json",
mime="application/json",
)
# Build strategies from selections
strategies = _build_strategies(
subset_cols, key_cols, fuzzy_cols,
algorithm, threshold, normalize_map,
)
# Survivor rule mapping
survivor_map = {
"first": SurvivorRule.KEEP_FIRST,
"last": SurvivorRule.KEEP_LAST,
"most-complete": SurvivorRule.KEEP_MOST_COMPLETE,
"most-recent": SurvivorRule.KEEP_MOST_RECENT,
}
return {
"strategies": strategies,
"survivor_rule": survivor_map[survivor],
"date_column": date_column,
"merge": merge,
}
def _build_strategies(
subset_cols: list[str],
key_cols: list[str],
fuzzy_cols: list[str],
algorithm: str,
threshold: int,
normalize_map: dict[str, str],
) -> Optional[list[MatchStrategy]]:
"""Build MatchStrategy list from GUI selections. Returns None for auto-detect."""
strategies: list[MatchStrategy] = []
# If user selected columns explicitly, build from those
if subset_cols or fuzzy_cols:
target_cols = subset_cols if subset_cols else fuzzy_cols
fuzzy_set = set(fuzzy_cols)
col_strats: list[ColumnMatchStrategy] = []
for col in target_cols:
norm = None
if col in normalize_map:
norm = NormalizerType(normalize_map[col])
if col in fuzzy_set:
algo = Algorithm(algorithm)
thresh = float(threshold)
else:
algo = Algorithm.EXACT
thresh = 100.0
col_strats.append(ColumnMatchStrategy(
column=col, algorithm=algo, threshold=thresh, normalizer=norm,
))
strategies.append(MatchStrategy(column_strategies=col_strats))
# Add strong key strategies
if key_cols:
for col in key_cols:
strategies.append(MatchStrategy(column_strategies=[
ColumnMatchStrategy(column=col, algorithm=Algorithm.EXACT, threshold=100.0)
]))
return strategies if strategies else None
def _build_config(
subset_cols, key_cols, fuzzy_cols,
algorithm, threshold, normalize_map,
survivor, date_column, merge,
) -> DeduplicationConfig:
"""Build a DeduplicationConfig from GUI state."""
cfg = DeduplicationConfig(
survivor_rule=survivor.replace("-", "_"),
date_column=date_column,
merge=merge,
subset_columns=subset_cols or None,
fuzzy_columns=fuzzy_cols or None,
default_algorithm=algorithm,
default_threshold=float(threshold),
normalize_map=normalize_map or None,
)
strategies = _build_strategies(
subset_cols, key_cols, fuzzy_cols,
algorithm, threshold, normalize_map,
)
if strategies:
cfg.strategies = [
StrategyConfig(columns=[
ColumnStrategyConfig(
column=cs.column,
algorithm=cs.algorithm.value,
threshold=cs.threshold,
normalizer=cs.normalizer.value if cs.normalizer else None,
)
for cs in s.column_strategies
])
for s in strategies
]
return cfg
# ---------------------------------------------------------------------------
# Match group review card
# ---------------------------------------------------------------------------
def match_group_card(
group: MatchResult,
df: pd.DataFrame,
group_num: int,
) -> Optional[bool]:
"""Render an expandable match group card with side-by-side diff.
Returns:
True — user clicked Merge (accept match)
False — user clicked Keep Both (reject match)
None — no decision yet
"""
confidence = group.confidence
auto_expand = confidence < 95.0
matched_on = ", ".join(group.matched_on)
n_rows = len(group.row_indices)
label = (
f"Group {group_num}: {n_rows} rows "
f"(confidence: {confidence:.0f}%) "
f"[{matched_on}]"
)
with st.expander(label, expanded=auto_expand):
# Build comparison DataFrame
display_cols = [c for c in df.columns if not str(c).startswith("_norm_")]
rows_data = []
for idx in group.row_indices:
row = {"_row": idx + 1}
for col in display_cols:
row[col] = df.iloc[idx].get(col, "")
rows_data.append(row)
compare_df = pd.DataFrame(rows_data)
compare_df = compare_df.set_index("_row")
# Highlight differences
def _highlight_diffs(s: pd.Series) -> list[str]:
"""Highlight cells that differ from the first row."""
styles = []
first_val = str(s.iloc[0]).strip() if len(s) > 0 else ""
for val in s:
val_str = str(val).strip()
if val_str != first_val and val_str and first_val:
styles.append("background-color: rgba(245, 166, 35, 0.2)")
elif not val_str and first_val:
styles.append("background-color: rgba(240, 82, 82, 0.1)")
else:
styles.append("")
return styles
styled = compare_df.style.apply(_highlight_diffs, axis=0)
st.dataframe(styled, use_container_width=True)
# Action buttons
btn_left, btn_mid, btn_right = st.columns(3)
merge_key = f"merge_{group.group_id}"
keep_key = f"keep_{group.group_id}"
with btn_left:
if st.button("Merge", key=merge_key, type="primary"):
return True
with btn_mid:
if st.button("Keep Both", key=keep_key):
return False
# Check session state for previous decisions
decisions = st.session_state.get("review_decisions", {})
if group.group_id in decisions:
decision = decisions[group.group_id]
if decision is True:
st.success("Decision: Merge")
elif decision is False:
st.info("Decision: Keep Both")
return None
# ---------------------------------------------------------------------------
# Results summary + downloads
# ---------------------------------------------------------------------------
def results_summary(
result: DeduplicationResult,
original_df: pd.DataFrame,
) -> None:
"""Render summary stats and download buttons."""
removed = result.original_row_count - len(result.deduplicated_df)
# Summary metrics
col1, col2, col3, col4 = st.columns(4)
col1.metric("Rows In", result.original_row_count)
col2.metric("Rows Out", len(result.deduplicated_df))
col3.metric("Removed", removed)
col4.metric("Groups", len(result.match_groups))
st.divider()
# Download buttons
dl_left, dl_mid, dl_right = st.columns(3)
with dl_left:
csv_bytes = result.deduplicated_df.to_csv(index=False).encode("utf-8-sig")
st.download_button(
"Download Deduplicated CSV",
data=csv_bytes,
file_name="deduplicated.csv",
mime="text/csv",
)
with dl_mid:
if not result.removed_df.empty:
removed_bytes = result.removed_df.to_csv(index=False).encode("utf-8-sig")
st.download_button(
"Download Removed Rows",
data=removed_bytes,
file_name="removed_rows.csv",
mime="text/csv",
)
with dl_right:
if result.match_groups:
groups_data = _build_match_groups_csv(result, original_df)
st.download_button(
"Download Match Groups Report",
data=groups_data,
file_name="match_groups.csv",
mime="text/csv",
)
def _build_match_groups_csv(
result: DeduplicationResult,
original_df: pd.DataFrame,
) -> bytes:
"""Build the match groups audit CSV as bytes."""
rows = []
for g in result.match_groups:
for idx in g.row_indices:
row_data = {
"_group_id": g.group_id + 1,
"_is_survivor": idx == g.survivor_index,
"_confidence": g.confidence,
"_matched_on": ", ".join(g.matched_on),
"_original_row": idx + 1,
}
for col in original_df.columns:
if not str(col).startswith("_norm_"):
row_data[col] = original_df.iloc[idx].get(col, "") if idx < len(original_df) else ""
rows.append(row_data)
groups_df = pd.DataFrame(rows)
return groups_df.to_csv(index=False).encode("utf-8-sig")

View File

@@ -0,0 +1,11 @@
customer_id,name,email,city
E001,Björn Andersson,bjorn@example.com,Malmö
E002,Bjorn Andersson,bjorn@example.com,Malmo
E003,François Dupont,f.dupont@example.com,Montréal
E004,Francois Dupont,f.dupont@example.com,Montreal
E005,José García,jgarcia@example.com,México
E006,Jose Garcia,jgarcia@example.com,Mexico
E007,Müller GmbH,kontakt@muller.de,Köln
E008,Muller GmbH,kontakt@muller.de,Koln
E009,Unique Person,unique@example.com,London
E010,Another Unique,another@example.com,Paris
1 customer_id name email city
2 E001 Björn Andersson bjorn@example.com Malmö
3 E002 Bjorn Andersson bjorn@example.com Malmo
4 E003 François Dupont f.dupont@example.com Montréal
5 E004 Francois Dupont f.dupont@example.com Montreal
6 E005 José García jgarcia@example.com México
7 E006 Jose Garcia jgarcia@example.com Mexico
8 E007 Müller GmbH kontakt@muller.de Köln
9 E008 Muller GmbH kontakt@muller.de Koln
10 E009 Unique Person unique@example.com London
11 E010 Another Unique another@example.com Paris

View File

@@ -0,0 +1,6 @@
id;name;email;amount
1;Test User;test1@example.com;1.234,56
2;Test User;TEST1@example.com;1.234,56
3;Other User;test2@example.com;987,00
4;Other User;test2@example.com;987,00
5;Unique;unique@example.com;500,00
1 id name email amount
2 1 Test User test1@example.com 1.234,56
3 2 Test User TEST1@example.com 1.234,56
4 3 Other User test2@example.com 987,00
5 4 Other User test2@example.com 987,00
6 5 Unique unique@example.com 500,00

View File

@@ -0,0 +1,6 @@
sku product price
A-100 Widget 9.99
A-100 Widget Standard 9.99
A-101 Widget Plus 12.99
A-100 WIDGET 9.99
B-200 Gadget 15.00
1 sku product price
2 A-100 Widget 9.99
3 A-100 Widget Standard 9.99
4 A-101 Widget Plus 12.99
5 A-100 WIDGET 9.99
6 B-200 Gadget 15.00

View File

@@ -0,0 +1,6 @@
customer_id,name,email
B001,Andersén,andersen@example.com
B002,Andersen,andersen@example.com
B003,Smith,smith@example.com
B004,SMITH,Smith@Example.com
B005,Lee,lee@example.com
1 customer_id name email
2 B001 Andersén andersen@example.com
3 B002 Andersen andersen@example.com
4 B003 Smith smith@example.com
5 B004 SMITH Smith@Example.com
6 B005 Lee lee@example.com

View File

@@ -0,0 +1,16 @@
customer_id,first_name,last_name,email,phone,signup_date
C001,John,Smith,john.smith@gmail.com,(555) 123-4567,2025-01-15
C002,John,Smith,JohnSmith@Gmail.com,555-123-4567,2025-02-03
C003,John,Smith,j.o.h.n.smith+shop@gmail.com,+15551234567,2025-02-19
C004,Jon,Smith,jon.smith@yahoo.com,(555) 987-6543,2025-01-22
C005,Maria,Garcia,maria@example.com,555.222.3333,2025-03-01
C006,Maria,Garcia,maria@example.com,5552223333,2025-03-10
C007,Maria,Garcia,maria@example.com,+1 (555) 222-3333,2025-03-15
C008,David,Lee,david.lee@work.com,(555) 444-1111,2025-02-01
C009,David,Lee,dlee@personal.com,555-444-1111,2025-02-28
C010,Sarah,Jones,sarah@example.com ,(555) 666-7777,2025-01-08
C011,Sarah,Jones,sarah@example.com,(555) 666-7777,2025-01-09
C012,Alice,Wong,alice.wong@example.com,(555) 333-9999,2025-03-22
C013,Robert,Brown,rob.brown@example.com,(555) 888-2222,2025-03-25
C014,Catherine,Doe,cdoe@example.com,(555) 111-0000,2025-01-30
C015,Katherine,Doe,cdoe@example.com,(555) 111-0000,2025-04-02
1 customer_id first_name last_name email phone signup_date
2 C001 John Smith john.smith@gmail.com (555) 123-4567 2025-01-15
3 C002 John Smith JohnSmith@Gmail.com 555-123-4567 2025-02-03
4 C003 John Smith j.o.h.n.smith+shop@gmail.com +15551234567 2025-02-19
5 C004 Jon Smith jon.smith@yahoo.com (555) 987-6543 2025-01-22
6 C005 Maria Garcia maria@example.com 555.222.3333 2025-03-01
7 C006 Maria Garcia maria@example.com 5552223333 2025-03-10
8 C007 Maria Garcia maria@example.com +1 (555) 222-3333 2025-03-15
9 C008 David Lee david.lee@work.com (555) 444-1111 2025-02-01
10 C009 David Lee dlee@personal.com 555-444-1111 2025-02-28
11 C010 Sarah Jones sarah@example.com (555) 666-7777 2025-01-08
12 C011 Sarah Jones sarah@example.com (555) 666-7777 2025-01-09
13 C012 Alice Wong alice.wong@example.com (555) 333-9999 2025-03-22
14 C013 Robert Brown rob.brown@example.com (555) 888-2222 2025-03-25
15 C014 Catherine Doe cdoe@example.com (555) 111-0000 2025-01-30
16 C015 Katherine Doe cdoe@example.com (555) 111-0000 2025-04-02

View File

@@ -0,0 +1,16 @@
sku,product_name,price,stock,category
DOG-001,Dog Collar - Red - Large,12.99,45,Collars
DOG-001 ,Dog Collar Red Large,12.99,12,Collars
dog-001,"Dog Collar, Red, L",12.99,8,Collars
DOG-001 ,Red Dog Collar (Large),12.99,3,Collars
DOG-002,Dog Collar - Red - Medium,11.99,20,Collars
DOG-002,Dog Collar Red Medium,11.99,15,Collars
CAT-100,Cat Scratching Post 36 inch,29.99,10,Furniture
CAT-100,"Cat Scratching Post 36""",29.99,5,Furniture
CAT-100,Cat Scratch Post 36in,29.99,2,Furniture
CAT-101,Cat Scratching Post 36 inch,31.99,8,Furniture
BIRD-50,Parrot Cage Large,89.00,4,Cages
BIRD-50,Parrot Cage Large,89.00,1,Cages
FISH-22,Aquarium Filter 20gal,45.50,12,Aquatics
RABT-7,Rabbit Hutch Outdoor,199.99,3,Cages
DOG-555,Dog Bed Memory Foam XL,75.00,8,Beds
1 sku product_name price stock category
2 DOG-001 Dog Collar - Red - Large 12.99 45 Collars
3 DOG-001 Dog Collar Red Large 12.99 12 Collars
4 dog-001 Dog Collar, Red, L 12.99 8 Collars
5 DOG-001 Red Dog Collar (Large) 12.99 3 Collars
6 DOG-002 Dog Collar - Red - Medium 11.99 20 Collars
7 DOG-002 Dog Collar Red Medium 11.99 15 Collars
8 CAT-100 Cat Scratching Post 36 inch 29.99 10 Furniture
9 CAT-100 Cat Scratching Post 36" 29.99 5 Furniture
10 CAT-100 Cat Scratch Post 36in 29.99 2 Furniture
11 CAT-101 Cat Scratching Post 36 inch 31.99 8 Furniture
12 BIRD-50 Parrot Cage Large 89.00 4 Cages
13 BIRD-50 Parrot Cage Large 89.00 1 Cages
14 FISH-22 Aquarium Filter 20gal 45.50 12 Aquatics
15 RABT-7 Rabbit Hutch Outdoor 199.99 3 Cages
16 DOG-555 Dog Bed Memory Foam XL 75.00 8 Beds

View File

@@ -0,0 +1,16 @@
cart_id,email,abandoned_at,cart_value,items_count
AC1001,buyer1@example.com,2026-04-10 14:23:00,87.50,3
AC1018,buyer1@example.com,2026-04-12 09:11:00,120.00,4
AC1042,BUYER1@example.com,2026-04-15 18:55:00,65.25,2
AC1003,buyer2@example.com,2026-04-10 10:00:00,45.00,1
AC1009,buyer2@example.com,2026-04-10 16:30:00,52.00,2
AC1011,j.doe@gmail.com,2026-04-11 12:00:00,200.00,5
AC1027,jdoe@gmail.com,2026-04-13 14:00:00,180.00,4
AC1004,alice@example.com,2026-04-10 11:00:00,30.00,1
AC1006,bob@example.com,2026-04-10 13:00:00,75.00,2
AC1019,carol@example.com,2026-04-12 15:00:00,90.00,3
AC1031,dan@example.com,2026-04-13 17:00:00,55.00,2
AC1050,eve@example.com,2026-04-16 08:00:00,112.00,3
AC1051,eve@example.com,2026-04-16 08:01:00,112.00,3
AC1099,frank@example.com,2026-01-05 10:00:00,40.00,1
AC1187,frank@example.com,2026-04-18 15:00:00,60.00,2
1 cart_id email abandoned_at cart_value items_count
2 AC1001 buyer1@example.com 2026-04-10 14:23:00 87.50 3
3 AC1018 buyer1@example.com 2026-04-12 09:11:00 120.00 4
4 AC1042 BUYER1@example.com 2026-04-15 18:55:00 65.25 2
5 AC1003 buyer2@example.com 2026-04-10 10:00:00 45.00 1
6 AC1009 buyer2@example.com 2026-04-10 16:30:00 52.00 2
7 AC1011 j.doe@gmail.com 2026-04-11 12:00:00 200.00 5
8 AC1027 jdoe@gmail.com 2026-04-13 14:00:00 180.00 4
9 AC1004 alice@example.com 2026-04-10 11:00:00 30.00 1
10 AC1006 bob@example.com 2026-04-10 13:00:00 75.00 2
11 AC1019 carol@example.com 2026-04-12 15:00:00 90.00 3
12 AC1031 dan@example.com 2026-04-13 17:00:00 55.00 2
13 AC1050 eve@example.com 2026-04-16 08:00:00 112.00 3
14 AC1051 eve@example.com 2026-04-16 08:01:00 112.00 3
15 AC1099 frank@example.com 2026-01-05 10:00:00 40.00 1
16 AC1187 frank@example.com 2026-04-18 15:00:00 60.00 2

View File

@@ -0,0 +1,15 @@
source_order_id,source,order_date,customer_email,customer_name,total
SHOP-1001,shopify,2026-04-01,kara.miller@gmail.com,Kara Miller,45.99
AMZ-A789X,amazon,2026-04-03,kara.miller@gmail.com,K Miller,32.50
MAN-2026-04,manual,2026-04-05,kara@miller-design.com,Kara Miller,120.00
SHOP-1023,shopify,2026-04-08,tom.harris@example.com,Tom Harris,67.00
MAN-2026-09,manual,2026-04-10,tom.haris@example.com,Tom Harris,55.00
SHOP-1045,shopify,2026-04-12,jenny.l@example.com,Jenny Lee,89.00
AMZ-B221Y,amazon,2026-04-15,JENNY.L@EXAMPLE.COM,Jenny L.,44.50
SHOP-2001,shopify,2026-04-02,alex@example.com,Alex Park,33.00
SHOP-2002,shopify,2026-04-02,amy@example.com,Amy Park,33.00
SHOP-3001,shopify,2026-04-06,single1@example.com,Single Buyer One,78.00
AMZ-C100Z,amazon,2026-04-07,single2@example.com,Single Buyer Two,92.00
MAN-2026-12,manual,2026-04-09,single3@example.com,Single Buyer Three,150.00
SHOP-4001,shopify,2026-04-11,sales@bigcorp.com,Mike Johnson,200.00
SHOP-4012,shopify,2026-04-14,sales@bigcorp.com,Sarah Wright,175.00
1 source_order_id source order_date customer_email customer_name total
2 SHOP-1001 shopify 2026-04-01 kara.miller@gmail.com Kara Miller 45.99
3 AMZ-A789X amazon 2026-04-03 kara.miller@gmail.com K Miller 32.50
4 MAN-2026-04 manual 2026-04-05 kara@miller-design.com Kara Miller 120.00
5 SHOP-1023 shopify 2026-04-08 tom.harris@example.com Tom Harris 67.00
6 MAN-2026-09 manual 2026-04-10 tom.haris@example.com Tom Harris 55.00
7 SHOP-1045 shopify 2026-04-12 jenny.l@example.com Jenny Lee 89.00
8 AMZ-B221Y amazon 2026-04-15 JENNY.L@EXAMPLE.COM Jenny L. 44.50
9 SHOP-2001 shopify 2026-04-02 alex@example.com Alex Park 33.00
10 SHOP-2002 shopify 2026-04-02 amy@example.com Amy Park 33.00
11 SHOP-3001 shopify 2026-04-06 single1@example.com Single Buyer One 78.00
12 AMZ-C100Z amazon 2026-04-07 single2@example.com Single Buyer Two 92.00
13 MAN-2026-12 manual 2026-04-09 single3@example.com Single Buyer Three 150.00
14 SHOP-4001 shopify 2026-04-11 sales@bigcorp.com Mike Johnson 200.00
15 SHOP-4012 shopify 2026-04-14 sales@bigcorp.com Sarah Wright 175.00

View File

@@ -0,0 +1,16 @@
email,first_name,source,subscribed_date,tags
sub1@example.com,Pat,newsletter_signup,2025-08-15,newsletter
SUB1@example.com,Pat,facebook_lead_form,2025-09-01,fb_q3_2025
sub1@EXAMPLE.com,Patricia,checkout_optin,2025-10-12,customer
sub1@example.com ,Pat W,popup_form,2026-01-08,popup_homepage
mike.smith@gmail.com,Mike,import_2024,2024-11-20,legacy
mikesmith@gmail.com,Mike S,newsletter_signup,2025-03-15,newsletter
m.i.k.e.s.m.i.t.h@gmail.com,Michael,facebook_lead_form,2025-07-22,fb_q3_2025
promos+freebie@gmail.com,Sam,freebie_download,2025-12-01,lead_magnet
promos@gmail.com,Sam,newsletter_signup,2026-02-15,newsletter
unique1@example.com,Alpha,newsletter_signup,2025-09-10,newsletter
unique2@example.com,Beta,popup_form,2025-10-05,popup_homepage
unique3@example.com,Gamma,facebook_lead_form,2025-11-12,fb_q4_2025
unique4@example.com,Delta,checkout_optin,2026-01-20,customer
jdoe@company-a.com,Jane Doe,newsletter_signup,2025-08-01,newsletter
jdoe@company-b.com,Jane Doe,newsletter_signup,2025-08-02,newsletter
1 email first_name source subscribed_date tags
2 sub1@example.com Pat newsletter_signup 2025-08-15 newsletter
3 SUB1@example.com Pat facebook_lead_form 2025-09-01 fb_q3_2025
4 sub1@EXAMPLE.com Patricia checkout_optin 2025-10-12 customer
5 sub1@example.com Pat W popup_form 2026-01-08 popup_homepage
6 mike.smith@gmail.com Mike import_2024 2024-11-20 legacy
7 mikesmith@gmail.com Mike S newsletter_signup 2025-03-15 newsletter
8 m.i.k.e.s.m.i.t.h@gmail.com Michael facebook_lead_form 2025-07-22 fb_q3_2025
9 promos+freebie@gmail.com Sam freebie_download 2025-12-01 lead_magnet
10 promos@gmail.com Sam newsletter_signup 2026-02-15 newsletter
11 unique1@example.com Alpha newsletter_signup 2025-09-10 newsletter
12 unique2@example.com Beta popup_form 2025-10-05 popup_homepage
13 unique3@example.com Gamma facebook_lead_form 2025-11-12 fb_q4_2025
14 unique4@example.com Delta checkout_optin 2026-01-20 customer
15 jdoe@company-a.com Jane Doe newsletter_signup 2025-08-01 newsletter
16 jdoe@company-b.com Jane Doe newsletter_signup 2025-08-02 newsletter

View File

@@ -0,0 +1,19 @@
txn_date,description,amount,balance_after,export_batch
2026-04-10,ACH DEPOSIT PAYROLL,2500.00,5230.45,exportA
2026-04-10,ACH Dep Payroll,2500.00,5230.45,exportB
2026-04-11,POS PURCHASE STARBUCKS #4421,-6.75,5223.70,exportA
2026-04-11,POS Purchase Starbucks 4421,-6.75,5223.70,exportB
2026-04-12,CHECK #1042,-450.00,4773.70,exportA
2026-04-12,CHECK 1042,-450.00,4773.70,exportB
2026-04-13,ATM WITHDRAWAL ATM0019,-100.00,4673.70,exportA
2026-04-13,ATM Withdrawal ATM0019,-100.00,4673.70,exportB
2026-04-15,ONLINE TRANSFER TO SAVINGS,-200.00,4473.70,exportA
2026-04-15,Online Transfer to Savings,-200.00,4473.70,exportB
2026-04-02,ACH DEBIT UTILITY CO,-145.00,2730.45,exportA
2026-04-05,POS PURCHASE GROCERY MART,-87.32,2643.13,exportA
2026-04-08,DEPOSIT MOBILE CHECK,200.00,2843.13,exportA
2026-04-20,ACH DEBIT INSURANCE CO,-220.00,4253.70,exportB
2026-04-22,POS PURCHASE GAS STATION,-45.10,4208.60,exportB
2026-04-25,INTEREST PAYMENT,0.85,4209.45,exportB
2026-04-11,POS PURCHASE STARBUCKS #4421,-6.75,5216.95,exportA
2026-04-11,POS PURCHASE PEET'S #1102,-6.75,5210.20,exportA
1 txn_date description amount balance_after export_batch
2 2026-04-10 ACH DEPOSIT PAYROLL 2500.00 5230.45 exportA
3 2026-04-10 ACH Dep Payroll 2500.00 5230.45 exportB
4 2026-04-11 POS PURCHASE STARBUCKS #4421 -6.75 5223.70 exportA
5 2026-04-11 POS Purchase Starbucks 4421 -6.75 5223.70 exportB
6 2026-04-12 CHECK #1042 -450.00 4773.70 exportA
7 2026-04-12 CHECK 1042 -450.00 4773.70 exportB
8 2026-04-13 ATM WITHDRAWAL ATM0019 -100.00 4673.70 exportA
9 2026-04-13 ATM Withdrawal ATM0019 -100.00 4673.70 exportB
10 2026-04-15 ONLINE TRANSFER TO SAVINGS -200.00 4473.70 exportA
11 2026-04-15 Online Transfer to Savings -200.00 4473.70 exportB
12 2026-04-02 ACH DEBIT UTILITY CO -145.00 2730.45 exportA
13 2026-04-05 POS PURCHASE GROCERY MART -87.32 2643.13 exportA
14 2026-04-08 DEPOSIT MOBILE CHECK 200.00 2843.13 exportA
15 2026-04-20 ACH DEBIT INSURANCE CO -220.00 4253.70 exportB
16 2026-04-22 POS PURCHASE GAS STATION -45.10 4208.60 exportB
17 2026-04-25 INTEREST PAYMENT 0.85 4209.45 exportB
18 2026-04-11 POS PURCHASE STARBUCKS #4421 -6.75 5216.95 exportA
19 2026-04-11 POS PURCHASE PEET'S #1102 -6.75 5210.20 exportA

View File

@@ -0,0 +1,16 @@
vendor_name,ein,contact_email,phone,source
Acme Corp,12-3456789,billing@acme.com,(555) 100-2000,quickbooks
Acme Corporation,12-3456789,ap@acme.com,555-100-2000,spreadsheet
ACME CORP.,12-3456789,billing@acme.com,5551002000,email_extract
Beta Solutions LLC,98-7654321,info@betasolutions.com,(555) 200-3000,quickbooks
"Beta Solutions, LLC",,billing@betasolutions.com,555.200.3000,spreadsheet
beta solutions,98-7654321,,(555) 200-3000,email_extract
Smith Consulting Inc,11-1111111,ap@smithconsulting.com,(555) 300-4000,quickbooks
Smith Consulting LLC,22-2222222,ap@smith-consulting.com,(555) 300-4001,quickbooks
Gamma Industries Inc,33-3333333,ap@gamma.com,(555) 400-5000,quickbooks
Gamma Industries Inc DBA QuickPrint,33-3333333,billing@quickprint.com,(555) 400-5001,spreadsheet
"Delta Services, Inc. ",44-4444444,ap@delta.com,(555) 500-6000,email_extract
Delta Services Inc,44-4444444,ap@delta.com,(555) 500-6000,quickbooks
Unique Vendor One,55-5555555,u1@example.com,(555) 600-7000,quickbooks
Unique Vendor Two,66-6666666,u2@example.com,(555) 700-8000,spreadsheet
Unique Vendor Three,77-7777777,u3@example.com,(555) 800-9000,email_extract
1 vendor_name ein contact_email phone source
2 Acme Corp 12-3456789 billing@acme.com (555) 100-2000 quickbooks
3 Acme Corporation 12-3456789 ap@acme.com 555-100-2000 spreadsheet
4 ACME CORP. 12-3456789 billing@acme.com 5551002000 email_extract
5 Beta Solutions LLC 98-7654321 info@betasolutions.com (555) 200-3000 quickbooks
6 Beta Solutions, LLC billing@betasolutions.com 555.200.3000 spreadsheet
7 beta solutions 98-7654321 (555) 200-3000 email_extract
8 Smith Consulting Inc 11-1111111 ap@smithconsulting.com (555) 300-4000 quickbooks
9 Smith Consulting LLC 22-2222222 ap@smith-consulting.com (555) 300-4001 quickbooks
10 Gamma Industries Inc 33-3333333 ap@gamma.com (555) 400-5000 quickbooks
11 Gamma Industries Inc DBA QuickPrint 33-3333333 billing@quickprint.com (555) 400-5001 spreadsheet
12 Delta Services, Inc. 44-4444444 ap@delta.com (555) 500-6000 email_extract
13 Delta Services Inc 44-4444444 ap@delta.com (555) 500-6000 quickbooks
14 Unique Vendor One 55-5555555 u1@example.com (555) 600-7000 quickbooks
15 Unique Vendor Two 66-6666666 u2@example.com (555) 700-8000 spreadsheet
16 Unique Vendor Three 77-7777777 u3@example.com (555) 800-9000 email_extract

View File

@@ -0,0 +1,15 @@
customer_id,name,email,phone,address,city,state,zip,last_purchase
CUST-A1,Linda Park,linda.park@example.com,,,,,,2025-03-15
CUST-A2,Linda Park,,(555) 123-9999,1234 Oak St,Portland,OR,97201,
CUST-A3,L. Park,linda.park@example.com,(555) 123-9999,1234 Oak Street,,,97201,2026-01-10
CUST-B1,James Wilson,jwilson@example.com,(555) 444-8888,,Austin,TX,,2025-11-22
CUST-B2,James Wilson,jwilson@example.com,,789 Pine Ave,Austin,TX,78701,
CUST-C1,Anna Chen,anna.chen@example.com,,,,,,2024-08-01
CUST-C2,Anna Chen,anna.chen@example.com,(555) 222-7777,,,,,2025-02-14
CUST-C3,Anna Chen,,(555) 222-7777,456 Elm Dr,Seattle,WA,,2025-09-30
CUST-C4,A Chen,anna.chen@example.com,(555) 222-7777,456 Elm Drive,Seattle,WA,98101,2026-03-20
CUST-D1,Ricardo Lopez,rlopez@example.com,(555) 666-1212,999 Maple Ln,Denver,CO,80202,2026-02-05
CUST-E1,John A Brown,jabrown@example.com,(555) 111-3333,111 First St,Boston,MA,02101,2026-01-05
CUST-E2,John B Brown,jbbrown@example.com,(555) 111-4444,222 Second St,Boston,MA,02102,2026-01-06
CUST-F1,Maria Costa,,,333 Bay Rd,Miami,FL,33101,
CUST-F2,Maria Costa,mcosta@example.com,(555) 777-2222,333 Bay Rd,Miami,FL,33101,2026-04-01
1 customer_id name email phone address city state zip last_purchase
2 CUST-A1 Linda Park linda.park@example.com 2025-03-15
3 CUST-A2 Linda Park (555) 123-9999 1234 Oak St Portland OR 97201
4 CUST-A3 L. Park linda.park@example.com (555) 123-9999 1234 Oak Street 97201 2026-01-10
5 CUST-B1 James Wilson jwilson@example.com (555) 444-8888 Austin TX 2025-11-22
6 CUST-B2 James Wilson jwilson@example.com 789 Pine Ave Austin TX 78701
7 CUST-C1 Anna Chen anna.chen@example.com 2024-08-01
8 CUST-C2 Anna Chen anna.chen@example.com (555) 222-7777 2025-02-14
9 CUST-C3 Anna Chen (555) 222-7777 456 Elm Dr Seattle WA 2025-09-30
10 CUST-C4 A Chen anna.chen@example.com (555) 222-7777 456 Elm Drive Seattle WA 98101 2026-03-20
11 CUST-D1 Ricardo Lopez rlopez@example.com (555) 666-1212 999 Maple Ln Denver CO 80202 2026-02-05
12 CUST-E1 John A Brown jabrown@example.com (555) 111-3333 111 First St Boston MA 02101 2026-01-05
13 CUST-E2 John B Brown jbbrown@example.com (555) 111-4444 222 Second St Boston MA 02102 2026-01-06
14 CUST-F1 Maria Costa 333 Bay Rd Miami FL 33101
15 CUST-F2 Maria Costa mcosta@example.com (555) 777-2222 333 Bay Rd Miami FL 33101 2026-04-01

View File

@@ -0,0 +1,15 @@
expense_id,employee,expense_date,vendor,amount,description,submitted_at
EXP-001,Tom R,2026-03-15,Marriott,234.56,Hotel client visit,2026-03-18 09:00
EXP-019,Tom R,2026-03-15,Marriott Hotels,234.56,Hotel - client mtg,2026-04-02 14:30
EXP-005,Tom R,2026-03-16,Steakhouse,187.45,Client dinner with team,2026-03-19 10:15
EXP-006,Sara K,2026-03-16,Steakhouse,187.45,Client dinner,2026-03-19 11:00
EXP-010,Mike P,2026-03-20,Local Cafe,12.50,Lunch with vendor,2026-03-21 09:00
EXP-011,Mike P,2026-03-20,Local Cafe,8.75,Coffee meeting,2026-03-21 09:01
EXP-022,Anna L,2026-03-25,Uber,45.20,Airport transfer,2026-03-26 08:00
EXP-024,Anna L,2026-03-25,Uber,45.20,Airport transfer to hotel,2026-03-26 08:15
EXP-027,Anna L,2026-03-25,UBER,45.20,airport->hotel,2026-03-26 09:30
EXP-033,Ben T,2026-03-28,Office Depot,89.99,Printer ink,2026-03-30 13:00
EXP-040,Tom R,2026-04-01,Hyatt,234.56,Hotel different city,2026-04-03 10:00
EXP-050,Sara K,2026-04-05,American Airlines,412.00,Flight to Chicago,2026-04-06 09:00
EXP-051,Mike P,2026-04-05,Hertz,189.50,Rental car,2026-04-06 10:00
EXP-052,Anna L,2026-04-06,Subway,11.25,Lunch,2026-04-07 12:00
1 expense_id employee expense_date vendor amount description submitted_at
2 EXP-001 Tom R 2026-03-15 Marriott 234.56 Hotel client visit 2026-03-18 09:00
3 EXP-019 Tom R 2026-03-15 Marriott Hotels 234.56 Hotel - client mtg 2026-04-02 14:30
4 EXP-005 Tom R 2026-03-16 Steakhouse 187.45 Client dinner with team 2026-03-19 10:15
5 EXP-006 Sara K 2026-03-16 Steakhouse 187.45 Client dinner 2026-03-19 11:00
6 EXP-010 Mike P 2026-03-20 Local Cafe 12.50 Lunch with vendor 2026-03-21 09:00
7 EXP-011 Mike P 2026-03-20 Local Cafe 8.75 Coffee meeting 2026-03-21 09:01
8 EXP-022 Anna L 2026-03-25 Uber 45.20 Airport transfer 2026-03-26 08:00
9 EXP-024 Anna L 2026-03-25 Uber 45.20 Airport transfer to hotel 2026-03-26 08:15
10 EXP-027 Anna L 2026-03-25 UBER 45.20 airport->hotel 2026-03-26 09:30
11 EXP-033 Ben T 2026-03-28 Office Depot 89.99 Printer ink 2026-03-30 13:00
12 EXP-040 Tom R 2026-04-01 Hyatt 234.56 Hotel different city 2026-04-03 10:00
13 EXP-050 Sara K 2026-04-05 American Airlines 412.00 Flight to Chicago 2026-04-06 09:00
14 EXP-051 Mike P 2026-04-05 Hertz 189.50 Rental car 2026-04-06 10:00
15 EXP-052 Anna L 2026-04-06 Subway 11.25 Lunch 2026-04-07 12:00

View File

@@ -0,0 +1,17 @@
id,full_name,email,phone,company,notes
1,Alice Johnson , alice@ex.com ,(555) 100-1000,Ex Corp ,VIP client
2,alice johnson,alice@ex.com,5551001000,Ex Corp,
3,Bob OBrien,bob.obrien@example.com,(555) 200-2000,OBrien LLC,intro by Alice
4,Bob O'Brien,BOB.OBRIEN@example.com,555-200-2000,O'Brien LLC,Intro by Alice
5,Carol Wu,carol@example.com,(555) 300-3000,WuCo,follow up
6,Carol Wu,carolexample.com,(555) 300-3000,WuCo,follow up Q2
7,Derek Hall,dhall@example.com,(555) 400-4000,Hall Industries,
8,Derek Hall,dhall@example.com,(555) 400-4099,Hall Inds.,
9,Eve Martin,eve@example.com,(555) 500-5000,Martin Co,lead
10,Eve Martin,eve@example.com,(555) 500-5000,Martin Co,lead
11,Frank Ng,fng@example.com,(555) 600-6000,Ng Group,
12,Grace Park,gpark@example.com,(555) 700-7000,Park & Co,
13,Henry Lo,hlo@example.com,(555) 800-8000,Lo Holdings Inc.,renewal
14,Henry Lo,hlo@example.com,(555) 800-8000,"Lo Holdings, Inc",renewal Q2
15,Ivan Kim,ikim@example.com,(555) 900-9000,Kim Solutions,
16,Ivan Kim,ikim@example.com,(555) 900-9000,Kim Solutions,
1 id full_name email phone company notes
2 1 Alice Johnson alice@ex.com (555) 100-1000 Ex Corp VIP client
3 2 alice johnson alice@ex.com 5551001000 Ex Corp
4 3 Bob O’Brien bob.obrien@example.com (555) 200-2000 OBrien LLC intro by Alice
5 4 Bob O'Brien BOB.OBRIEN@example.com 555-200-2000 O'Brien LLC Intro by Alice
6 5 Carol Wu carol@example.com (555) 300-3000 WuCo follow up
7 6 Carol Wu carol@example.com (555) 300-3000 WuCo follow up Q2
8 7 Derek Hall dhall@example.com (555) 400-4000 Hall Industries
9 8 Derek Hall dhall@example.com (555) 400-4099 Hall Inds.
10 9 Eve Martin eve@example.com (555) 500-5000 Martin Co lead
11 10 Eve Martin eve@example.com (555) 500-5000 Martin Co lead
12 11 Frank Ng fng@example.com (555) 600-6000 Ng Group
13 12 Grace Park gpark@example.com (555) 700-7000 Park & Co
14 13 Henry Lo hlo@example.com (555) 800-8000 Lo Holdings Inc. renewal
15 14 Henry Lo hlo@example.com (555) 800-8000 Lo Holdings, Inc renewal Q2
16 15 Ivan Kim ikim@example.com (555) 900-9000 Kim Solutions
17 16 Ivan Kim ikim@example.com (555) 900-9000 Kim Solutions

View File

@@ -0,0 +1,15 @@
response_id,submitted_at,ip_address,respondent_email,q1_satisfaction,q2_recommend,q3_comments
R0001,2026-04-20 09:15:22,192.168.1.50,user1@example.com,5,Yes,Great service
R0002,2026-04-20 09:17:45,10.0.0.12,user1@example.com,5,Yes,Great service
R0010,2026-04-20 10:30:00,192.168.1.51,user2@example.com,4,Yes,Good but could improve
R0011,2026-04-20 10:35:12,192.168.1.52,user2@example.com,5,Yes,Good but could improve
R0020,2026-04-20 11:00:00,192.168.1.55,User3@Example.com,3,Maybe,OK
R0021,2026-04-20 11:02:30,192.168.1.55,user3@example.com,3,Maybe,OK
R0030,2026-04-20 14:00:00,73.55.10.100,spouse1@example.com,5,Yes,Loved it
R0031,2026-04-20 14:30:00,73.55.10.100,spouse2@example.com,4,Yes,Liked it
R0040,2026-04-05 08:00:00,192.168.2.10,user4@example.com,3,Maybe,Mixed
R0041,2026-04-19 15:00:00,192.168.2.11,user4@example.com,5,Yes,Got better
R0050,2026-04-20 08:00:00,100.10.10.10,unique1@example.com,5,Yes,Excellent
R0051,2026-04-20 09:00:00,100.10.10.11,unique2@example.com,4,Yes,Solid
R0052,2026-04-20 10:00:00,100.10.10.12,unique3@example.com,2,No,Disappointing
R0053,2026-04-20 11:00:00,100.10.10.13,unique4@example.com,5,Yes,Recommended
1 response_id submitted_at ip_address respondent_email q1_satisfaction q2_recommend q3_comments
2 R0001 2026-04-20 09:15:22 192.168.1.50 user1@example.com 5 Yes Great service
3 R0002 2026-04-20 09:17:45 10.0.0.12 user1@example.com 5 Yes Great service
4 R0010 2026-04-20 10:30:00 192.168.1.51 user2@example.com 4 Yes Good but could improve
5 R0011 2026-04-20 10:35:12 192.168.1.52 user2@example.com 5 Yes Good but could improve
6 R0020 2026-04-20 11:00:00 192.168.1.55 User3@Example.com 3 Maybe OK
7 R0021 2026-04-20 11:02:30 192.168.1.55 user3@example.com 3 Maybe OK
8 R0030 2026-04-20 14:00:00 73.55.10.100 spouse1@example.com 5 Yes Loved it
9 R0031 2026-04-20 14:30:00 73.55.10.100 spouse2@example.com 4 Yes Liked it
10 R0040 2026-04-05 08:00:00 192.168.2.10 user4@example.com 3 Maybe Mixed
11 R0041 2026-04-19 15:00:00 192.168.2.11 user4@example.com 5 Yes Got better
12 R0050 2026-04-20 08:00:00 100.10.10.10 unique1@example.com 5 Yes Excellent
13 R0051 2026-04-20 09:00:00 100.10.10.11 unique2@example.com 4 Yes Solid
14 R0052 2026-04-20 10:00:00 100.10.10.12 unique3@example.com 2 No Disappointing
15 R0053 2026-04-20 11:00:00 100.10.10.13 unique4@example.com 5 Yes Recommended

View File

@@ -0,0 +1,15 @@
lead_id,first_name,last_name,email,phone,company,lead_source,captured_date
L001,Patricia,Anders,p.anders@bigco.com,(555) 100-2222,Big Co,trade_show,2026-02-15
L045,Pat,Anders,p.anders@bigco.com,5551002222,Big Co Inc,linkedin,2026-03-01
L091,Patty,Anders,panders@bigco.com,(555) 100-2222,BigCo,webinar,2026-03-22
L010,Mohammed,Khan,m.khan@example.com,(555) 200-3333,Khan Industries,trade_show,2026-02-15
L032,Muhammad,Khan,mkhan@example.com,(555) 200-3333,Khan Industries,cold_outreach,2026-02-28
L067,Mohammad,Khan,m.khan@example.com,(555) 200-3333,Khan Industries Inc,referral,2026-03-10
L020,Sarah,"Lee, VP Marketing",slee@target.com,(555) 300-4444,Target Corp,linkedin,2026-02-20
L058,Sarah,Lee,slee@target.com,(555) 300-4444,Target Corp,webinar,2026-03-05
L080,John,Adams,jadams@enterprise.com,(555) 400-5555,Enterprise Inc,trade_show,2026-03-15
L081,John,Adams II,jadams2@enterprise.com,(555) 400-5556,Enterprise Inc,trade_show,2026-03-15
L100,Wei,Chen,wchen@startup.io,(555) 500-6666,Startup IO,referral,2026-03-20
L101,Olivia,Mendez,omendez@agency.com,(555) 600-7777,Agency Co,linkedin,2026-03-21
L102,Raj,Patel,rpatel@firm.com,(555) 700-8888,Firm LLC,cold_outreach,2026-03-22
L103,Emma,Wright,ewright@consulting.com,(555) 800-9999,Wright Consulting,webinar,2026-03-23
1 lead_id first_name last_name email phone company lead_source captured_date
2 L001 Patricia Anders p.anders@bigco.com (555) 100-2222 Big Co trade_show 2026-02-15
3 L045 Pat Anders p.anders@bigco.com 5551002222 Big Co Inc linkedin 2026-03-01
4 L091 Patty Anders panders@bigco.com (555) 100-2222 BigCo webinar 2026-03-22
5 L010 Mohammed Khan m.khan@example.com (555) 200-3333 Khan Industries trade_show 2026-02-15
6 L032 Muhammad Khan mkhan@example.com (555) 200-3333 Khan Industries cold_outreach 2026-02-28
7 L067 Mohammad Khan m.khan@example.com (555) 200-3333 Khan Industries Inc referral 2026-03-10
8 L020 Sarah Lee, VP Marketing slee@target.com (555) 300-4444 Target Corp linkedin 2026-02-20
9 L058 Sarah Lee slee@target.com (555) 300-4444 Target Corp webinar 2026-03-05
10 L080 John Adams jadams@enterprise.com (555) 400-5555 Enterprise Inc trade_show 2026-03-15
11 L081 John Adams II jadams2@enterprise.com (555) 400-5556 Enterprise Inc trade_show 2026-03-15
12 L100 Wei Chen wchen@startup.io (555) 500-6666 Startup IO referral 2026-03-20
13 L101 Olivia Mendez omendez@agency.com (555) 600-7777 Agency Co linkedin 2026-03-21
14 L102 Raj Patel rpatel@firm.com (555) 700-8888 Firm LLC cold_outreach 2026-03-22
15 L103 Emma Wright ewright@consulting.com (555) 800-9999 Wright Consulting webinar 2026-03-23

View File

@@ -0,0 +1,16 @@
email,captured_date,source_channel,campaign,consent
lead1@example.com,2025-09-15,facebook_ad,back_to_school_2025,yes
lead1@example.com,2025-11-20,google_ad,black_friday_2025,yes
lead1@example.com,2026-01-08,organic_form,homepage_popup,yes
lead1@example.com,2026-03-12,facebook_ad,spring_2026,yes
lead2@example.com,2025-10-01,facebook_ad,halloween_2025,yes
lead2@example.com,2025-12-15,google_ad,holiday_2025,no
lead2@example.com,2026-02-01,organic_form,valentines_2026,yes
lead.three@gmail.com,2025-08-01,facebook_ad,summer_2025,yes
leadthree@gmail.com,2025-10-15,google_ad,fall_2025,yes
l.e.a.d.three@gmail.com,2026-02-20,organic_form,winter_2026,yes
solo1@example.com,2025-09-10,facebook_ad,back_to_school_2025,yes
solo2@example.com,2025-11-05,google_ad,black_friday_2025,yes
solo3@example.com,2026-01-20,organic_form,homepage_popup,yes
solo4@example.com,2026-03-08,facebook_ad,spring_2026,yes
solo5@example.com,2026-04-15,referral,ambassador_program,yes
1 email captured_date source_channel campaign consent
2 lead1@example.com 2025-09-15 facebook_ad back_to_school_2025 yes
3 lead1@example.com 2025-11-20 google_ad black_friday_2025 yes
4 lead1@example.com 2026-01-08 organic_form homepage_popup yes
5 lead1@example.com 2026-03-12 facebook_ad spring_2026 yes
6 lead2@example.com 2025-10-01 facebook_ad halloween_2025 yes
7 lead2@example.com 2025-12-15 google_ad holiday_2025 no
8 lead2@example.com 2026-02-01 organic_form valentines_2026 yes
9 lead.three@gmail.com 2025-08-01 facebook_ad summer_2025 yes
10 leadthree@gmail.com 2025-10-15 google_ad fall_2025 yes
11 l.e.a.d.three@gmail.com 2026-02-20 organic_form winter_2026 yes
12 solo1@example.com 2025-09-10 facebook_ad back_to_school_2025 yes
13 solo2@example.com 2025-11-05 google_ad black_friday_2025 yes
14 solo3@example.com 2026-01-20 organic_form homepage_popup yes
15 solo4@example.com 2026-03-08 facebook_ad spring_2026 yes
16 solo5@example.com 2026-04-15 referral ambassador_program yes

View File

@@ -0,0 +1,15 @@
email,fb_id,google_click_id,platform,first_seen,last_seen
audience1@example.com,fb_77881122,,facebook,2026-01-05,2026-04-20
audience1@example.com,,gclid_AAA111,google_ads,2026-02-10,2026-04-18
audience1@example.com,,,organic_form,2026-03-15,2026-03-15
audience2@example.com,fb_99887766,,facebook,2026-02-01,2026-04-15
audience2@example.com,,gclid_BBB222,google_ads,2026-02-20,2026-04-22
old.email@example.com,fb_55443322,,facebook,2025-06-10,2025-12-01
new.email@example.com,fb_55443322,,facebook,2026-01-15,2026-04-20
,fb_11223344,,facebook,2026-03-01,2026-04-01
,,gclid_CCC333,google_ads,2026-03-10,2026-04-10
fbonly@example.com,fb_44556677,,facebook,2026-03-20,2026-04-20
googleonly@example.com,,gclid_DDD444,google_ads,2026-03-25,2026-04-22
organiconly@example.com,,,organic_form,2026-04-01,2026-04-01
combined@example.com,fb_88990011,gclid_EEE555,manual_merge,2026-04-10,2026-04-25
,,,sms_list,2026-04-05,2026-04-05
1 email fb_id google_click_id platform first_seen last_seen
2 audience1@example.com fb_77881122 facebook 2026-01-05 2026-04-20
3 audience1@example.com gclid_AAA111 google_ads 2026-02-10 2026-04-18
4 audience1@example.com organic_form 2026-03-15 2026-03-15
5 audience2@example.com fb_99887766 facebook 2026-02-01 2026-04-15
6 audience2@example.com gclid_BBB222 google_ads 2026-02-20 2026-04-22
7 old.email@example.com fb_55443322 facebook 2025-06-10 2025-12-01
8 new.email@example.com fb_55443322 facebook 2026-01-15 2026-04-20
9 fb_11223344 facebook 2026-03-01 2026-04-01
10 gclid_CCC333 google_ads 2026-03-10 2026-04-10
11 fbonly@example.com fb_44556677 facebook 2026-03-20 2026-04-20
12 googleonly@example.com gclid_DDD444 google_ads 2026-03-25 2026-04-22
13 organiconly@example.com organic_form 2026-04-01 2026-04-01
14 combined@example.com fb_88990011 gclid_EEE555 manual_merge 2026-04-10 2026-04-25
15 sms_list 2026-04-05 2026-04-05

View File

@@ -0,0 +1,18 @@
email,suppression_reason,suppressed_date,source_list
supp1@example.com,unsubscribe,2025-08-15,list_main
supp1@example.com,hard_bounce,2025-09-20,list_promo
supp1@example.com,spam_complaint,2025-10-01,list_main
supp2@example.com,unsubscribe,2026-01-10,list_main
supp2@example.com,unsubscribe,2026-01-10,list_promo
supp2@example.com,unsubscribe,2026-01-10,list_newsletter
Supp3@Example.com,unsubscribe,2025-12-05,list_main
supp3@example.com,hard_bounce,2026-02-15,list_promo
bounced@gmail.com,hard_bounce,2026-03-01,list_main
b.o.u.n.c.e.d@gmail.com,hard_bounce,2026-03-01,list_promo
complainer@example.com,spam_complaint,2026-02-20,list_main
soft@example.com,soft_bounce_x3,2026-03-15,list_main
solo_unsub@example.com,unsubscribe,2026-04-01,list_main
solo_bounce@example.com,hard_bounce,2026-04-05,list_promo
solo_complaint@example.com,spam_complaint,2026-04-10,list_main
padded@example.com ,unsubscribe,2026-04-12,list_main
padded@example.com,unsubscribe,2026-04-12,list_promo
1 email suppression_reason suppressed_date source_list
2 supp1@example.com unsubscribe 2025-08-15 list_main
3 supp1@example.com hard_bounce 2025-09-20 list_promo
4 supp1@example.com spam_complaint 2025-10-01 list_main
5 supp2@example.com unsubscribe 2026-01-10 list_main
6 supp2@example.com unsubscribe 2026-01-10 list_promo
7 supp2@example.com unsubscribe 2026-01-10 list_newsletter
8 Supp3@Example.com unsubscribe 2025-12-05 list_main
9 supp3@example.com hard_bounce 2026-02-15 list_promo
10 bounced@gmail.com hard_bounce 2026-03-01 list_main
11 b.o.u.n.c.e.d@gmail.com hard_bounce 2026-03-01 list_promo
12 complainer@example.com spam_complaint 2026-02-20 list_main
13 soft@example.com soft_bounce_x3 2026-03-15 list_main
14 solo_unsub@example.com unsubscribe 2026-04-01 list_main
15 solo_bounce@example.com hard_bounce 2026-04-05 list_promo
16 solo_complaint@example.com spam_complaint 2026-04-10 list_main
17 padded@example.com unsubscribe 2026-04-12 list_main
18 padded@example.com unsubscribe 2026-04-12 list_promo

0
tests/__init__.py Normal file
View File

47
tests/conftest.py Normal file
View File

@@ -0,0 +1,47 @@
"""Shared test fixtures."""
import pandas as pd
import pytest
from pathlib import Path
SAMPLES_DIR = Path(__file__).parent.parent / "samples"
@pytest.fixture
def sample_csv_path():
return SAMPLES_DIR / "messy_sales.csv"
@pytest.fixture
def sample_df(sample_csv_path):
return pd.read_csv(sample_csv_path, dtype=str, keep_default_na=False)
@pytest.fixture
def simple_df():
"""Small DataFrame with obvious duplicates for unit testing."""
return pd.DataFrame({
"name": ["Alice", "alice", "Bob", "Charlie", "ALICE"],
"email": ["alice@test.com", "alice@test.com", "bob@test.com",
"charlie@test.com", "alice@test.com"],
"phone": ["555-1234", "555-1234", "555-5678", "555-9012", "555-1234"],
})
@pytest.fixture
def merge_df():
"""DataFrame with partial records that benefit from merge."""
return pd.DataFrame({
"name": ["John Doe", "John Doe", "Jane Smith"],
"email": ["john@test.com", "john@test.com", "jane@test.com"],
"phone": ["555-1111", "", "555-3333"],
"address": ["", "123 Main St", "456 Oak Ave"],
})
@pytest.fixture
def tmp_csv(tmp_path, simple_df):
"""Write simple_df to a temp CSV and return the path."""
path = tmp_path / "test_input.csv"
simple_df.to_csv(path, index=False)
return path

147
tests/test_cli.py Normal file
View File

@@ -0,0 +1,147 @@
"""Integration tests for the CLI via Typer's CliRunner."""
import pytest
from pathlib import Path
from typer.testing import CliRunner
from src.cli import app
runner = CliRunner()
class TestCliPreview:
def test_preview_default(self, tmp_csv):
result = runner.invoke(app, [str(tmp_csv)])
assert result.exit_code == 0
assert "preview" in result.output.lower() or "Rows in" in result.output
def test_preview_shows_row_counts(self, tmp_csv):
result = runner.invoke(app, [str(tmp_csv)])
assert result.exit_code == 0
assert "Rows in" in result.output
assert "Rows out" in result.output
def test_file_not_found(self):
result = runner.invoke(app, ["/tmp/nonexistent_xyz_abc.csv"])
assert result.exit_code != 0
assert "not found" in result.output.lower()
class TestCliApply:
def test_apply_writes_output(self, tmp_csv, tmp_path):
out = tmp_path / "output.csv"
result = runner.invoke(app, [str(tmp_csv), "--apply", "-o", str(out)])
assert result.exit_code == 0
assert out.exists()
def test_apply_default_output_name(self, tmp_csv):
result = runner.invoke(app, [str(tmp_csv), "--apply"])
assert result.exit_code == 0
expected = tmp_csv.parent / f"{tmp_csv.stem}_deduplicated.csv"
assert expected.exists()
def test_apply_creates_removed_file(self, tmp_csv):
result = runner.invoke(app, [str(tmp_csv), "--apply"])
assert result.exit_code == 0
removed = tmp_csv.parent / f"{tmp_csv.stem}_removed.csv"
# May or may not exist depending on whether duplicates were found
# with default auto-detect on simple_df
class TestCliFuzzy:
def test_fuzzy_flag(self, tmp_csv):
result = runner.invoke(app, [
str(tmp_csv), "--fuzzy", "name", "--threshold", "80",
])
assert result.exit_code == 0
def test_subset_flag(self, tmp_csv):
result = runner.invoke(app, [
str(tmp_csv), "--subset", "email",
])
assert result.exit_code == 0
def test_bad_column_error(self, tmp_csv):
result = runner.invoke(app, [
str(tmp_csv), "--subset", "nonexistent_column",
])
assert result.exit_code != 0
assert "not found" in result.output.lower()
class TestCliConfig:
def test_save_and_load_config(self, tmp_csv, tmp_path):
cfg_path = tmp_path / "my_config.json"
# Save
result = runner.invoke(app, [
str(tmp_csv), "--subset", "email", "--save-config", str(cfg_path),
])
assert result.exit_code == 0
assert cfg_path.exists()
# Load and apply
result = runner.invoke(app, [
str(tmp_csv), "--config", str(cfg_path), "--apply",
])
assert result.exit_code == 0
class TestCliSurvivor:
def test_survivor_last(self, tmp_csv):
result = runner.invoke(app, [str(tmp_csv), "--survivor", "last"])
assert result.exit_code == 0
def test_survivor_most_complete(self, tmp_csv):
result = runner.invoke(app, [str(tmp_csv), "--survivor", "most-complete"])
assert result.exit_code == 0
def test_invalid_survivor(self, tmp_csv):
result = runner.invoke(app, [str(tmp_csv), "--survivor", "bogus"])
assert result.exit_code != 0
class TestCliMerge:
def test_merge_flag(self, tmp_csv):
result = runner.invoke(app, [str(tmp_csv), "--merge", "--apply"])
assert result.exit_code == 0
class TestCliSampleData:
def test_sample_preview(self, sample_csv_path):
result = runner.invoke(app, [str(sample_csv_path)])
assert result.exit_code == 0
assert "Rows in: 50" in result.output
# Should find duplicates
assert "Removed:" in result.output
def test_sample_apply(self, sample_csv_path, tmp_path):
out = tmp_path / "deduped.csv"
result = runner.invoke(app, [
str(sample_csv_path), "--apply", "-o", str(out),
])
assert result.exit_code == 0
assert out.exists()
import pandas as pd
df = pd.read_csv(out, encoding="utf-8-sig")
# Should have fewer than 50 rows
assert len(df) < 50
def test_sample_fuzzy_with_merge(self, sample_csv_path, tmp_path):
out = tmp_path / "fuzzy_merged.csv"
result = runner.invoke(app, [
str(sample_csv_path),
"--fuzzy", "customer_name",
"--threshold", "80",
"--merge",
"--apply",
"-o", str(out),
])
assert result.exit_code == 0
assert out.exists()
class TestCliHelp:
def test_help(self):
result = runner.invoke(app, ["--help"])
assert result.exit_code == 0
assert "--apply" in result.output

102
tests/test_config.py Normal file
View File

@@ -0,0 +1,102 @@
"""Tests for src.core.config — save/load configuration profiles."""
import json
import pytest
from pathlib import Path
from src.core.config import (
DeduplicationConfig,
StrategyConfig,
ColumnStrategyConfig,
)
from src.core.dedup import Algorithm, SurvivorRule
from src.core.normalizers import NormalizerType
class TestDeduplicationConfig:
def test_default(self):
cfg = DeduplicationConfig.default()
assert cfg.survivor_rule == "first"
assert cfg.merge is False
assert cfg.strategies == []
def test_to_dict_roundtrip(self):
cfg = DeduplicationConfig(
strategies=[
StrategyConfig(columns=[
ColumnStrategyConfig(
column="email",
algorithm="exact",
threshold=100.0,
normalizer="email",
),
]),
],
survivor_rule="most_complete",
merge=True,
)
d = cfg.to_dict()
cfg2 = DeduplicationConfig.from_dict(d)
assert cfg2.survivor_rule == "most_complete"
assert cfg2.merge is True
assert len(cfg2.strategies) == 1
assert cfg2.strategies[0].columns[0].column == "email"
def test_to_file_from_file(self, tmp_path):
cfg = DeduplicationConfig(
strategies=[
StrategyConfig(columns=[
ColumnStrategyConfig(column="name", algorithm="jaro_winkler",
threshold=85.0, normalizer="name"),
]),
],
survivor_rule="last",
)
path = tmp_path / "test_config.json"
cfg.to_file(path)
loaded = DeduplicationConfig.from_file(path)
assert loaded.survivor_rule == "last"
assert len(loaded.strategies) == 1
assert loaded.strategies[0].columns[0].algorithm == "jaro_winkler"
def test_to_strategies(self):
cfg = DeduplicationConfig(
strategies=[
StrategyConfig(columns=[
ColumnStrategyConfig(column="email", algorithm="exact",
threshold=100.0, normalizer="email"),
ColumnStrategyConfig(column="phone", algorithm="exact",
threshold=100.0, normalizer="phone"),
]),
],
)
strats = cfg.to_strategies()
assert strats is not None
assert len(strats) == 1
assert len(strats[0].column_strategies) == 2
assert strats[0].column_strategies[0].algorithm == Algorithm.EXACT
assert strats[0].column_strategies[0].normalizer == NormalizerType.EMAIL
def test_to_strategies_empty(self):
cfg = DeduplicationConfig.default()
assert cfg.to_strategies() is None
def test_to_survivor_rule(self):
cfg = DeduplicationConfig(survivor_rule="most_complete")
assert cfg.to_survivor_rule() == SurvivorRule.KEEP_MOST_COMPLETE
def test_json_is_valid(self, tmp_path):
cfg = DeduplicationConfig(
strategies=[
StrategyConfig(columns=[
ColumnStrategyConfig(column="x", algorithm="exact"),
]),
],
normalize_map={"email": "email"},
)
path = tmp_path / "valid.json"
cfg.to_file(path)
data = json.loads(path.read_text())
assert "strategies" in data
assert "normalize_map" in data

258
tests/test_dedup.py Normal file
View File

@@ -0,0 +1,258 @@
"""Tests for src.core.dedup — matching engine."""
import pandas as pd
import pytest
from src.core.dedup import (
Algorithm,
ColumnMatchStrategy,
MatchStrategy,
SurvivorRule,
_compute_similarity,
_compare_pair,
_UnionFind,
build_default_strategies,
deduplicate,
)
from src.core.normalizers import NormalizerType
class TestComputeSimilarity:
def test_exact_match(self):
assert _compute_similarity("hello", "hello", Algorithm.EXACT) == 100.0
def test_exact_mismatch(self):
assert _compute_similarity("hello", "world", Algorithm.EXACT) == 0.0
def test_levenshtein_similar(self):
score = _compute_similarity("kitten", "sitting", Algorithm.LEVENSHTEIN)
assert 50 < score < 80
def test_jaro_winkler_similar(self):
score = _compute_similarity("john", "jon", Algorithm.JARO_WINKLER)
assert score > 80
def test_token_set_ratio(self):
score = _compute_similarity(
"123 main street apt 4",
"apt 4 123 main street",
Algorithm.TOKEN_SET_RATIO,
)
assert score == 100.0
class TestUnionFind:
def test_basic_union(self):
uf = _UnionFind(5)
uf.union(0, 1)
uf.union(1, 2)
assert uf.find(0) == uf.find(2) # transitive
def test_separate_groups(self):
uf = _UnionFind(5)
uf.union(0, 1)
uf.union(3, 4)
assert uf.find(0) != uf.find(3)
def test_groups(self):
uf = _UnionFind(5)
uf.union(0, 1)
uf.union(1, 2)
uf.union(3, 4)
groups = uf.groups()
assert len(groups) == 2
sizes = sorted(len(v) for v in groups.values())
assert sizes == [2, 3]
class TestComparePair:
def test_exact_match(self):
strategy = MatchStrategy(column_strategies=[
ColumnMatchStrategy(column="email", algorithm=Algorithm.EXACT),
])
row_a = pd.Series({"email": "test@example.com"})
row_b = pd.Series({"email": "test@example.com"})
is_match, conf, cols = _compare_pair(row_a, row_b, strategy)
assert is_match
assert conf == 100.0
assert cols == ["email"]
def test_exact_mismatch(self):
strategy = MatchStrategy(column_strategies=[
ColumnMatchStrategy(column="email", algorithm=Algorithm.EXACT),
])
row_a = pd.Series({"email": "a@test.com"})
row_b = pd.Series({"email": "b@test.com"})
is_match, conf, cols = _compare_pair(row_a, row_b, strategy)
assert not is_match
def test_fuzzy_match(self):
strategy = MatchStrategy(column_strategies=[
ColumnMatchStrategy(column="name", algorithm=Algorithm.JARO_WINKLER, threshold=80),
])
row_a = pd.Series({"name": "john smith"})
row_b = pd.Series({"name": "jon smith"})
is_match, conf, cols = _compare_pair(row_a, row_b, strategy)
assert is_match
assert conf > 80
def test_and_logic_both_must_match(self):
strategy = MatchStrategy(column_strategies=[
ColumnMatchStrategy(column="name", algorithm=Algorithm.EXACT),
ColumnMatchStrategy(column="email", algorithm=Algorithm.EXACT),
])
# name matches, email doesn't
row_a = pd.Series({"name": "alice", "email": "a@test.com"})
row_b = pd.Series({"name": "alice", "email": "b@test.com"})
is_match, conf, cols = _compare_pair(row_a, row_b, strategy)
assert not is_match
class TestBuildDefaultStrategies:
def test_detects_email(self):
df = pd.DataFrame({"email": ["a@b.com"], "name": ["Alice"]})
strats = build_default_strategies(df)
# email (strong, standalone) + name AND email (weak paired with strong) = 2
assert len(strats) == 2
found_email = any(
cs.column == "email" and cs.normalizer == NormalizerType.EMAIL
for s in strats for cs in s.column_strategies
)
assert found_email
# Name should only appear paired with email, not standalone
name_strats = [s for s in strats
if any(cs.column == "name" for cs in s.column_strategies)]
for s in name_strats:
assert len(s.column_strategies) >= 2, "Name should be paired with a strong key"
def test_fallback_all_columns(self):
df = pd.DataFrame({"x": [1], "y": [2], "z": [3]})
strats = build_default_strategies(df)
assert len(strats) == 1
assert len(strats[0].column_strategies) == 3
class TestDeduplicate:
def test_exact_duplicates(self, simple_df):
# Alice appears 3 times with same email
strategy = MatchStrategy(column_strategies=[
ColumnMatchStrategy(column="email", algorithm=Algorithm.EXACT),
])
result = deduplicate(simple_df, strategies=[strategy])
# 3 Alices -> 1, Bob stays, Charlie stays = 3 rows
assert len(result.deduplicated_df) == 3
assert result.original_row_count == 5
assert len(result.match_groups) == 1
def test_fuzzy_name_match(self):
df = pd.DataFrame({
"name": ["John Smith", "Jon Smith", "Jane Doe"],
"email": ["a@test.com", "b@test.com", "c@test.com"],
})
strategy = MatchStrategy(column_strategies=[
ColumnMatchStrategy(
column="name",
algorithm=Algorithm.JARO_WINKLER,
threshold=85,
normalizer=NormalizerType.NAME,
),
])
result = deduplicate(df, strategies=[strategy])
assert len(result.deduplicated_df) == 2
assert len(result.match_groups) == 1
def test_survivor_keep_last(self, simple_df):
strategy = MatchStrategy(column_strategies=[
ColumnMatchStrategy(column="email", algorithm=Algorithm.EXACT),
])
result = deduplicate(simple_df, strategies=[strategy],
survivor_rule=SurvivorRule.KEEP_LAST)
# The last Alice (index 4) should survive
assert len(result.match_groups) == 1
assert result.match_groups[0].survivor_index == 4
def test_survivor_most_complete(self, merge_df):
strategy = MatchStrategy(column_strategies=[
ColumnMatchStrategy(column="email", algorithm=Algorithm.EXACT),
])
result = deduplicate(merge_df, strategies=[strategy],
survivor_rule=SurvivorRule.KEEP_MOST_COMPLETE)
# Row 0 has phone but no address (1 empty)
# Row 1 has address but no phone (1 empty)
# Both have 1 empty, so keep_first among ties
assert len(result.deduplicated_df) == 2
def test_merge_mode(self, merge_df):
strategy = MatchStrategy(column_strategies=[
ColumnMatchStrategy(column="email", algorithm=Algorithm.EXACT),
])
result = deduplicate(merge_df, strategies=[strategy], merge=True)
# Survivor should have both phone and address filled
john_row = result.deduplicated_df[
result.deduplicated_df["name"] == "John Doe"
].iloc[0]
assert john_row["phone"] == "555-1111"
assert john_row["address"] == "123 Main St"
def test_multi_strategy_or(self):
df = pd.DataFrame({
"name": ["Alice", "Bob", "Alice B."],
"email": ["a@test.com", "a@test.com", "c@test.com"],
})
# Strategy 1: match on email
# Strategy 2: match on name (fuzzy)
strat1 = MatchStrategy(column_strategies=[
ColumnMatchStrategy(column="email", algorithm=Algorithm.EXACT),
])
strat2 = MatchStrategy(column_strategies=[
ColumnMatchStrategy(column="name", algorithm=Algorithm.JARO_WINKLER, threshold=70),
])
result = deduplicate(df, strategies=[strat1, strat2])
# All three should end up in one group via transitive closure:
# Alice~Bob (email), Alice~Alice B. (name)
assert len(result.deduplicated_df) == 1
def test_confidence_score(self, simple_df):
strategy = MatchStrategy(column_strategies=[
ColumnMatchStrategy(column="email", algorithm=Algorithm.EXACT),
])
result = deduplicate(simple_df, strategies=[strategy])
for group in result.match_groups:
assert 0 <= group.confidence <= 100
def test_preview_flag(self, simple_df):
result = deduplicate(simple_df, preview=True)
assert result.is_preview is True
result2 = deduplicate(simple_df, preview=False)
assert result2.is_preview is False
def test_auto_detect_strategies(self, sample_df):
result = deduplicate(sample_df)
# Should find duplicates in the sample data
assert len(result.match_groups) > 0
assert len(result.deduplicated_df) < len(sample_df)
def test_idempotent(self, sample_df):
"""Running dedup twice with same config produces same output."""
result1 = deduplicate(sample_df)
result2 = deduplicate(result1.deduplicated_df)
# Second pass should find no new duplicates
assert len(result2.match_groups) == 0
assert len(result2.deduplicated_df) == len(result1.deduplicated_df)
def test_review_callback(self):
df = pd.DataFrame({
"email": ["a@test.com", "a@test.com", "b@test.com"],
})
strategy = MatchStrategy(column_strategies=[
ColumnMatchStrategy(column="email", algorithm=Algorithm.EXACT),
])
# Reject all matches
result = deduplicate(df, strategies=[strategy],
review_callback=lambda g, d: False)
assert len(result.deduplicated_df) == 3 # nothing removed
# Accept all matches
result = deduplicate(df, strategies=[strategy],
review_callback=lambda g, d: True)
assert len(result.deduplicated_df) == 2

130
tests/test_io.py Normal file
View File

@@ -0,0 +1,130 @@
"""Tests for src.core.io — file reading, encoding/delimiter detection."""
import pandas as pd
import pytest
from pathlib import Path
from src.core.io import (
detect_encoding,
detect_delimiter,
detect_header_row,
read_file,
write_file,
list_sheets,
)
class TestDetectEncoding:
def test_utf8_file(self, sample_csv_path):
enc = detect_encoding(sample_csv_path)
assert enc.lower().replace("-", "") in ("utf8", "ascii", "utf8sig")
def test_empty_file(self, tmp_path):
f = tmp_path / "empty.csv"
f.write_bytes(b"")
assert detect_encoding(f) == "utf-8"
def test_bom_file(self, tmp_path):
f = tmp_path / "bom.csv"
f.write_bytes(b"\xef\xbb\xbfname,email\nAlice,a@b.com\n")
assert detect_encoding(f) == "utf-8-sig"
def test_latin1_file(self, tmp_path):
f = tmp_path / "latin.csv"
content = "name,city\nJosé,São Paulo\n".encode("latin-1")
f.write_bytes(content)
enc = detect_encoding(f)
# Should detect something compatible with latin-1 family
assert enc in ("iso-8859-1", "latin-1", "windows-1252", "cp1252",
"iso-8859-9", "cp1250", "iso-8859-15", "utf-8")
class TestDetectDelimiter:
def test_comma(self, sample_csv_path):
assert detect_delimiter(sample_csv_path) == ","
def test_tab(self, tmp_path):
f = tmp_path / "tabs.tsv"
f.write_text("name\temail\nAlice\ta@b.com\n")
assert detect_delimiter(f) == "\t"
def test_semicolon(self, tmp_path):
f = tmp_path / "semi.csv"
f.write_text("name;email;phone\nAlice;a@b.com;555\n")
assert detect_delimiter(f) == ";"
def test_pipe(self, tmp_path):
f = tmp_path / "pipe.csv"
f.write_text("name|email|phone\nAlice|a@b.com|555\n")
assert detect_delimiter(f) == "|"
class TestDetectHeaderRow:
def test_standard_csv(self, sample_csv_path):
assert detect_header_row(sample_csv_path) == 0
def test_with_junk_rows(self, tmp_path):
f = tmp_path / "junk.csv"
f.write_text("Report generated 2024-01-01\n\nname,email,phone\nAlice,a@b.com,555\n")
# Row 0 has "Report generated..." which is a single non-numeric string
# Row 2 has "name,email,phone" which looks like headers
# The heuristic checks all cells, so row 0 may match if it's a single cell
hdr = detect_header_row(f)
assert hdr in (0, 2) # depends on delimiter detection
class TestReadFile:
def test_read_csv(self, sample_csv_path):
df = read_file(sample_csv_path)
assert isinstance(df, pd.DataFrame)
assert len(df) == 50
assert "customer_name" in df.columns
def test_read_nonexistent(self):
with pytest.raises(FileNotFoundError):
read_file("/tmp/nonexistent_file_xyz.csv")
def test_read_with_encoding_override(self, sample_csv_path):
df = read_file(sample_csv_path, encoding="utf-8")
assert len(df) == 50
def test_chunked_reading(self, sample_csv_path):
chunks = read_file(sample_csv_path, chunk_size=10)
# Should be a generator
all_chunks = list(chunks)
assert len(all_chunks) == 5
total_rows = sum(len(c) for c in all_chunks)
assert total_rows == 50
class TestWriteFile:
def test_write_csv(self, tmp_path, simple_df):
out = tmp_path / "output.csv"
write_file(simple_df, out)
assert out.exists()
# Read back
df = pd.read_csv(out, encoding="utf-8-sig")
assert len(df) == len(simple_df)
def test_write_xlsx(self, tmp_path, simple_df):
out = tmp_path / "output.xlsx"
write_file(simple_df, out)
assert out.exists()
df = pd.read_excel(out)
assert len(df) == len(simple_df)
def test_utf8_bom_default(self, tmp_path, simple_df):
out = tmp_path / "bom.csv"
write_file(simple_df, out)
raw = out.read_bytes()
assert raw[:3] == b"\xef\xbb\xbf"
class TestListSheets:
def test_list_sheets(self, tmp_path, simple_df):
path = tmp_path / "multi.xlsx"
with pd.ExcelWriter(path, engine="openpyxl") as writer:
simple_df.to_excel(writer, sheet_name="Sheet1", index=False)
simple_df.to_excel(writer, sheet_name="Sheet2", index=False)
sheets = list_sheets(path)
assert sheets == ["Sheet1", "Sheet2"]

158
tests/test_normalizers.py Normal file
View File

@@ -0,0 +1,158 @@
"""Tests for src.core.normalizers."""
import pytest
from src.core.normalizers import (
NormalizerType,
get_normalizer,
normalize_email,
normalize_phone,
normalize_name,
normalize_address,
normalize_string,
)
class TestNormalizeEmail:
def test_basic_lowercase(self):
assert normalize_email("John@Example.COM") == "john@example.com"
def test_strip_whitespace(self):
assert normalize_email(" alice@test.com ") == "alice@test.com"
def test_strip_gmail_dots(self):
assert normalize_email("j.o.h.n@gmail.com") == "john@gmail.com"
def test_strip_plus_tag(self):
assert normalize_email("alice+promo@test.com") == "alice@test.com"
def test_gmail_dots_and_plus(self):
assert normalize_email("j.smith+tag@gmail.com") == "jsmith@gmail.com"
def test_non_gmail_keeps_dots(self):
assert normalize_email("j.smith@company.com") == "j.smith@company.com"
def test_empty(self):
assert normalize_email("") == ""
assert normalize_email(None) == ""
def test_no_at_sign(self):
assert normalize_email("not-an-email") == "not-an-email"
def test_idempotent(self):
result = normalize_email("J.Smith+tag@Gmail.com")
assert normalize_email(result) == result
class TestNormalizePhone:
def test_us_formatted(self):
assert normalize_phone("(555) 123-4567") == "+15551234567"
def test_dashes(self):
assert normalize_phone("555-123-4567") == "+15551234567"
def test_dots(self):
assert normalize_phone("555.123.4567") == "+15551234567"
def test_with_country_code(self):
assert normalize_phone("+1 555-123-4567") == "+15551234567"
def test_digits_only_input(self):
assert normalize_phone("5551234567") == "+15551234567"
def test_empty(self):
assert normalize_phone("") == ""
assert normalize_phone(None) == ""
def test_invalid_fallback_digits(self):
# Very short number that phonenumbers rejects
result = normalize_phone("123")
assert result == "123"
def test_idempotent(self):
result = normalize_phone("(555) 123-4567")
assert normalize_phone(result) == result
class TestNormalizeName:
def test_strip_mr(self):
assert normalize_name("Mr. John Smith") == "john smith"
def test_strip_dr(self):
assert normalize_name("Dr. Jane Doe") == "jane doe"
def test_strip_suffix(self):
assert normalize_name("Robert Brown Jr.") == "robert brown"
def test_strip_numeral_suffix(self):
assert normalize_name("James Wilson III") == "james wilson"
def test_title_and_suffix(self):
assert normalize_name("Dr. Michael Williams III") == "michael williams"
def test_collapse_whitespace(self):
assert normalize_name(" John Smith ") == "john smith"
def test_case_fold(self):
assert normalize_name("JOHN SMITH") == "john smith"
def test_empty(self):
assert normalize_name("") == ""
assert normalize_name(None) == ""
def test_idempotent(self):
result = normalize_name("Mr. John Smith Jr.")
assert normalize_name(result) == result
class TestNormalizeAddress:
def test_street_abbreviation(self):
assert normalize_address("123 Main Street") == "123 main st"
def test_avenue_abbreviation(self):
assert normalize_address("456 Oak Avenue") == "456 oak ave"
def test_boulevard_abbreviation(self):
assert normalize_address("789 Pine Boulevard") == "789 pine blvd"
def test_apartment(self):
assert normalize_address("123 Main St Apartment 4") == "123 main st apt 4"
def test_direction(self):
assert normalize_address("111 First Street North") == "111 first st n"
def test_collapse_whitespace(self):
assert normalize_address(" 123 Main Street ") == "123 main st"
def test_empty(self):
assert normalize_address("") == ""
assert normalize_address(None) == ""
def test_idempotent(self):
result = normalize_address("123 Main Street Apartment 4")
assert normalize_address(result) == result
class TestNormalizeString:
def test_trim_and_casefold(self):
assert normalize_string(" Hello World ") == "hello world"
def test_collapse_whitespace(self):
assert normalize_string("a b c") == "a b c"
def test_empty(self):
assert normalize_string("") == ""
assert normalize_string(None) == ""
class TestGetNormalizer:
def test_get_by_enum(self):
fn = get_normalizer(NormalizerType.EMAIL)
assert fn("TEST@Gmail.com") == "test@gmail.com"
def test_get_by_string(self):
fn = get_normalizer("phone")
assert fn("(555) 123-4567") == "+15551234567"
def test_unknown_raises(self):
with pytest.raises(ValueError):
get_normalizer("unknown_type")