feat: add documentation, Streamlit GUI, and full source tree

- Rewrite README.md with project overview, quick-start, and CLI summary
- Add docs/CLI-REFERENCE.md with full flag reference and 8 recipe sections
- Add docs/DEVELOPER.md with architecture, data flow, and extension guides
- Rewrite src/core/__init__.py with public API exports and module docstring
- Add Streamlit GUI (src/gui/) with file upload, advanced options, interactive match group review with side-by-side diff, and download buttons
- Add .gitignore, requirements.txt, all source code, tests, and sample data
- Add streamlit to requirements.txt

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
284
docs/CLI-REFERENCE.md
Normal file
@@ -0,0 +1,284 @@
# CLI Reference

Complete command-line reference for the DataTools Deduplicator.

```
python -m src.cli INPUT_FILE [OPTIONS]
```

## Arguments

| Argument | Required | Description |
|----------|----------|-------------|
| `INPUT_FILE` | Yes | Path to the CSV or Excel file to deduplicate |

## Options

### Core

| Flag | Short | Default | Description |
|------|-------|---------|-------------|
| `--apply` | | `false` | Write output files. Without this flag, only a preview is shown. |
| `--output` | `-o` | `{input}_deduplicated.csv` | Output file path. |

### Column Selection

| Flag | Short | Default | Description |
|------|-------|---------|-------------|
| `--subset` | `-s` | auto-detect | Comma-separated columns to match on. When omitted, columns are auto-detected by name pattern (email, phone, name, address). |
| `--key` | `-k` | none | Comma-separated strong-key columns. Each becomes an independent exact-match strategy. Use for identifiers like `fb_id`, `ein`, `sku`. |

### Fuzzy Matching

| Flag | Short | Default | Description |
|------|-------|---------|-------------|
| `--fuzzy` | | none | Comma-separated columns to fuzzy-match. Other columns in the strategy use exact matching. |
| `--algorithm` | `-a` | `jaro_winkler` | Fuzzy algorithm: `levenshtein`, `jaro_winkler`, or `token_set_ratio`. |
| `--threshold` | `-t` | `85` | Similarity threshold 0-100. Lower values find more matches but increase false positives. |

### Normalization

| Flag | Short | Default | Description |
|------|-------|---------|-------------|
| `--normalize` | | auto-detect | Column normalizers as `col:type` pairs, comma-separated. Types: `email`, `phone`, `name`, `address`, `string`. |

**Normalizer details:**

| Type | What it does | Example |
|------|-------------|---------|
| `email` | Lowercase, strip Gmail dots, strip `+tag` suffixes | `John.Doe+tag@gmail.com` → `johndoe@gmail.com` |
| `phone` | Parse to E.164 format; fallback: digits only | `(555) 123-4567` → `+15551234567` |
| `name` | Strip titles (Dr., Mr.) and suffixes (Jr., PhD), case-fold | `Dr. John Smith Jr.` → `john smith` |
| `address` | USPS abbreviations (Street→St, Avenue→Ave), case-fold | `123 Main Street, Suite 4` → `123 main st ste 4` |
| `string` | Trim, collapse whitespace, case-fold | ` HELLO WORLD ` → `hello world` |
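
As a rough illustration of the `email` and `string` rows above, the following sketch reproduces the two documented examples using only the standard library. The function names are hypothetical, and two behaviors are assumptions rather than documented facts: only `gmail.com` and `googlemail.com` are treated as Gmail domains, and the `+tag` suffix is stripped for every domain. The project's own normalizers may differ.

```python
# Assumption: which domains count as "Gmail" for dot stripping.
GMAIL_DOMAINS = {"gmail.com", "googlemail.com"}


def sketch_normalize_string(value):
    """Trim, collapse runs of whitespace, and case-fold."""
    if not isinstance(value, str):
        return ""
    return " ".join(value.split()).casefold()


def sketch_normalize_email(value):
    """Lowercase, drop any +tag suffix, and drop dots for Gmail addresses."""
    cleaned = sketch_normalize_string(value)
    local, sep, domain = cleaned.rpartition("@")
    if not sep or not local or not domain:
        return cleaned  # not a recognizable address; return the cleaned text
    local = local.split("+", 1)[0]  # assumption: applies to every domain
    if domain in GMAIL_DOMAINS:
        local = local.replace(".", "")
    return f"{local}@{domain}"
```

Both helpers return an empty string for `None`, which mirrors the None-safe behavior the normalizers are described as having.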

### Survivor Selection

| Flag | Short | Default | Description |
|------|-------|---------|-------------|
| `--survivor` | | `first` | Which row to keep per duplicate group: `first`, `last`, `most-complete`, or `most-recent`. |
| `--date-column` | | none | Date column for the `most-recent` rule. |
| `--merge` | | `false` | Fill missing fields in the surviving row from removed duplicates. |

**Survivor rules:**

| Rule | Behavior |
|------|----------|
| `first` | Keep the first row encountered (lowest row number) |
| `last` | Keep the last row encountered (highest row number) |
| `most-complete` | Keep the row with the fewest blank/empty cells |
| `most-recent` | Keep the row with the latest date (requires `--date-column`) |
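
The four rules can be sketched over a list of row dictionaries, as below. This is an illustration only: `pick_survivor` and `is_blank` are hypothetical names, dates are assumed to be ISO-8601 strings, blank dates are assumed to sort as oldest, and ties are assumed to go to the earliest row.

```python
from datetime import datetime


def is_blank(value):
    """Treat None and whitespace-only strings as blank."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def pick_survivor(rows, rule="first", date_column=None):
    """Return the index (within the group) of the row to keep."""
    positions = range(len(rows))
    if rule == "first":
        return 0
    if rule == "last":
        return len(rows) - 1
    if rule == "most-complete":
        # min() keeps the earliest row when several rows tie.
        return min(positions, key=lambda i: sum(is_blank(v) for v in rows[i].values()))
    if rule == "most-recent":
        if date_column is None:
            raise ValueError("most-recent requires a date column")

        def parsed(i):
            raw = rows[i].get(date_column)
            return datetime.min if is_blank(raw) else datetime.fromisoformat(raw)

        return max(positions, key=parsed)
    raise ValueError(f"unknown survivor rule: {rule}")
```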

### Interactive Review

| Flag | Short | Default | Description |
|------|-------|---------|-------------|
| `--review` | | `false` | Interactively review each match group. For each group, choose: merge (y), keep both (n), or skip remaining (s). |

### Configuration

| Flag | Short | Default | Description |
|------|-------|---------|-------------|
| `--config` | | none | Load all settings from a saved JSON config file. |
| `--save-config` | | none | Save current settings to a JSON config file for reuse. |

### File Handling

| Flag | Short | Default | Description |
|------|-------|---------|-------------|
| `--sheet` | | first sheet | Excel sheet name or 0-based index. Ignored for CSV files. |
| `--encoding` | | auto-detect | Override auto-detected file encoding (e.g., `utf-8`, `windows-1252`). |
| `--header-row` | | auto-detect | 0-based row index for the header row. |

---

## Recipes

### 1. Basic Dedup (Auto-Detect)

Let the engine detect email, phone, name, and address columns automatically.

```bash
# Preview
python -m src.cli customers.csv

# Apply
python -m src.cli customers.csv --apply
```

The engine scans column names for patterns like `email`, `phone`, `name`, `address` and builds strategies automatically. Strong keys (email, phone) become standalone strategies; weak keys (name, address) are paired with strong keys.

### 2. Fuzzy Name Matching

Match rows where names are similar but not identical. This catches typos, nickname variations, and formatting differences.

```bash
# Fuzzy-match on the "name" column at 80% similarity
python -m src.cli customers.csv --fuzzy name --threshold 80 --apply

# Fuzzy-match on multiple columns
python -m src.cli customers.csv --fuzzy name,address --threshold 85 --apply

# Use Levenshtein distance instead of Jaro-Winkler
python -m src.cli customers.csv --fuzzy name --algorithm levenshtein --threshold 80 --apply
```

**Algorithm comparison:**
- `jaro_winkler` (default): best for short strings like names; weights early characters more heavily
- `levenshtein`: edit-distance ratio; works well for typos (inserted, deleted, or substituted characters). Plain Levenshtein counts a transposition as two edits.
- `token_set_ratio`: best for addresses and long strings; ignores word order
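
To make the threshold concrete, here is a minimal pure-Python sketch of an edit-distance similarity on a 0-100 scale. It is not the tool's implementation: the function names are hypothetical, and dividing by the longer string's length is just one common way to normalize (libraries differ on this).

```python
def levenshtein_distance(a, b):
    """Count the single-character inserts, deletes, and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete from a
                current[j - 1] + 1,      # insert into a
                previous[j - 1] + cost,  # substitute (or keep)
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Scale the distance to 0-100, where 100 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (longest - levenshtein_distance(a, b)) / longest
```

Under this normalization, `john smith` vs `jon smith` is one edit over ten characters and scores 90, which clears `--threshold 80`. A transposition such as `smith` vs `simth` costs two edits and scores 60.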

### 3. Custom Strong Keys

Use specific identifier columns to find exact duplicates.

```bash
# Deduplicate by Facebook ID
python -m src.cli donors.csv --key fb_id --apply

# Multiple strong keys (each is independent, matched with OR)
python -m src.cli donors.csv --key fb_id,ein --apply
```

Strong keys are OR'd: a match on `fb_id` alone OR `ein` alone marks rows as duplicates.

### 4. Merge Mode

Keep the most complete row and fill any remaining blanks from the duplicates.

```bash
# Most complete row + merge missing fields
python -m src.cli contacts.csv --survivor most-complete --merge --apply

# Keep most recent row and merge
python -m src.cli contacts.csv --survivor most-recent --date-column updated_at --merge --apply
```

**How merge works:** The survivor row keeps all its non-empty fields. For any blank/null fields, the engine fills from the removed rows (in row order). The result is a single row with maximum data retention.
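
The fill order described above can be sketched with plain dictionaries. The names `merge_rows` and `is_blank` are hypothetical, and treating whitespace-only strings as blank is an assumption.

```python
def is_blank(value):
    """Treat None and whitespace-only strings as blank."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def merge_rows(survivor, removed_rows):
    """Fill the survivor's blank fields from the removed rows, in row order."""
    merged = dict(survivor)  # never mutate the caller's row
    for row in removed_rows:
        for column, value in row.items():
            if is_blank(merged.get(column)) and not is_blank(value):
                merged[column] = value  # first non-blank value wins
    return merged
```

Because a field is filled only while it is still blank, an earlier removed row takes precedence over a later one, and the survivor's own values are never overwritten.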

### 5. Multi-Column Subset

Match on a specific combination of columns rather than auto-detecting.

```bash
# Exact match on email + phone only
python -m src.cli customers.csv --subset email,phone --apply

# Mix exact and fuzzy within a subset
python -m src.cli customers.csv --subset email,name --fuzzy name --threshold 85 --apply
```

When using `--subset`, all listed columns must match (AND logic) for a pair to be considered duplicates.
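
The difference between `--key` (OR across columns) and `--subset` (AND across columns) can be pictured as a list of strategies, each of which is a tuple of columns. The sketch below uses exact comparison only; `rows_match` is a hypothetical helper, and the rule that blank values never match is an assumption.

```python
def rows_match(row_a, row_b, strategies):
    """True when any strategy matches; a strategy needs all its columns equal."""

    def columns_agree(columns):
        return all(
            row_a.get(col) not in (None, "") and row_a.get(col) == row_b.get(col)
            for col in columns
        )

    return any(columns_agree(columns) for columns in strategies)


# --key fb_id,ein      -> two independent single-column strategies (OR)
KEY_STRATEGIES = [("fb_id",), ("ein",)]
# --subset email,phone -> one strategy that requires both columns (AND)
SUBSET_STRATEGIES = [("email", "phone")]
```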

### 6. Save and Load Config Profiles

Save your settings for repeatable runs on similar files.

```bash
# Save settings to a file
python -m src.cli customers.csv --fuzzy name --threshold 80 --merge \
    --survivor most-complete --save-config customer_dedup.json

# Load saved settings
python -m src.cli new_customers.csv --config customer_dedup.json --apply
```

Config files are JSON. Example:

```json
{
  "strategies": [],
  "survivor_rule": "most_complete",
  "merge": true,
  "default_algorithm": "jaro_winkler",
  "default_threshold": 80.0,
  "fuzzy_columns": ["name"]
}
```

### 7. Interactive Review

Step through each match group and decide whether to merge.

```bash
python -m src.cli customers.csv --review --apply
```

For each group, the CLI displays both rows side-by-side and prompts:

```
============================================================
Match Group 1 — Confidence: 92.3%
Matched on: name, phone
============================================================

Row 1:
  name: John Smith
  email: john@example.com
  phone: (555) 123-4567

Row 2:
  name: Jon Smith
  email:
  phone: 555-123-4567

[y] Merge [n] Keep both [s] Skip remaining:
```

- **y**: accept the match; merge/remove duplicate
- **n**: reject the match; keep both rows
- **s**: skip all remaining groups (keep both for all)

### 8. Excel Files and Multi-Sheet

Work with Excel files directly; no CSV conversion is needed.

```bash
# Deduplicate first sheet (default)
python -m src.cli data.xlsx --apply

# Specify sheet by name
python -m src.cli data.xlsx --sheet "Sales Data" --apply

# Specify sheet by index (0-based)
python -m src.cli data.xlsx --sheet 1 --apply
```

Output is CSV by default, even when the input is an Excel file. To write Excel output, use `-o`:

```bash
python -m src.cli data.xlsx -o cleaned.xlsx --apply
```

---

## Auto-Detection Details

When no `--subset` or `--fuzzy` flags are provided, the engine scans column names and builds strategies:

| Column pattern | Detection regex | Algorithm | Threshold | Normalizer | Key type |
|---------------|----------------|-----------|-----------|------------|----------|
| Email | `e[-_]?mail` | exact | 100% | email | strong |
| Phone | `phone\|telephone\|mobile\|cell` | exact | 100% | phone | strong |
| Name | `^(name\|full_name\|customer_name\|...)$` | jaro_winkler | 85% | name | weak |
| Address | `address\|street\|addr` | token_set_ratio | 80% | address | weak |

**Strategy building rules:**
- Strong keys → standalone OR strategies (email match alone is enough)
- Weak keys → paired with each strong key via AND (name match requires email or phone match too)
- No strong keys found → weak keys promoted to standalone
- No patterns matched → exact match on all columns (equivalent to `drop_duplicates`)
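
The four rules above can be sketched as a small function that classifies column names and emits strategies as tuples of columns. This is an illustration under assumptions, not the engine's `build_default_strategies()`: the names are hypothetical, algorithms and thresholds are omitted, and the name pattern uses only the three alternatives visible in the table because the rest of that regex is elided.

```python
import re

# (pattern, is_strong_key), in the order the table lists them.
PATTERNS = [
    (re.compile(r"e[-_]?mail", re.I), True),
    (re.compile(r"phone|telephone|mobile|cell", re.I), True),
    (re.compile(r"^(name|full_name|customer_name)$", re.I), False),  # partial list
    (re.compile(r"address|street|addr", re.I), False),
]


def sketch_build_strategies(columns):
    """Return strategies as column tuples: AND within a tuple, OR across tuples."""
    strong, weak = [], []
    for col in columns:
        for pattern, is_strong in PATTERNS:
            if pattern.search(col):
                (strong if is_strong else weak).append(col)
                break  # first matching pattern decides the column type
    if not strong and not weak:
        return [tuple(columns)]          # exact match on all columns
    if not strong:
        return [(col,) for col in weak]  # weak keys promoted to standalone
    strategies = [(col,) for col in strong]
    strategies += [(w, s) for w in weak for s in strong]
    return strategies
```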

## Output Files

When `--apply` is set, three files are written:

| File | Description |
|------|-------------|
| `{stem}_deduplicated.csv` | Cleaned DataFrame with duplicates removed |
| `{stem}_removed.csv` | Rows that were removed |
| `{stem}_match_groups.csv` | Audit trail with `_group_id`, `_is_survivor`, `_confidence`, `_matched_on`, `_original_row`, plus all original columns |

## Logging

Every run writes a timestamped log to `logs/dedup_YYYYMMDD_HHMMSS.log` with full debug-level details: strategies used, pair comparisons, survivor decisions, and merge actions.
282
docs/DEVELOPER.md
Normal file
@@ -0,0 +1,282 @@
# Developer Guide

Architecture, data flow, and extension guide for the DataTools Deduplicator.

## Architecture

```
CLI (src/cli.py)                 GUI (src/gui/app.py)
    │                                    │
    │  flags → strategies                │  widgets → strategies
    │  _interactive_review()             │  match_group_card()
    │  tqdm progress bar                 │  st.progress()
    │                                    │
    └──────────┐        ┌────────────────┘
               │        │
               ▼        ▼
           ┌─────────────────┐
           │   core.dedup    │
           │  deduplicate()  │
           └────────┬────────┘
                    │
       ┌────────────┼────────────┐
       ▼            ▼            ▼
    core.io  core.normalizers  core.config
    read/write normalize_*()   save/load JSON
```

**Key principle:** All business logic lives in `src/core/`. The CLI and GUI are thin wrappers that translate user input into `deduplicate()` arguments and display the `DeduplicationResult`.

## File-by-File Reference

### src/core/dedup.py: Deduplication Engine

The central module. Contains:

- **Enums:** `Algorithm` (4 fuzzy algorithms), `SurvivorRule` (4 selection rules)
- **Data classes:** `ColumnMatchStrategy`, `MatchStrategy`, `MatchResult`, `DeduplicationResult`
- **`deduplicate()`**: main entry point. Takes a DataFrame + optional strategies/rules, returns a `DeduplicationResult` with deduplicated DataFrame, removed rows, match groups, and log entries.
- **`build_default_strategies()`**: scans column names with regex patterns to auto-detect email, phone, name, and address columns. Builds strong/weak key strategies with appropriate algorithms and normalizers.
- **`_UnionFind`**: disjoint-set data structure for transitive closure. If A matches B and B matches C, all three end up in one group.
- **`_find_match_groups()`**: O(n^2) pairwise comparison. For each pair, tries all strategies (OR semantics). Feeds matches into union-find. Returns match groups with confidence scores.
- **`_select_survivor()`**: picks the row to keep based on the survivor rule.
- **`_merge_group()`**: fills blank fields in the survivor from loser rows.
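
The interaction between pairwise comparison and union-find described in the list above can be sketched as follows. This is a simplified stand-in, not the module's code: `UnionFind`, `find_groups`, and the `is_match` callback are hypothetical, and confidence scores are left out.

```python
from itertools import combinations


class UnionFind:
    """Disjoint-set structure with path halving."""

    def __init__(self, size):
        self.parent = list(range(size))

    def find(self, item):
        while self.parent[item] != item:
            self.parent[item] = self.parent[self.parent[item]]  # path halving
            item = self.parent[item]
        return item

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            # Keep the lowest row index as the representative of the set.
            self.parent[max(root_a, root_b)] = min(root_a, root_b)


def find_groups(rows, is_match):
    """Compare every pair once and return groups of two or more row indices."""
    sets = UnionFind(len(rows))
    for i, j in combinations(range(len(rows)), 2):  # n * (n - 1) / 2 pairs
        if is_match(rows[i], rows[j]):
            sets.union(i, j)
    groups = {}
    for index in range(len(rows)):
        groups.setdefault(sets.find(index), []).append(index)
    return [members for members in groups.values() if len(members) > 1]
```

With `rows = [1, 2, 3, 10]` and a matcher that accepts values differing by exactly 1, rows 0 and 2 never match directly but still land in the same group through row 1.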

### src/core/normalizers.py: Text Normalization

Five normalizer functions, each `str → str`, idempotent, None-safe:

- **`normalize_email()`**: lowercase, strip Gmail dots, strip `+tag` suffixes
- **`normalize_phone()`**: parse with `phonenumbers` to E.164; fallback to digits-only
- **`normalize_name()`**: strip title prefixes (Dr., Mr.) and suffixes (Jr., PhD), case-fold
- **`normalize_address()`**: USPS abbreviations (Street→St, Avenue→Ave), case-fold
- **`normalize_string()`**: trim, collapse whitespace, case-fold

The `get_normalizer()` registry function maps `NormalizerType` enum values to functions.

### src/core/io.py: File I/O

Auto-detection stack:

1. **`detect_encoding()`**: checks BOM, then uses `charset-normalizer` heuristics
2. **`detect_delimiter()`**: uses `csv.Sniffer` on first 20 lines
3. **`detect_header_row()`**: finds first row where all cells look like column names
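
The first two steps of that stack can be approximated with the standard library alone. In this sketch the function names, the candidate delimiter set, and the fallbacks (`utf-8` and `,`) are assumptions; the `charset-normalizer` step is deliberately left out and replaced by a fixed default.

```python
import codecs
import csv

# UTF-32 is checked before UTF-16 because its little-endian BOM starts
# with the same two bytes as the UTF-16 little-endian BOM.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_encoding(raw, default="utf-8"):
    """Return an encoding name from the byte-order mark, else the default."""
    for bom, name in BOMS:
        if raw.startswith(bom):
            return name
    return default  # the real tool would run heuristics at this point


def sniff_delimiter(text, max_lines=20):
    """Guess the delimiter from the first few lines, falling back to a comma."""
    sample = "\n".join(text.splitlines()[:max_lines])
    try:
        return csv.Sniffer().sniff(sample, delimiters=",;\t|").delimiter
    except csv.Error:
        return ","
```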

Main functions:
- **`read_file()`**: reads CSV/TSV/Excel with full auto-detection. Returns a DataFrame.
- **`write_file()`**: writes DataFrame to CSV or Excel. Uses `utf-8-sig` by default for Windows Excel compatibility.
- **`list_sheets()`**: returns sheet names from an Excel workbook.

### src/core/config.py: Configuration Profiles

Save/load deduplication settings as JSON:

- **`DeduplicationConfig`**: flat dataclass with all settings: strategies, survivor rule, merge flag, algorithm, threshold, normalizer map.
- **`.to_file()` / `.from_file()`**: JSON serialization
- **`.to_strategies()`**: converts config back to `MatchStrategy` objects for the engine
- **`.to_survivor_rule()`**: converts string to `SurvivorRule` enum

### src/cli.py: Command-Line Interface

Typer-based CLI with 17 options. Key responsibilities:

- Parse flags into strategies, survivor rule, and other config
- Set up logging (timestamped log files in `logs/`)
- Column name validation with fuzzy suggestions on typos
- `_interactive_review()`: side-by-side row display with y/n/s prompts
- Progress bar via `tqdm` for files > 10,000 rows
- Output formatting and file writing

### src/gui/app.py: Streamlit GUI

Single-page layout:
- File upload with instant preview
- Advanced options expander (column selection, fuzzy, normalizers, survivor rule, merge, config profiles)
- Find Duplicates button → runs `deduplicate()` with `progress_callback`
- Interactive review: expandable match group cards with merge/keep/skip buttons
- Download buttons for deduplicated CSV, removed rows, and match groups report

### src/gui/components.py: Reusable GUI Widgets

- **`match_group_card()`**: expandable card showing side-by-side row comparison with diff highlighting
- **`config_panel()`**: the advanced options expander, returns a `DeduplicationConfig`
- **`results_summary()`**: summary stats and download buttons

## Data Flow

```
Input File
    │
    ▼
read_file()                   ← auto-detect encoding, delimiter, header
    │
    ▼
DataFrame
    │
    ▼
build_default_strategies()    ← (if no explicit strategies)
    │   scan column names → regex patterns
    │   strong keys: email, phone (standalone OR)
    │   weak keys: name, address (AND with strong)
    ▼
_apply_normalizations()       ← add _norm_* shadow columns
    │   normalize_email(), normalize_phone(), etc.
    ▼
_find_match_groups()          ← O(n²) pairwise comparison
    │   for each pair: try all strategies (OR)
    │   _compute_similarity() per column
    │   union-find for transitive closure
    ▼
[review_callback()]           ← optional: interactive review per group
    │   True=accept, False=reject, None=skip
    ▼
_select_survivor()            ← per group: first/last/most-complete/most-recent
    │
    ▼
[_merge_group()]              ← optional: fill blanks from losers
    │
    ▼
DeduplicationResult
    ├── deduplicated_df       ← cleaned DataFrame (shadow cols dropped)
    ├── removed_df            ← rows that were removed
    ├── match_groups          ← list of MatchResult with confidence, columns
    └── log_entries           ← human-readable audit log
```
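
The tri-state contract of the optional review step (`True`=accept, `False`=reject, `None`=skip) might be consumed along these lines. The helper names are hypothetical, and the reading that `None` stops the review and leaves every remaining group unmerged is taken from the CLI's "skip remaining" prompt; the engine's actual loop may be structured differently.

```python
def answer_to_decision(answer):
    """Map the CLI's y/n/s answers onto the callback's True/False/None."""
    return {"y": True, "n": False, "s": None}[answer.strip().lower()]


def apply_review(groups, review_callback):
    """Return only the groups the reviewer accepted."""
    accepted = []
    for group in groups:
        decision = review_callback(group)
        if decision is None:
            break  # skip remaining: later groups stay unmerged
        if decision:
            accepted.append(group)
    return accepted
```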

## How to Add a Normalizer

1. **Add the function** in `src/core/normalizers.py`:

```python
# Imports this snippet relies on (skip any already present in normalizers.py)
import re
from typing import Optional


def normalize_company(value: Optional[str]) -> str:
    """Strip legal suffixes (Inc, LLC, Corp), case-fold."""
    if not value or not isinstance(value, str):
        return ""
    name = value.strip().casefold()
    # Strip common suffixes
    for suffix in ("inc", "llc", "corp", "ltd", "co"):
        name = re.sub(rf"\b{suffix}\.?\s*$", "", name).strip()
    return name
```

2. **Register it** in the same file:

```python
class NormalizerType(str, Enum):
    # ... existing types ...
    COMPANY = "company"  # ← add enum value

_NORMALIZER_MAP: dict[NormalizerType, Callable[[str], str]] = {
    # ... existing entries ...
    NormalizerType.COMPANY: normalize_company,  # ← add mapping
}
```

3. **Add auto-detection pattern** in `src/core/dedup.py` (optional):

```python
_COLUMN_TYPE_PATTERNS = [
    # ... existing patterns ...
    (re.compile(r"company|organization|org_name", re.I),
     NormalizerType.COMPANY, Algorithm.TOKEN_SET_RATIO, 85.0, False),
]
```

## How to Add a Matching Algorithm

1. **Add the enum value** in `src/core/dedup.py`:

```python
class Algorithm(str, Enum):
    # ... existing values ...
    SOUNDEX = "soundex"
```

2. **Add the computation** in `_compute_similarity()`:

```python
def _compute_similarity(val_a: str, val_b: str, algorithm: Algorithm) -> float:
    # ... existing cases ...
    if algorithm == Algorithm.SOUNDEX:
        return 100.0 if _soundex(val_a) == _soundex(val_b) else 0.0
```

3. **Add the CLI flag value** in `src/cli.py` help text for `--algorithm`.

## How to Add a Survivor Strategy

1. **Add the enum value** in `src/core/dedup.py`:

```python
class SurvivorRule(str, Enum):
    # ... existing values ...
    KEEP_LONGEST = "longest"
```

2. **Add the logic** in `_select_survivor()`:

```python
if rule == SurvivorRule.KEEP_LONGEST:
    return max(indices, key=lambda i: len(str(df.iloc[i].to_dict())))
```

3. **Add to the CLI** survivor map in `src/cli.py`.

## Testing

### Run Tests

```bash
# All tests
pytest tests/ -q

# Specific module
pytest tests/test_dedup.py -q
pytest tests/test_normalizers.py -q
pytest tests/test_io.py -q
pytest tests/test_config.py -q
pytest tests/test_cli.py -q

# Verbose with output
pytest tests/ -v

# Stop on first failure
pytest tests/ -x
```

### Test Structure

```
tests/
├── conftest.py              # Shared fixtures
│   ├── sample_csv_path      # Path to samples/messy_sales.csv
│   ├── sample_df            # Loaded sample CSV as DataFrame
│   ├── simple_df            # Small 5-row DataFrame with obvious duplicates
│   ├── merge_df             # DataFrame with partial records
│   └── tmp_csv              # Temporary CSV from simple_df
├── test_dedup.py            # Engine tests: similarity, union-find, pairs, integration
├── test_normalizers.py      # Normalizer tests: all 5 types with edge cases
├── test_io.py               # I/O tests: encoding, delimiter, header, read/write
├── test_config.py           # Config tests: serialization round-trip
└── test_cli.py              # CLI tests: argument parsing, file handling
```

### Writing Tests

Follow existing patterns. Tests use pytest fixtures from `conftest.py`:

```python
def test_my_feature(simple_df):
    """Test description."""
    result = deduplicate(simple_df, ...)
    assert len(result.match_groups) == expected
    assert result.deduplicated_df.shape[0] == expected_rows
```

## Known Limitations

- **O(n^2) pairwise comparison**: no blocking or indexing. Works well up to ~50,000 rows. Beyond that, performance degrades quadratically. Future optimization: add blocking (partition by first letter, zip code prefix, etc.) to reduce comparison space.
- **No multi-sheet dedup**: each Excel sheet is processed independently. Cross-sheet deduplication is not supported.
- **Phone normalization requires valid-length numbers**: the `phonenumbers` library rejects numbers that are too short or too long for the detected region. Fallback is digits-only, which may produce false negatives for international numbers without country codes.
- **Single-threaded**: no parallel comparison. Could benefit from `multiprocessing` for large files.
- **Memory-bound**: entire file is loaded into a pandas DataFrame. Files larger than available RAM will fail. Chunked reading exists but is not integrated with the dedup engine.
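
To give a sense of the scale behind the first limitation and of what blocking would change, here is a small sketch. Blocking is not implemented in the engine; `blocked_pairs` and the choice of block key are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def all_pairs_count(n):
    """Number of comparisons when every row is compared with every other row."""
    return n * (n - 1) // 2


def blocked_pairs(rows, block_key):
    """Yield index pairs only for rows that share the same block key."""
    blocks = defaultdict(list)
    for index, row in enumerate(rows):
        blocks[block_key(row)].append(index)
    for members in blocks.values():
        yield from combinations(members, 2)
```

At 50,000 rows, `all_pairs_count` gives 1,249,975,000 comparisons. Blocking on, say, the first letter of a name cuts that down sharply, at the cost of never comparing rows that fall into different blocks (for example `Robert` and `Bob`).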