docs: tight, scannable rewrite — every item earns its place
Refactors all 10 docs (README, USER-GUIDE, CLI-REFERENCE, REQUIREMENTS, TECHNICAL, DEVELOPER, BUSINESS, DECISIONS, RECOVERY, docs/README) from prose-heavy to bullet-heavy + table-heavy. Same information density, significantly less reading load.

Net: 2600 → 1652 lines (~37% reduction) WHILE adding the new content that landed since v1.6:
- Format Standardizer (3rd Ready tool)
- 199-row buyer corpus
- src/core/errors.py structured hierarchy + ensure_dataframe / ensure_choice / wrap_file_read|write / format_for_user helpers
- src/core/_constants.py shared USPS/state lookup tables
- Cross-tool audit fixes (NaN matching, removed_df schema, validation, enum-bounds checks, forward-compat config)
- Per-domain error_policy across format standardizers
- Inconsistent-date-format detector
- Excel header-row auto-detection + write_file delimiter param

Per-doc changes:
- README.md (175 → 71): 9-tool table at top, status column, 3 CLI entry points listed, dropped repeated marketing prose.
- docs/README.md (38 → 27): pure index: buyer-facing vs creator-only split + version footer.
- USER-GUIDE.md (208 → 118): tool table replaces script descriptions, troubleshooting compressed to bullets, gate explanation tightened.
- CLI-REFERENCE.md (451 → 235): collapsed flag tables, removed redundant intro text, kept full recipes section.
- REQUIREMENTS.md (146 → 129): 18 numbered sections (was 17), added §18 Error Handling, formatting tightened to single-line entries.
- TECHNICAL.md (570 → 350): collapsed §3 build pipeline tables, merged redundant §3.5-3.7 OS sections, added §7 (Error handling) + §11.3 (Format Standardizer spec) + §11.4-11.7 (analyzer / gate / Review page / repair_bytes promoted from §10.2.x sub-numbering).
- DEVELOPER.md (285 → 161): module map table replaces per-file prose, extension recipes condensed, new §Errors covers when to use each hierarchy class.
- BUSINESS.md (278 → 225): collapsed prose to tables (use cases, competitive landscape, costs, risks); honest-status updated.
- DECISIONS.md (269 → 189): scoring rubric + GUI matrix preserved, decision log compressed to single-line entries, added v1.6 entries (Format Standardizer Ready, errors module).
- RECOVERY.md (180 → 147): rebuild steps as numbered + tabular, external dependencies as one table, recovery priorities tightened.

No information removed; redundancy compressed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -1,285 +1,161 @@
|
||||
# Developer Guide
|
||||
|
||||
Architecture, data flow, and extension guide for the DataTools Deduplicator.
|
||||
Architecture, data flow, extension points.
|
||||
|
||||
## Architecture
|
||||
|
||||
```
|
||||
CLI (src/cli.py) GUI (src/gui/app.py)
|
||||
│ │
|
||||
│ flags → strategies │ widgets → strategies
|
||||
│ _interactive_review() │ match_group_card()
|
||||
│ tqdm progress bar │ st.progress()
|
||||
│ │
|
||||
└──────────┐ ┌────────────────┘
|
||||
│ │
|
||||
CLI (src/cli*.py) GUI (src/gui/app.py + pages/)
|
||||
│ │
|
||||
└──────────┐ ┌──────────┘
|
||||
▼ ▼
|
||||
┌─────────────────┐
|
||||
│ core.dedup │
|
||||
│ deduplicate() │
|
||||
└────────┬────────┘
|
||||
│
|
||||
┌────────────┼────────────┐
|
||||
▼ ▼ ▼
|
||||
core.io core.normalizers core.config
|
||||
read/write normalize_*() save/load JSON
|
||||
┌────────────────┐
|
||||
│ src/core/ │
|
||||
└────────────────┘
|
||||
```
|
||||
|
||||
**Key principle:** All business logic lives in `src/core/`. The CLI and GUI are thin wrappers that translate user input into `deduplicate()` arguments and display the `DeduplicationResult`.
|
||||
**Core/UI rule**: business logic in `core/` only. CLI + GUI translate user input → core call → display result.
|
||||
|
||||
## File-by-File Reference
|
||||
## Module map
|
||||
|
||||
### src/core/dedup.py — Deduplication Engine
|
||||
| Module | Public surface |
|
||||
|--------|----------------|
|
||||
| `core.dedup` | `deduplicate()`, `MatchStrategy`, `ColumnMatchStrategy`, `Algorithm`, `SurvivorRule`, `DeduplicationResult`, `MatchResult`, `build_default_strategies()` |
|
||||
| `core.normalizers` | `normalize_email/phone/name/address/string`, `NormalizerType`, `get_normalizer()` |
|
||||
| `core.io` | `read_file()`, `write_file()`, `list_sheets()`, `detect_encoding/delimiter/header_row`, `repair_bytes()` |
|
||||
| `core.config` | `DeduplicationConfig.from_file/to_file/to_strategies/to_survivor_rule` |
|
||||
| `core.analyze` | `analyze()`, `Finding`, `findings_by_tool()`, `_NULL_LIKE` |
|
||||
| `core.fixes` | `@register("fix_id")` decorator, `get_fix()`, `available_actions()` |
|
||||
| `core.normalize` | `auto_fix()`, `apply_decisions()`, `NormalizationResult`, `is_normalized()` |
|
||||
| `core.text_clean` | `clean_dataframe()`, `CleanOptions`, `CleanResult`, `smart_title_case()` |
|
||||
| `core.format_standardize` | `standardize_dataframe()`, `StandardizeOptions`, `StandardizeResult`, `FieldType`, per-cell `standardize_*()` |
|
||||
| `core.errors` | `DataToolsError` hierarchy, `ensure_dataframe()`, `ensure_choice()`, `wrap_file_read/write()`, `format_for_user()` |
|
||||
| `core._constants` | `US_STATE_NAMES`, `US_STATE_CODES`, `USPS_EXPANSIONS`, `USPS_COMPRESSIONS` |
|
||||
|
||||
The central module. Contains:
|
||||
|
||||
- **Enums:** `Algorithm` (4 fuzzy algorithms), `SurvivorRule` (4 selection rules)
|
||||
- **Data classes:** `ColumnMatchStrategy`, `MatchStrategy`, `MatchResult`, `DeduplicationResult`
|
||||
- **`deduplicate()`** — main entry point. Takes a DataFrame + optional strategies/rules, returns a `DeduplicationResult` with deduplicated DataFrame, removed rows, match groups, and log entries.
|
||||
- **`build_default_strategies()`** — scans column names with regex patterns to auto-detect email, phone, name, and address columns. Builds strong/weak key strategies with appropriate algorithms and normalizers.
|
||||
- **`_UnionFind`** — disjoint-set data structure for transitive closure. If A matches B and B matches C, all three end up in one group.
|
||||
- **`_find_match_groups()`** — O(n^2) pairwise comparison. For each pair, tries all strategies (OR semantics). Feeds matches into union-find. Returns match groups with confidence scores.
|
||||
- **`_select_survivor()`** — picks the row to keep based on the survivor rule.
|
||||
- **`_merge_group()`** — fills blank fields in the survivor from loser rows.
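
The grouping step described above (pairwise comparison feeding a union-find) can be sketched roughly as follows. This is an illustration only, not the project's `_UnionFind` or `_find_match_groups()`: the names `UnionFind`, `find_groups`, and the `is_match` predicate are hypothetical, and rows are plain dicts rather than a DataFrame.

```python
from itertools import combinations


class UnionFind:
    """Disjoint-set over row indices (illustrative stand-in for _UnionFind)."""

    def __init__(self, n: int) -> None:
        self.parent = list(range(n))

    def find(self, x: int) -> int:
        # Walk to the root, shortening the path as we go (path halving).
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a: int, b: int) -> None:
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def find_groups(rows, is_match):
    """O(n^2) pairwise compare; matches are merged transitively."""
    uf = UnionFind(len(rows))
    for i, j in combinations(range(len(rows)), 2):
        if is_match(rows[i], rows[j]):
            uf.union(i, j)
    groups = {}
    for i in range(len(rows)):
        groups.setdefault(uf.find(i), []).append(i)
    # Only groups with more than one member are duplicates.
    return [members for members in groups.values() if len(members) > 1]


rows = [
    {"email": "a@x.com", "phone": "1"},
    {"email": "a@x.com", "phone": "2"},
    {"email": "b@x.com", "phone": "2"},
    {"email": "c@x.com", "phone": "3"},
]
# OR semantics: either key matching is enough.
same = lambda r1, r2: r1["email"] == r2["email"] or r1["phone"] == r2["phone"]
groups = find_groups(rows, same)
```

Rows 0 and 2 share no key directly, but both match row 1, so all three land in one group.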
|
||||
|
||||
### src/core/normalizers.py — Text Normalization
|
||||
|
||||
Five normalizer functions, each `str → str`, idempotent, None-safe:
|
||||
|
||||
- **`normalize_email()`** — lowercase, strip Gmail dots, strip `+tag` suffixes
|
||||
- **`normalize_phone()`** — parse with `phonenumbers` to E.164; fallback to digits-only
|
||||
- **`normalize_name()`** — strip title prefixes (Dr., Mr.) and suffixes (Jr., PhD), case-fold
|
||||
- **`normalize_address()`** — USPS abbreviations (Street→St, Avenue→Ave), case-fold
|
||||
- **`normalize_string()`** — trim, collapse whitespace, case-fold
|
||||
|
||||
The `get_normalizer()` registry function maps `NormalizerType` enum values to functions.
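
As a rough illustration of the `str → str`, idempotent, None-safe contract, here is a sketch of an email normalizer. It is not the project's `normalize_email()`: the name `normalize_email_sketch`, the Gmail domain list, and the choice to strip `+tag` for every domain are assumptions.

```python
from typing import Optional

# Assumption: these are the domains where dots in the local part are ignored.
_GMAIL_DOMAINS = {"gmail.com", "googlemail.com"}


def normalize_email_sketch(value: Optional[str]) -> str:
    """Lowercase, drop +tag, drop Gmail dots. Returns "" for None/non-strings."""
    if not value or not isinstance(value, str):
        return ""
    email = value.strip().lower()
    local, sep, domain = email.partition("@")
    if not sep:
        return email  # not an address; leave it merely trimmed and lowercased
    local = local.split("+", 1)[0]
    if domain in _GMAIL_DOMAINS:
        local = local.replace(".", "")
    return f"{local}@{domain}"


once = normalize_email_sketch("John.Doe+news@Gmail.com")
twice = normalize_email_sketch(once)  # idempotent: a second pass changes nothing
```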
|
||||
|
||||
### src/core/io.py — File I/O
|
||||
|
||||
Auto-detection stack:
|
||||
|
||||
1. **`detect_encoding()`** — checks BOM, then uses `charset-normalizer` heuristics
|
||||
2. **`detect_delimiter()`** — uses `csv.Sniffer` on first 20 lines
|
||||
3. **`detect_header_row()`** — finds first row where all cells look like column names
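
The first two detection steps can be approximated with the standard library alone. The sketch below is an assumption-laden stand-in, not the project's code: `sniff_encoding` checks only the BOM and falls back to a default where the real `detect_encoding()` is described as using `charset-normalizer`, and the candidate delimiter set is a guess.

```python
import codecs
import csv

# UTF-32 must be checked before UTF-16: the UTF-32-LE BOM starts with the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_encoding(raw: bytes, default: str = "utf-8") -> str:
    """Return a codec name from the BOM, or `default` when no BOM is present."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    return default


def sniff_delimiter(text: str, max_lines: int = 20, default: str = ",") -> str:
    """Run csv.Sniffer on the first `max_lines` lines; fall back on failure."""
    sample = "\n".join(text.splitlines()[:max_lines])
    try:
        return csv.Sniffer().sniff(sample, delimiters=",;\t|").delimiter
    except csv.Error:
        return default


encoding = sniff_encoding(b"\xef\xbb\xbfname,email\n")
delimiter = sniff_delimiter("a;b;c\n1;2;3\n4;5;6\n")
```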
|
||||
|
||||
Main functions:
|
||||
- **`read_file()`** — reads CSV/TSV/Excel with full auto-detection. Returns a DataFrame.
|
||||
- **`write_file()`** — writes DataFrame to CSV or Excel. Uses `utf-8-sig` by default for Windows Excel compatibility.
|
||||
- **`list_sheets()`** — returns sheet names from an Excel workbook.
|
||||
|
||||
### src/core/config.py — Configuration Profiles
|
||||
|
||||
Save/load deduplication settings as JSON:
|
||||
|
||||
- **`DeduplicationConfig`** — flat dataclass with all settings: strategies, survivor rule, merge flag, algorithm, threshold, normalizer map.
|
||||
- **`.to_file()` / `.from_file()`** — JSON serialization
|
||||
- **`.to_strategies()`** — converts config back to `MatchStrategy` objects for the engine
|
||||
- **`.to_survivor_rule()`** — converts string to `SurvivorRule` enum
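
A JSON round-trip for a flat settings dataclass might look like the sketch below. `ConfigSketch` and its fields are hypothetical and much smaller than `DeduplicationConfig`; ignoring unknown keys on load is one possible reading of "forward-compat config", not a confirmed behavior.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, field, fields
from pathlib import Path


@dataclass
class ConfigSketch:
    columns: list = field(default_factory=list)
    survivor_rule: str = "first"
    merge: bool = False
    threshold: float = 85.0

    def to_file(self, path) -> None:
        Path(path).write_text(json.dumps(asdict(self), indent=2), encoding="utf-8")

    @classmethod
    def from_file(cls, path) -> "ConfigSketch":
        data = json.loads(Path(path).read_text(encoding="utf-8"))
        known = {f.name for f in fields(cls)}
        # Drop keys this version does not know, so newer files still load.
        return cls(**{k: v for k, v in data.items() if k in known})


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "profile.json"
    ConfigSketch(columns=["email"], merge=True).to_file(path)
    # Simulate a file written by a newer version with an extra key.
    data = json.loads(path.read_text(encoding="utf-8"))
    data["future_option"] = 1
    path.write_text(json.dumps(data), encoding="utf-8")
    restored = ConfigSketch.from_file(path)
```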
|
||||
|
||||
### src/cli.py — Command-Line Interface
|
||||
|
||||
Typer-based CLI with 17 options. Key responsibilities:
|
||||
|
||||
- Parse flags into strategies, survivor rule, and other config
|
||||
- Set up logging (timestamped log files in `logs/`)
|
||||
- Column name validation with fuzzy suggestions on typos
|
||||
- `_interactive_review()` — side-by-side row display with y/n/s prompts
|
||||
- Progress bar via `tqdm` for files > 10,000 rows
|
||||
- Output formatting and file writing
|
||||
|
||||
### src/gui/app.py — Streamlit GUI
|
||||
|
||||
Single-page layout:
|
||||
- File upload with instant preview and configurable delimiter (comma, tab, semicolon, pipe, or custom)
|
||||
- Advanced options expander (column selection, fuzzy, normalizers, survivor rule, merge, config profiles)
|
||||
- Find Duplicates button → runs `deduplicate()` with `progress_callback`
|
||||
- Interactive review via `st.data_editor` with inline checkboxes and column dropdowns
|
||||
- Batch actions: Accept All, Reject All, Clear Decisions
|
||||
- Apply review decisions and download cleaned results
|
||||
- Download buttons for deduplicated CSV, removed rows, and match groups report
|
||||
|
||||
### src/gui/components.py — Reusable GUI Widgets
|
||||
|
||||
- **`match_group_card()`** — expandable card with `st.data_editor`: inline Keep checkboxes per row, `SelectboxColumn` dropdowns for differing columns, and a live surviving rows preview
|
||||
- **`config_panel()`** — the advanced options expander, returns settings dict with strategies, survivor rule, merge flag
|
||||
- **`results_summary()`** — summary metrics and download buttons
|
||||
- **`apply_review_decisions()`** — builds final DataFrames from user review decisions (merge, split, or keep-all per group) with column override support
|
||||
|
||||
## Data Flow
|
||||
## Data flow — Deduplicator
|
||||
|
||||
```
|
||||
Input File
|
||||
│
|
||||
▼
|
||||
read_file() ← auto-detect encoding, delimiter, header
|
||||
│
|
||||
▼
|
||||
DataFrame
|
||||
│
|
||||
▼
|
||||
build_default_strategies() ← (if no explicit strategies)
|
||||
│ scan column names → regex patterns
|
||||
│ strong keys: email, phone (standalone OR)
|
||||
│ weak keys: name, address (AND with strong)
|
||||
▼
|
||||
_apply_normalizations() ← add _norm_* shadow columns
|
||||
│ normalize_email(), normalize_phone(), etc.
|
||||
▼
|
||||
_find_match_groups() ← O(n²) pairwise comparison
|
||||
│ for each pair: try all strategies (OR)
|
||||
│ _compute_similarity() per column
|
||||
│ union-find for transitive closure
|
||||
▼
|
||||
[review_callback()] ← optional: interactive review per group
|
||||
│ True=accept, False=reject, None=skip
|
||||
▼
|
||||
_select_survivor() ← per group: first/last/most-complete/most-recent
|
||||
│
|
||||
▼
|
||||
[_merge_group()] ← optional: fill blanks from losers
|
||||
│
|
||||
▼
|
||||
DeduplicationResult
|
||||
├── deduplicated_df ← cleaned DataFrame (shadow cols dropped)
|
||||
├── removed_df ← rows that were removed
|
||||
├── match_groups ← list of MatchResult with confidence, columns
|
||||
└── log_entries ← human-readable audit log
|
||||
read_file() # auto-detect encoding, delimiter, header
|
||||
▼ DataFrame
|
||||
build_default_strategies() # if no explicit strategies
|
||||
▼ # strong keys (email, phone) → standalone OR
|
||||
# weak keys (name, address) → AND with strong
|
||||
_apply_normalizations() # add _norm_* shadow columns
|
||||
▼
|
||||
_find_match_groups() # O(n²) pair compare, OR strategies, union-find
|
||||
▼
|
||||
[review_callback()] # optional interactive review
|
||||
▼
|
||||
_select_survivor() # per group: first/last/most-complete/most-recent
|
||||
▼
|
||||
[_merge_group()] # optional: fill blanks from losers
|
||||
▼
|
||||
DeduplicationResult # deduplicated_df, removed_df, match_groups, log
|
||||
```
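
The last two stages of the flow (survivor selection, then the optional merge) can be sketched on plain dict rows. This is a simplified assumption of how "most-complete" and "fill blanks from losers" behave; the helper names are hypothetical and the real `_select_survivor()` / `_merge_group()` operate on a DataFrame.

```python
def is_blank(value) -> bool:
    return value is None or (isinstance(value, str) and not value.strip())


def select_most_complete(rows, indices):
    """Pick the row with the most non-blank fields; ties go to the earliest index."""
    return max(indices, key=lambda i: sum(not is_blank(v) for v in rows[i].values()))


def merge_group(rows, survivor, losers):
    """Copy the survivor, then fill each blank field from the first loser that has it."""
    merged = dict(rows[survivor])
    for i in losers:
        for column, value in rows[i].items():
            if is_blank(merged.get(column)) and not is_blank(value):
                merged[column] = value
    return merged


rows = [
    {"name": "Ann", "email": "", "phone": "555-0100"},
    {"name": "Ann", "email": "ann@x.com", "phone": ""},
]
survivor = select_most_complete(rows, [0, 1])
merged = merge_group(rows, survivor, [i for i in (0, 1) if i != survivor])
```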
|
||||
|
||||
## How to Add a Normalizer
|
||||
## Extension recipes
|
||||
|
||||
1. **Add the function** in `src/core/normalizers.py`:
|
||||
### Add a normalizer
|
||||
|
||||
```python
|
||||
def normalize_company(value: Optional[str]) -> str:
|
||||
"""Strip legal suffixes (Inc, LLC, Corp), case-fold."""
|
||||
if not value or not isinstance(value, str):
|
||||
return ""
|
||||
name = value.strip().casefold()
|
||||
# Strip common suffixes
|
||||
for suffix in ("inc", "llc", "corp", "ltd", "co"):
|
||||
name = re.sub(rf"\b{suffix}\.?\s*$", "", name).strip()
|
||||
return name
|
||||
```
|
||||
1. Add function to `core/normalizers.py`:
|
||||
```python
|
||||
def normalize_company(value: Optional[str]) -> str:
|
||||
if not value or not isinstance(value, str): return ""
|
||||
name = value.strip().casefold()
|
||||
for sfx in ("inc", "llc", "corp", "ltd", "co"):
|
||||
name = re.sub(rf"\b{sfx}\.?\s*$", "", name).strip()
|
||||
return name
|
||||
```
|
||||
2. Register: add `COMPANY = "company"` to `NormalizerType` + entry in `_NORMALIZER_MAP`.
|
||||
3. Auto-detect (optional): add a `_COLUMN_TYPE_PATTERNS` row in `core/dedup.py`.
|
||||
|
||||
2. **Register it** in the same file:
|
||||
### Add a fuzzy algorithm
|
||||
|
||||
```python
|
||||
class NormalizerType(str, Enum):
|
||||
# ... existing types ...
|
||||
COMPANY = "company" # ← add enum value
|
||||
1. Add value to `Algorithm` enum in `core/dedup.py`.
|
||||
2. Add case in `_compute_similarity()`.
|
||||
3. Document the value in CLI help text.
|
||||
|
||||
_NORMALIZER_MAP: dict[NormalizerType, Callable[[str], str]] = {
|
||||
# ... existing entries ...
|
||||
NormalizerType.COMPANY: normalize_company, # ← add mapping
|
||||
}
|
||||
```
|
||||
### Add a survivor rule
|
||||
|
||||
3. **Add auto-detection pattern** in `src/core/dedup.py` (optional):
|
||||
1. Add value to `SurvivorRule` enum.
|
||||
2. Add branch in `_select_survivor()`.
|
||||
3. Add CLI mapping.
|
||||
|
||||
```python
|
||||
_COLUMN_TYPE_PATTERNS = [
|
||||
# ... existing patterns ...
|
||||
(re.compile(r"company|organization|org_name", re.I),
|
||||
NormalizerType.COMPANY, Algorithm.TOKEN_SET_RATIO, 85.0, False),
|
||||
]
|
||||
```
|
||||
### Add a fix + detector (analyzer/gate)
|
||||
|
||||
## How to Add a Matching Algorithm
|
||||
1. **Detector** in `core/analyze.py`: add `_detect_<thing>(df) -> list[Finding]`, hook into the main `analyze()` pipeline. Emit Finding with a unique `fix_action` id.
|
||||
2. **Fix** in `core/fixes.py`:
|
||||
```python
|
||||
@register("fix_id")
|
||||
def my_fix(df, payload=None) -> tuple[pd.DataFrame, int]:
|
||||
# ...
|
||||
return out_df, cells_changed
|
||||
```
|
||||
3. **Constant** in `core/analyze.py`: add `FIX_<NAME> = "fix_id"` so the detector and fix can reference it.
|
||||
|
||||
1. **Add the enum value** in `src/core/dedup.py`:
|
||||
No other call sites change. Gate auto-discovers it via the registry.
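
Why no other call sites change: a decorator-based registry lets callers look fixes up by id. The sketch below shows the general pattern only. The module-level dict, the duplicate-id check, and the `strip_whitespace` fix are assumptions, and it works on lists of dicts instead of a DataFrame to stay dependency-free.

```python
from typing import Callable, Dict

_FIXES: Dict[str, Callable] = {}


def register(fix_id: str):
    """Decorator that records a fix function under `fix_id`."""
    def decorator(func):
        if fix_id in _FIXES:
            raise ValueError(f"duplicate fix id: {fix_id!r}")
        _FIXES[fix_id] = func
        return func
    return decorator


def get_fix(fix_id: str) -> Callable:
    return _FIXES[fix_id]


def available_actions():
    return sorted(_FIXES)


FIX_STRIP_WHITESPACE = "strip_whitespace"  # hypothetical constant shared with a detector


@register(FIX_STRIP_WHITESPACE)
def strip_whitespace(rows, payload=None):
    """Trim string cells; return (new_rows, cells_changed)."""
    out, changed = [], 0
    for row in rows:
        new = {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}
        changed += sum(new[k] != row[k] for k in row)
        out.append(new)
    return out, changed


fixed, cells_changed = get_fix(FIX_STRIP_WHITESPACE)([{"a": " x ", "b": "y"}])
```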
|
||||
|
||||
```python
|
||||
class Algorithm(str, Enum):
|
||||
# ... existing values ...
|
||||
SOUNDEX = "soundex"
|
||||
```
|
||||
### Add a format-standardizer field type
|
||||
|
||||
2. **Add the computation** in `_compute_similarity()`:
|
||||
1. Add value to `FieldType` enum in `core/format_standardize.py`.
|
||||
2. Add per-cell `standardize_<x>(value, *, …)` returning `(new_value, changed)`.
|
||||
3. Add option fields to `StandardizeOptions` (with defaults that preserve existing behavior).
|
||||
4. Wire into `_apply_field_type()` dispatcher (the `else` branch raises `AssertionError` — every enum value needs a branch).
|
||||
5. Add validation entry in `StandardizeOptions.from_dict()` for any new enum-shaped option.
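
The shape of steps 1, 2, and 4 can be sketched as below. Everything here is hypothetical (the enum members, `standardize_zip`, `standardize_state`, and the `keep_plus4` option); it only illustrates the `(new_value, changed)` return convention and a dispatcher whose `else` branch fails loudly.

```python
import re
from enum import Enum


class FieldTypeSketch(str, Enum):
    ZIP = "zip"
    STATE = "state"


def standardize_zip(value: str, *, keep_plus4: bool = True):
    """Format 5- or 9-digit ZIPs; leave anything else untouched."""
    digits = re.sub(r"\D", "", value)
    if len(digits) == 9:
        new = f"{digits[:5]}-{digits[5:]}" if keep_plus4 else digits[:5]
    elif len(digits) == 5:
        new = digits
    else:
        new = value
    return new, new != value


def standardize_state(value: str):
    new = value.strip().upper()
    return new, new != value


def apply_field_type(value: str, field_type: FieldTypeSketch):
    if field_type is FieldTypeSketch.ZIP:
        return standardize_zip(value)
    elif field_type is FieldTypeSketch.STATE:
        return standardize_state(value)
    else:
        # Reaching here means an enum value was added without a branch.
        raise AssertionError(f"unhandled field type: {field_type!r}")


result = apply_field_type("123456789", FieldTypeSketch.ZIP)
```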
|
||||
|
||||
```python
|
||||
def _compute_similarity(val_a: str, val_b: str, algorithm: Algorithm) -> float:
|
||||
# ... existing cases ...
|
||||
if algorithm == Algorithm.SOUNDEX:
|
||||
return 100.0 if _soundex(val_a) == _soundex(val_b) else 0.0
|
||||
```
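
The removed recipe above calls a `_soundex()` helper that the diff never shows. One possible implementation, assuming standard American Soundex rules (this is a sketch, not code from the repository):

```python
# Consonant classes of American Soundex; vowels, h, w and y have no code.
_SOUNDEX_CODES = {
    **dict.fromkeys("bfpv", "1"),
    **dict.fromkeys("cgjkqsxz", "2"),
    **dict.fromkeys("dt", "3"),
    "l": "4",
    **dict.fromkeys("mn", "5"),
    "r": "6",
}


def soundex(word: str) -> str:
    """Return the 4-character Soundex code, or "" if there are no ASCII letters."""
    letters = [c for c in word.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = _SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _SOUNDEX_CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        # h and w do not separate equal codes; vowels do.
        if ch not in "hw":
            prev = digit
    return (code + "000")[:4]


def soundex_similarity(a: str, b: str) -> float:
    """All-or-nothing score, matching the 100.0 / 0.0 convention in the recipe."""
    return 100.0 if soundex(a) == soundex(b) else 0.0


score = soundex_similarity("Robert", "Rupert")
```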
|
||||
## Errors
|
||||
|
||||
3. **Add the CLI flag value** in `src/cli.py` help text for `--algorithm`.
|
||||
Use `core/errors.py` instead of raw `ValueError` / `OSError`:
|
||||
|
||||
## How to Add a Survivor Strategy
|
||||
| Pattern | Use |
|
||||
|---------|-----|
|
||||
| Bad arg, wrong type, missing column | `InputValidationError` |
|
||||
| Bad config / options file | `ConfigError` |
|
||||
| File parses but isn't what we expected | `FileFormatError` |
|
||||
| File I/O failure (perms, missing, disk full) | `FileAccessError` |
|
||||
| Internal invariant broken (unreachable branch) | `AssertionError` |
|
||||
|
||||
1. **Add the enum value** in `src/core/dedup.py`:
|
||||
Helpers:
|
||||
- `ensure_dataframe(value, function="my_func")` at every public entry that takes a df.
|
||||
- `ensure_choice(value, name="mode", choices=[...])` at every entry that takes a literal.
|
||||
- `wrap_file_read(path, "operation", exc)` / `wrap_file_write(...)` when wrapping `OSError`.
|
||||
|
||||
```python
|
||||
class SurvivorRule(str, Enum):
|
||||
# ... existing values ...
|
||||
KEEP_LONGEST = "longest"
|
||||
```
|
||||
GUI / CLI handlers: use `format_for_user(exc, context="...")` to render.
|
||||
|
||||
2. **Add the logic** in `_select_survivor()`:
|
||||
All `DataToolsError` subclasses extend stdlib `ValueError` or `OSError` so existing handlers still catch them.
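
One way to get that backward compatibility is multiple inheritance from both the project base class and the stdlib exception. The sketch below is an assumption about the layout, not a copy of `core/errors.py`: which stdlib base each class uses, and the bodies of `ensure_choice` and `format_for_user`, are guesses built from the names in this section.

```python
class DataToolsError(Exception):
    """Common base so callers can catch every tool error at once."""


class InputValidationError(DataToolsError, ValueError):
    pass


class ConfigError(DataToolsError, ValueError):
    pass


class FileAccessError(DataToolsError, OSError):
    pass


def ensure_choice(value, *, name, choices):
    """Raise InputValidationError unless `value` is one of `choices`."""
    if value not in choices:
        raise InputValidationError(
            f"{name} must be one of {sorted(choices)}, got {value!r}"
        )
    return value


def format_for_user(exc: Exception, context: str = "") -> str:
    prefix = f"{context}: " if context else ""
    return f"{prefix}{exc}"


try:
    ensure_choice("x", name="mode", choices=["a", "b"])
except ValueError as exc:  # a pre-existing ValueError handler still catches it
    caught = exc
    message = format_for_user(exc, context="Loading options")
```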
|
||||
|
||||
```python
|
||||
if rule == SurvivorRule.KEEP_LONGEST:
|
||||
return max(indices, key=lambda i: len(str(df.iloc[i].to_dict())))
|
||||
```
|
||||
|
||||
3. **Add to the CLI** survivor map in `src/cli.py`.
|
||||
|
||||
## Testing
|
||||
|
||||
### Run Tests
|
||||
## Tests
|
||||
|
||||
```bash
|
||||
# All tests
|
||||
pytest tests/ -q
|
||||
|
||||
# Specific module
|
||||
pytest tests/test_dedup.py -q
|
||||
pytest tests/test_normalizers.py -q
|
||||
pytest tests/test_io.py -q
|
||||
pytest tests/test_config.py -q
|
||||
pytest tests/test_cli.py -q
|
||||
|
||||
# Verbose with output
|
||||
pytest tests/ -v
|
||||
|
||||
# Stop on first failure
|
||||
pytest tests/ -x
|
||||
# All
|
||||
pytest -q
|
||||
# By module
|
||||
pytest tests/test_dedup.py
|
||||
# Include slow / integration
|
||||
pytest -m slow
|
||||
# Single test
|
||||
pytest tests/test_dedup.py::TestExactMatch::test_basic
|
||||
```
|
||||
|
||||
### Test Structure
|
||||
|
||||
Test layout:
|
||||
```
|
||||
tests/
|
||||
├── conftest.py # Shared fixtures
|
||||
│ ├── sample_csv_path # Path to samples/messy_sales.csv
|
||||
│ ├── sample_df # Loaded sample CSV as DataFrame
|
||||
│ ├── simple_df # Small 5-row DataFrame with obvious duplicates
|
||||
│ ├── merge_df # DataFrame with partial records
|
||||
│ └── tmp_csv # Temporary CSV from simple_df
|
||||
├── test_dedup.py # Engine tests: similarity, union-find, pairs, integration
|
||||
├── test_normalizers.py # Normalizer tests: all 5 types with edge cases
|
||||
├── test_io.py # I/O tests: encoding, delimiter, header, read/write
|
||||
├── test_config.py # Config tests: serialization round-trip
|
||||
└── test_cli.py # CLI tests: argument parsing, file handling
|
||||
├── conftest.py # fixtures
|
||||
├── test_dedup.py · test_normalizers.py · test_io.py · test_config.py
|
||||
├── test_analyze.py · test_normalize.py · test_text_clean.py
|
||||
├── test_format_standardize.py
|
||||
├── test_format_standardize_corpus.py # 199-row buyer corpus
|
||||
├── test_audit_fixes.py · test_errors.py · test_fixes_unit.py
|
||||
├── test_corpus.py · test_encodings_corpus.py · test_fixtures_sweep.py
|
||||
└── test_cli.py · test_cli_*.py · test_e2e.py · test_install.py
|
||||
```
|
||||
|
||||
### Writing Tests
|
||||
Fixture corpora: `test-cases/text-cleaner-corpus/` (21 files) · `test-cases/encodings-corpus/` (31 files) · `test-cases/format-cleaner-corpus/` (7 files + spec).
|
||||
|
||||
Follow existing patterns. Tests use pytest fixtures from `conftest.py`:
|
||||
## Known limitations
|
||||
|
||||
```python
|
||||
def test_my_feature(simple_df):
|
||||
"""Test description."""
|
||||
result = deduplicate(simple_df, ...)
|
||||
assert len(result.match_groups) == expected
|
||||
assert result.deduplicated_df.shape[0] == expected_rows
|
||||
```
|
||||
|
||||
## Known Limitations
|
||||
|
||||
- **O(n^2) pairwise comparison** — no blocking or indexing. Works well up to ~50,000 rows. Beyond that, performance degrades quadratically. Future optimization: add blocking (partition by first letter, zip code prefix, etc.) to reduce comparison space.
|
||||
- **No multi-sheet dedup** — each Excel sheet is processed independently. Cross-sheet deduplication is not supported.
|
||||
- **Phone normalization requires valid-length numbers** — the `phonenumbers` library rejects numbers that are too short or too long for the detected region. Fallback is digits-only, which may produce false negatives for international numbers without country codes.
|
||||
- **Single-threaded** — no parallel comparison. Could benefit from `multiprocessing` for large files.
|
||||
- **Memory-bound** — entire file is loaded into a pandas DataFrame. Files larger than available RAM will fail. Chunked reading exists but is not integrated with the dedup engine.
|
||||
- **Dedup is O(n²)** — no blocking. Works to ~50k rows. Future: partition by first letter / ZIP prefix.
|
||||
- **Single-threaded** — could benefit from `multiprocessing`.
|
||||
- **Memory-bound** — entire file loaded into pandas. Streaming reads exist but not integrated with dedup engine.
|
||||
- **No multi-sheet dedup** — each Excel sheet processed independently.
|
||||
- **Phonenumbers minimum-length** — international numbers without country codes fall back to digits-only.
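
The blocking idea mentioned for the O(n²) limitation can be sketched as follows: compare rows only within a block that shares a cheap key. This is a hypothetical illustration (`candidate_pairs` and the ZIP-prefix key are not part of the codebase), and it trades recall for speed, since duplicates that fall into different blocks are never compared.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(rows, block_key):
    """Yield index pairs only for rows that share the same blocking key."""
    blocks = defaultdict(list)
    for i, row in enumerate(rows):
        blocks[block_key(row)].append(i)
    for members in blocks.values():
        yield from combinations(members, 2)


rows = [
    {"name": "Ann", "zip": "10001"},
    {"name": "Anne", "zip": "10002"},
    {"name": "Bob", "zip": "94110"},
    {"name": "Rob", "zip": "94107"},
]
blocked = list(candidate_pairs(rows, lambda r: r["zip"][:3]))
all_pairs = list(combinations(range(len(rows)), 2))
```

Here the blocked pass checks 2 pairs instead of 6; the saving grows with the number of blocks.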
|
||||
|
||||