docs: update all documentation to reflect v3.0 functionality

Update README, CLI reference, and developer guide to cover delimiter
selector, inline checkboxes/dropdowns, live surviving rows preview,
multi-row survivors, and apply_review_decisions(). Remove dead link.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 00:58:38 +00:00
parent 27fe87c4fe
commit 9ec371a85f
3 changed files with 11 additions and 9 deletions


@@ -1,6 +1,6 @@
 # DataTools Deduplicator
-Find and remove duplicate rows in CSV and Excel files — with fuzzy matching, smart normalization, and interactive review.
+Find and remove duplicate rows in CSV, delimited text, and Excel files — with fuzzy matching, smart normalization, and interactive review.
 ## Features
@@ -9,7 +9,7 @@ Find and remove duplicate rows in CSV and Excel files — with fuzzy matching, s
 - **5 built-in normalizers** — email (Gmail dot/plus), phone (E.164), name (titles/suffixes), address (USPS), string (whitespace/case)
 - **Merge mode** — fill missing fields in the surviving row from removed duplicates
 - **4 survivor rules** — keep first, last, most complete, or most recent row per group
-- **Interactive review** — inspect each match group and decide: merge, keep both, or skip
+- **Interactive review** — inspect match groups with inline checkboxes and column dropdowns, cherry-pick values, preview surviving rows live
 - **Config profiles** — save and reload your settings as JSON for repeatable runs
 - **Dual interface** — full CLI for automation, Streamlit GUI for visual review
 - **Dry-run by default** — preview what would change before writing anything
@@ -111,7 +111,6 @@ When `--apply` is used, three files are produced:
 - [CLI Reference](docs/CLI-REFERENCE.md) — every flag with examples and recipe sections
 - [Developer Guide](docs/DEVELOPER.md) — architecture, data flow, how to extend
-- [User Guide](docs/USER-GUIDE.md) — installation and usage for end users
 ## Requirements


@@ -10,7 +10,7 @@ python -m src.cli INPUT_FILE [OPTIONS]
 | Argument | Required | Description |
 |----------|----------|-------------|
-| `INPUT_FILE` | Yes | Path to the CSV or Excel file to deduplicate |
+| `INPUT_FILE` | Yes | Path to the CSV, delimited text, or Excel file to deduplicate |
 ## Options


@@ -90,17 +90,20 @@ Typer-based CLI with 17 options. Key responsibilities:
 ### src/gui/app.py — Streamlit GUI
 Single-page layout:
-- File upload with instant preview
+- File upload with instant preview and configurable delimiter (comma, tab, semicolon, pipe, or custom)
 - Advanced options expander (column selection, fuzzy, normalizers, survivor rule, merge, config profiles)
 - Find Duplicates button → runs `deduplicate()` with `progress_callback`
-- Interactive review: expandable match group cards with merge/keep/skip buttons
+- Interactive review via `st.data_editor` with inline checkboxes and column dropdowns
+- Batch actions: Accept All, Reject All, Clear Decisions
+- Apply review decisions and download cleaned results
 - Download buttons for deduplicated CSV, removed rows, and match groups report
 ### src/gui/components.py — Reusable GUI Widgets
-- **`match_group_card()`** — expandable card showing side-by-side row comparison with diff highlighting
+- **`match_group_card()`** — expandable card with `st.data_editor`: inline Keep checkboxes per row, `SelectboxColumn` dropdowns for differing columns, and a live surviving rows preview
-- **`config_panel()`** — the advanced options expander, returns a `DeduplicationConfig`
+- **`config_panel()`** — the advanced options expander, returns settings dict with strategies, survivor rule, merge flag
-- **`results_summary()`** — summary stats and download buttons
+- **`results_summary()`** — summary metrics and download buttons
+- **`apply_review_decisions()`** — builds final DataFrames from user review decisions (merge, split, or keep-all per group) with column override support
 ## Data Flow
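
Review note: the new `apply_review_decisions()` entry describes turning per-group review decisions into final tables, but the diff does not show its signature or the exact meaning of merge, split, and keep-all. The sketch below illustrates the general idea with plain dicts instead of DataFrames. Everything in it is an assumption: the function name `apply_decisions_sketch`, the shape of `groups` and `decisions`, and the rule that column overrides apply only when a single row survives.

```python
def apply_decisions_sketch(groups, decisions):
    """Return (kept_rows, removed_rows) for reviewed match groups.

    groups:    {group_id: [row_dict, ...]}
    decisions: {group_id: {"keep": [row indexes], "overrides": {column: value}}}
    Groups without a decision keep every row (assumption).
    """
    kept, removed = [], []
    for group_id, rows in groups.items():
        decision = decisions.get(group_id)
        if decision is None:
            # Unreviewed group: nothing is removed.
            kept.extend(dict(row) for row in rows)
            continue
        keep_idx = set(decision.get("keep", []))
        # Copy rows so the caller's data is never mutated.
        survivors = [dict(row) for i, row in enumerate(rows) if i in keep_idx]
        removed.extend(dict(row) for i, row in enumerate(rows) if i not in keep_idx)
        # Assumption: overrides only apply to a single merged survivor.
        if len(survivors) == 1:
            survivors[0].update(decision.get("overrides", {}))
        kept.extend(survivors)
    return kept, removed


groups = {
    1: [{"name": "Ada", "phone": ""}, {"name": "Ada L.", "phone": "555-0100"}],
    2: [{"name": "Bob", "phone": "555-0101"}, {"name": "Rob", "phone": "555-0102"}],
}
decisions = {
    1: {"keep": [0], "overrides": {"phone": "555-0100"}},  # merge into row 0
    2: {"keep": [0, 1]},  # both rows survive
}
kept, removed = apply_decisions_sketch(groups, decisions)
```

In this model the Keep checkboxes map to the `keep` indexes and the column dropdowns map to `overrides`, so a group with several checked rows yields multiple survivors without any merging.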