docs: update all documentation to reflect v3.0 functionality
Update README, CLI reference, and developer guide to cover delimiter selector, inline checkboxes/dropdowns, live surviving rows preview, multi-row survivors, and apply_review_decisions(). Remove dead link.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@@ -1,6 +1,6 @@
 # DataTools Deduplicator
 
-Find and remove duplicate rows in CSV and Excel files — with fuzzy matching, smart normalization, and interactive review.
+Find and remove duplicate rows in CSV, delimited text, and Excel files — with fuzzy matching, smart normalization, and interactive review.
 
 ## Features
 
@@ -9,7 +9,7 @@ Find and remove duplicate rows in CSV and Excel files — with fuzzy matching, s
 - **5 built-in normalizers** — email (Gmail dot/plus), phone (E.164), name (titles/suffixes), address (USPS), string (whitespace/case)
 - **Merge mode** — fill missing fields in the surviving row from removed duplicates
 - **4 survivor rules** — keep first, last, most complete, or most recent row per group
-- **Interactive review** — inspect each match group and decide: merge, keep both, or skip
+- **Interactive review** — inspect match groups with inline checkboxes and column dropdowns, cherry-pick values, preview surviving rows live
 - **Config profiles** — save and reload your settings as JSON for repeatable runs
 - **Dual interface** — full CLI for automation, Streamlit GUI for visual review
 - **Dry-run by default** — preview what would change before writing anything
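The survivor rules and merge mode listed in this hunk can be illustrated with a minimal sketch. It assumes rows are plain dicts and that an empty string or `None` counts as missing. The names `pick_survivor` and `merge_group` and the rule strings are hypothetical and are not taken from the project's code. The most-recent rule is left out because it depends on a date column that the README does not name.

```python
from typing import Optional

Row = dict[str, Optional[str]]


def is_filled(value: Optional[str]) -> bool:
    """Treat None and whitespace-only strings as missing."""
    return value is not None and str(value).strip() != ""


def pick_survivor(group: list[Row], rule: str = "first") -> int:
    """Return the index of the surviving row (hypothetical rule names)."""
    if rule == "first":
        return 0
    if rule == "last":
        return len(group) - 1
    if rule == "most_complete":
        counts = [sum(is_filled(v) for v in row.values()) for row in group]
        return counts.index(max(counts))  # ties go to the earliest row
    raise ValueError(f"unknown survivor rule: {rule}")


def merge_group(group: list[Row], survivor_idx: int) -> Row:
    """Fill the survivor's missing fields from the other rows in the group."""
    merged = dict(group[survivor_idx])
    for i, row in enumerate(group):
        if i == survivor_idx:
            continue
        for key, value in row.items():
            if not is_filled(merged.get(key)) and is_filled(value):
                merged[key] = value
    return merged


group = [
    {"name": "Ann Lee", "email": None, "phone": None},
    {"name": "Ann Lee", "email": "ann@example.com", "phone": None},
    {"name": "A. Lee", "email": None, "phone": "555-0100"},
]
survivor = pick_survivor(group, "most_complete")
merged = merge_group(group, survivor)
```

Here the second row survives (it ties with the third on filled fields but comes first), and merging copies the phone number from the third row without overwriting the survivor's name.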
@@ -111,7 +111,6 @@ When `--apply` is used, three files are produced:
 
 - [CLI Reference](docs/CLI-REFERENCE.md) — every flag with examples and recipe sections
 - [Developer Guide](docs/DEVELOPER.md) — architecture, data flow, how to extend
-- [User Guide](docs/USER-GUIDE.md) — installation and usage for end users
 
 ## Requirements
 
@@ -10,7 +10,7 @@ python -m src.cli INPUT_FILE [OPTIONS]
 
 | Argument | Required | Description |
 |----------|----------|-------------|
-| `INPUT_FILE` | Yes | Path to the CSV or Excel file to deduplicate |
+| `INPUT_FILE` | Yes | Path to the CSV, delimited text, or Excel file to deduplicate |
 
 ## Options
 
@@ -90,17 +90,20 @@ Typer-based CLI with 17 options. Key responsibilities:
 ### src/gui/app.py — Streamlit GUI
 
 Single-page layout:
-- File upload with instant preview
+- File upload with instant preview and configurable delimiter (comma, tab, semicolon, pipe, or custom)
 - Advanced options expander (column selection, fuzzy, normalizers, survivor rule, merge, config profiles)
 - Find Duplicates button → runs `deduplicate()` with `progress_callback`
-- Interactive review: expandable match group cards with merge/keep/skip buttons
+- Interactive review via `st.data_editor` with inline checkboxes and column dropdowns
+- Batch actions: Accept All, Reject All, Clear Decisions
+- Apply review decisions and download cleaned results
 - Download buttons for deduplicated CSV, removed rows, and match groups report
 
 ### src/gui/components.py — Reusable GUI Widgets
 
-- **`match_group_card()`** — expandable card showing side-by-side row comparison with diff highlighting
+- **`match_group_card()`** — expandable card with `st.data_editor`: inline Keep checkboxes per row, `SelectboxColumn` dropdowns for differing columns, and a live surviving rows preview
-- **`config_panel()`** — the advanced options expander, returns a `DeduplicationConfig`
+- **`config_panel()`** — the advanced options expander, returns settings dict with strategies, survivor rule, merge flag
-- **`results_summary()`** — summary stats and download buttons
+- **`results_summary()`** — summary metrics and download buttons
+- **`apply_review_decisions()`** — builds final DataFrames from user review decisions (merge, split, or keep-all per group) with column override support
 
 ## Data Flow
 
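The configurable delimiter described for the file upload step could be handled roughly as follows. This is a sketch using only the standard library `csv` module. The label-to-character map and the function names `resolve_delimiter` and `read_rows` are assumptions made for illustration, not the GUI's actual code.

```python
import csv
import io

# Hypothetical labels mirroring the choices named in the developer guide.
DELIMITERS = {"comma": ",", "tab": "\t", "semicolon": ";", "pipe": "|"}


def resolve_delimiter(choice: str, custom: str = "") -> str:
    """Map a selector label to a single delimiter character."""
    if choice == "custom":
        # The csv module only accepts one-character delimiters.
        if len(custom) != 1:
            raise ValueError("custom delimiter must be exactly one character")
        return custom
    return DELIMITERS[choice]


def read_rows(text: str, choice: str = "comma", custom: str = "") -> list[dict[str, str]]:
    """Parse delimited text into a list of row dicts keyed by the header line."""
    delimiter = resolve_delimiter(choice, custom)
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return list(reader)


sample = "name|email\nAnn Lee|ann@example.com\n"
rows = read_rows(sample, "pipe")
```

Validating the custom value up front keeps the error message close to the user's input instead of surfacing a lower-level parser error later.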
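One possible reading of the `apply_review_decisions()` entry in the developer guide hunk is sketched below. The diff gives only a one-line description, so everything here is an assumption: the function name `apply_decisions_sketch`, the decision keys (`action`, `keep`, `overrides`), the use of plain dicts instead of DataFrames, and the choice to treat undecided groups as keep-all.

```python
from typing import Any

Row = dict[str, Any]


def apply_decisions_sketch(
    groups: dict[int, list[Row]],
    decisions: dict[int, dict[str, Any]],
) -> tuple[list[Row], list[Row]]:
    """Return (kept_rows, removed_rows) under assumed decision semantics."""
    kept: list[Row] = []
    removed: list[Row] = []
    for group_id, rows in groups.items():
        decision = decisions.get(group_id, {})
        action = decision.get("action", "keep_all")  # assumed default
        if action == "keep_all":
            kept.extend(dict(r) for r in rows)
        elif action == "split":
            # Rows flagged True survive individually (multi-row survivors).
            flags = decision.get("keep", [True] * len(rows))
            for row, flag in zip(rows, flags):
                (kept if flag else removed).append(dict(row))
        elif action == "merge":
            # First row survives; per-column overrides replace its values.
            survivor = dict(rows[0])
            survivor.update(decision.get("overrides", {}))
            kept.append(survivor)
            removed.extend(dict(r) for r in rows[1:])
        else:
            raise ValueError(f"unknown action: {action}")
    return kept, removed


groups = {
    1: [{"name": "Ann", "city": "Oslo"}, {"name": "Ann", "city": "Bergen"}],
    2: [
        {"name": "Bob", "city": "Rome"},
        {"name": "Bob", "city": "Roma"},
        {"name": "Rob", "city": "Rome"},
    ],
}
decisions = {
    1: {"action": "merge", "overrides": {"city": "Bergen"}},
    2: {"action": "split", "keep": [True, False, True]},
}
kept, removed = apply_decisions_sketch(groups, decisions)
```

Returning kept and removed rows separately matches the separate downloads for cleaned results and removed rows mentioned in the guide.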