diff --git a/README.md b/README.md
index e0c61a2..0d9ee5d 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,6 @@
 # DataTools Deduplicator
 
-Find and remove duplicate rows in CSV and Excel files — with fuzzy matching, smart normalization, and interactive review.
+Find and remove duplicate rows in CSV, delimited text, and Excel files — with fuzzy matching, smart normalization, and interactive review.
 
 ## Features
 
@@ -9,7 +9,7 @@ Find and remove duplicate rows in CSV and Excel files — with fuzzy matching, s
 - **5 built-in normalizers** — email (Gmail dot/plus), phone (E.164), name (titles/suffixes), address (USPS), string (whitespace/case)
 - **Merge mode** — fill missing fields in the surviving row from removed duplicates
 - **4 survivor rules** — keep first, last, most complete, or most recent row per group
-- **Interactive review** — inspect each match group and decide: merge, keep both, or skip
+- **Interactive review** — inspect match groups with inline checkboxes and column dropdowns, cherry-pick values, preview surviving rows live
 - **Config profiles** — save and reload your settings as JSON for repeatable runs
 - **Dual interface** — full CLI for automation, Streamlit GUI for visual review
 - **Dry-run by default** — preview what would change before writing anything
@@ -111,7 +111,6 @@ When `--apply` is used, three files are produced:
 
 - [CLI Reference](docs/CLI-REFERENCE.md) — every flag with examples and recipe sections
 - [Developer Guide](docs/DEVELOPER.md) — architecture, data flow, how to extend
-- [User Guide](docs/USER-GUIDE.md) — installation and usage for end users
 
 ## Requirements
 
diff --git a/docs/CLI-REFERENCE.md b/docs/CLI-REFERENCE.md
index 15e1d25..71541c9 100644
--- a/docs/CLI-REFERENCE.md
+++ b/docs/CLI-REFERENCE.md
@@ -10,7 +10,7 @@ python -m src.cli INPUT_FILE [OPTIONS]
 
 | Argument | Required | Description |
 |----------|----------|-------------|
-| `INPUT_FILE` | Yes | Path to the CSV or Excel file to deduplicate |
+| `INPUT_FILE` | Yes | Path to the CSV, delimited text, or Excel file to deduplicate |
 
 ## Options
 
diff --git a/docs/DEVELOPER.md b/docs/DEVELOPER.md
index 43d03dc..b36e5cf 100644
--- a/docs/DEVELOPER.md
+++ b/docs/DEVELOPER.md
@@ -90,17 +90,20 @@ Typer-based CLI with 17 options. Key responsibilities:
 ### src/gui/app.py — Streamlit GUI
 
 Single-page layout:
-- File upload with instant preview
+- File upload with instant preview and configurable delimiter (comma, tab, semicolon, pipe, or custom)
 - Advanced options expander (column selection, fuzzy, normalizers, survivor rule, merge, config profiles)
 - Find Duplicates button → runs `deduplicate()` with `progress_callback`
-- Interactive review: expandable match group cards with merge/keep/skip buttons
+- Interactive review via `st.data_editor` with inline checkboxes and column dropdowns
+- Batch actions: Accept All, Reject All, Clear Decisions
+- Apply review decisions and download cleaned results
 - Download buttons for deduplicated CSV, removed rows, and match groups report
 
 ### src/gui/components.py — Reusable GUI Widgets
 
-- **`match_group_card()`** — expandable card showing side-by-side row comparison with diff highlighting
-- **`config_panel()`** — the advanced options expander, returns a `DeduplicationConfig`
-- **`results_summary()`** — summary stats and download buttons
+- **`match_group_card()`** — expandable card with `st.data_editor`: inline Keep checkboxes per row, `SelectboxColumn` dropdowns for differing columns, and a live surviving rows preview
+- **`config_panel()`** — the advanced options expander, returns settings dict with strategies, survivor rule, merge flag
+- **`results_summary()`** — summary metrics and download buttons
+- **`apply_review_decisions()`** — builds final DataFrames from user review decisions (merge, split, or keep-all per group) with column override support
 
 ## Data Flow
 