# CLI Reference

Complete command-line reference for the DataTools bundle.

DataTools ships two CLI modules so each script can be invoked independently:

| Module | Command | Purpose |
|---|---|---|
| `src.cli` | `python -m src.cli INPUT_FILE [OPTIONS]` | Deduplicator (script 01) |
| `src.cli_text_clean` | `python -m src.cli_text_clean INPUT_FILE [OPTIONS]` | Text cleaner (script 02) |

The deduplicator section is below; the text cleaner reference is in [Section: Text Cleaner CLI](#text-cleaner-cli).

## Deduplicator

```
python -m src.cli INPUT_FILE [OPTIONS]
```

## Arguments

| Argument | Required | Description |
|----------|----------|-------------|
| `INPUT_FILE` | Yes | Path to the CSV, delimited text, or Excel file to deduplicate |

## Options

### Core

| Flag | Short | Default | Description |
|------|-------|---------|-------------|
| `--apply` | | `false` | Write output files. Without this flag, only a preview is shown. |
| `--output` | `-o` | `{input}_deduplicated.csv` | Output file path. |

### Column Selection

| Flag | Short | Default | Description |
|------|-------|---------|-------------|
| `--subset` | `-s` | auto-detect | Comma-separated columns to match on. When omitted, columns are auto-detected by name pattern (email, phone, name, address). |
| `--key` | `-k` | none | Comma-separated strong-key columns. Each becomes an independent exact-match strategy. Use for identifiers like `fb_id`, `ein`, `sku`. |

### Fuzzy Matching

| Flag | Short | Default | Description |
|------|-------|---------|-------------|
| `--fuzzy` | | none | Comma-separated columns to fuzzy-match. Other columns in the strategy use exact matching. |
| `--algorithm` | `-a` | `jaro_winkler` | Fuzzy algorithm: `levenshtein`, `jaro_winkler`, or `token_set_ratio`. |
| `--threshold` | `-t` | `85` | Similarity threshold 0-100. Lower values find more matches but increase false positives. |

### Normalization

| Flag | Short | Default | Description |
|------|-------|---------|-------------|
| `--normalize` | | auto-detect | Column normalizers as `col:type` pairs, comma-separated. Types: `email`, `phone`, `name`, `address`, `string`. |

**Normalizer details:**

| Type | What it does | Example |
|------|-------------|---------|
| `email` | Lowercase, strip Gmail dots, strip `+tag` suffixes | `John.Doe+tag@gmail.com` → `johndoe@gmail.com` |
| `phone` | Parse to E.164 format; fallback: digits only | `(555) 123-4567` → `+15551234567` |
| `name` | Strip titles (Dr., Mr.) and suffixes (Jr., PhD), case-fold | `Dr. John Smith Jr.` → `john smith` |
| `address` | USPS abbreviations (Street→St, Avenue→Ave), case-fold | `123 Main Street, Suite 4` → `123 main st ste 4` |
| `string` | Trim, collapse whitespace, case-fold | ` HELLO WORLD ` → `hello world` |

### Survivor Selection

| Flag | Short | Default | Description |
|------|-------|---------|-------------|
| `--survivor` | | `first` | Which row to keep per duplicate group. |
| `--date-column` | | none | Date column for the `most-recent` rule. |
| `--merge` | | `false` | Fill missing fields in the surviving row from removed duplicates. |

**Survivor rules:**

| Rule | Behavior |
|------|----------|
| `first` | Keep the first row encountered (lowest row number) |
| `last` | Keep the last row encountered (highest row number) |
| `most-complete` | Keep the row with the fewest blank/empty cells |
| `most-recent` | Keep the row with the latest date (requires `--date-column`) |

### Interactive Review

| Flag | Short | Default | Description |
|------|-------|---------|-------------|
| `--review` | | `false` | Interactively review each match group. For each group, choose: merge (y), keep both (n), or skip remaining (s). |

### Configuration

| Flag | Short | Default | Description |
|------|-------|---------|-------------|
| `--config` | | none | Load all settings from a saved JSON config file. |
| `--save-config` | | none | Save current settings to a JSON config file for reuse. |

### File Handling

| Flag | Short | Default | Description |
|------|-------|---------|-------------|
| `--sheet` | | first sheet | Excel sheet name or 0-based index. Ignored for CSV files. |
| `--encoding` | | auto-detect | Override auto-detected file encoding (e.g., `utf-8`, `windows-1252`). |
| `--header-row` | | auto-detect | 0-based row index for the header row. |

---

## Recipes

### 1. Basic Dedup (Auto-Detect)

Let the engine detect email, phone, name, and address columns automatically.

```bash
# Preview
python -m src.cli customers.csv

# Apply
python -m src.cli customers.csv --apply
```

The engine scans column names for patterns like `email`, `phone`, `name`, `address` and builds strategies automatically. Strong keys (email, phone) become standalone strategies; weak keys (name, address) are paired with strong keys.

### 2. Fuzzy Name Matching

Match rows where names are similar but not identical. This catches typos, nickname variations, and formatting differences.

```bash
# Fuzzy-match on the "name" column at 80% similarity
python -m src.cli customers.csv --fuzzy name --threshold 80 --apply

# Fuzzy-match on multiple columns
python -m src.cli customers.csv --fuzzy name,address --threshold 85 --apply

# Use Levenshtein distance instead of Jaro-Winkler
python -m src.cli customers.csv --fuzzy name --algorithm levenshtein --threshold 80 --apply
```

**Algorithm comparison:**

- `jaro_winkler` (default): best for short strings like names; weights early characters more heavily
- `levenshtein`: edit-distance ratio; works well for typos and transpositions
- `token_set_ratio`: best for addresses and long strings; ignores word order

### 3. Custom Strong Keys

Use specific identifier columns to find exact duplicates.
```bash
# Deduplicate by Facebook ID
python -m src.cli donors.csv --key fb_id --apply

# Multiple strong keys (each is independent; matched with OR)
python -m src.cli donors.csv --key fb_id,ein --apply
```

Strong keys are OR'd: a match on `fb_id` alone OR `ein` alone marks rows as duplicates.

### 4. Merge Mode

Keep the most complete row and fill any remaining blanks from the duplicates.

```bash
# Most complete row + merge missing fields
python -m src.cli contacts.csv --survivor most-complete --merge --apply

# Keep most recent row and merge
python -m src.cli contacts.csv --survivor most-recent --date-column updated_at --merge --apply
```

**How merge works:** The survivor row keeps all its non-empty fields. For any blank/null fields, the engine fills from the removed rows (in row order). The result is a single row with maximum data retention.

### 5. Multi-Column Subset

Match on a specific combination of columns rather than auto-detecting.

```bash
# Exact match on email + phone only
python -m src.cli customers.csv --subset email,phone --apply

# Mix exact and fuzzy within a subset
python -m src.cli customers.csv --subset email,name --fuzzy name --threshold 85 --apply
```

When using `--subset`, all listed columns must match (AND logic) for a pair to be considered duplicates.

### 6. Save and Load Config Profiles

Save your settings for repeatable runs on similar files.

```bash
# Save settings to a file
python -m src.cli customers.csv --fuzzy name --threshold 80 --merge \
    --survivor most-complete --save-config customer_dedup.json

# Load saved settings
python -m src.cli new_customers.csv --config customer_dedup.json --apply
```

Config files are JSON. Example:

```json
{
  "strategies": [],
  "survivor_rule": "most_complete",
  "merge": true,
  "default_algorithm": "jaro_winkler",
  "default_threshold": 80.0,
  "fuzzy_columns": ["name"]
}
```

### 7. Interactive Review

Step through each match group and decide whether to merge.
```bash
python -m src.cli customers.csv --review --apply
```

For each group, the CLI displays both rows side-by-side and prompts:

```
============================================================
Match Group 1 — Confidence: 92.3%
Matched on: name, phone
============================================================
Row 1:
  name: John Smith
  email: john@example.com
  phone: (555) 123-4567
Row 2:
  name: Jon Smith
  email:
  phone: 555-123-4567

[y] Merge  [n] Keep both  [s] Skip remaining:
```

- **y**: accept the match; merge/remove duplicate
- **n**: reject the match; keep both rows
- **s**: skip all remaining groups (keep both for all)

### 8. Excel Files and Multi-Sheet

Work with Excel files directly; no CSV conversion is needed.

```bash
# Deduplicate first sheet (default)
python -m src.cli data.xlsx --apply

# Specify sheet by name
python -m src.cli data.xlsx --sheet "Sales Data" --apply

# Specify sheet by index (0-based)
python -m src.cli data.xlsx --sheet 1 --apply
```

Output is CSV by default, even when the input is an Excel file.
To write Excel output, use `-o`:

```bash
python -m src.cli data.xlsx -o cleaned.xlsx --apply
```

---

## Auto-Detection Details

When no `--subset` or `--fuzzy` flags are provided, the engine scans column names and builds strategies:

| Column pattern | Detection regex | Algorithm | Threshold | Normalizer | Key type |
|---------------|----------------|-----------|-----------|------------|----------|
| Email | `e[-_]?mail` | exact | 100% | email | strong |
| Phone | `phone\|telephone\|mobile\|cell` | exact | 100% | phone | strong |
| Name | `^(name\|full_name\|customer_name\|...)$` | jaro_winkler | 85% | name | weak |
| Address | `address\|street\|addr` | token_set_ratio | 80% | address | weak |

**Strategy building rules:**

- Strong keys → standalone OR strategies (email match alone is enough)
- Weak keys → paired with each strong key via AND (name match requires email or phone match too)
- No strong keys found → weak keys promoted to standalone
- No patterns matched → exact match on all columns (equivalent to `drop_duplicates`)

## Output Files

When `--apply` is set, three files are written:

| File | Description |
|------|-------------|
| `{stem}_deduplicated.csv` | Cleaned DataFrame with duplicates removed |
| `{stem}_removed.csv` | Rows that were removed |
| `{stem}_match_groups.csv` | Audit trail with `_group_id`, `_is_survivor`, `_confidence`, `_matched_on`, `_original_row`, plus all original columns |

## Logging

Every run writes a timestamped log to `logs/dedup_YYYYMMDD_HHMMSS.log` with full debug-level details: strategies used, pair comparisons, survivor decisions, and merge actions.

---

# Text Cleaner CLI

Character-level hygiene for CSV / Excel files: whitespace trim and collapse, smart-character folding, Unicode normalization, BOM strip, control-char strip, line-ending normalization, optional case conversion. See TECHNICAL.md Section 10.2 for the full functional spec.
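As a rough illustration of the operations listed above, the sketch below applies a comparable pipeline to a single string using only the Python standard library. The function name `clean_cell`, the `SMART_CHARS` and `ZERO_WIDTH` tables, and the order of operations are assumptions made for this example; the tool's actual behavior is defined in TECHNICAL.md Section 10.2 and may differ.

```python
import re
import unicodedata

# Hypothetical mapping; the tool's real smart-character table may cover more cases.
SMART_CHARS = {
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en dash, em dash
    "\u00a0": " ",                  # no-break space
    "\u2026": "...",                # ellipsis
}
# Hypothetical set of zero-width / invisible characters to drop.
ZERO_WIDTH = "\u200b\u200c\u200d\u2060"


def clean_cell(text: str) -> str:
    """Illustrative cell cleaner; not the DataTools implementation."""
    text = text.lstrip("\ufeff")                            # leading BOM strip
    text = text.replace("\r\n", "\n").replace("\r", "\n")   # line-ending normalization
    text = unicodedata.normalize("NFC", text)               # Unicode NFC
    text = text.translate(str.maketrans(SMART_CHARS))       # smart-character folding
    text = text.translate({ord(c): None for c in ZERO_WIDTH})
    # Drop control characters (category Cc) but keep newlines and tabs.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    text = re.sub(r"[ \t]+", " ", text)                     # collapse internal whitespace
    return text.strip()                                     # trim
```

Each step corresponds loosely to one of the per-operation toggles, so disabling a toggle would amount to skipping the matching line.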
```
python -m src.cli_text_clean INPUT_FILE [OPTIONS]
```

## Arguments

| Argument | Required | Description |
|----------|----------|-------------|
| `INPUT_FILE` | Yes | Path to the CSV, TSV, or Excel file to clean |

## Options

### Core

| Flag | Short | Default | Description |
|------|-------|---------|-------------|
| `--apply` | | `false` | Write output files. Without this flag, only a preview is shown. |
| `--output` | `-o` | `{input}_cleaned.csv` | Output file path. |
| `--preset` | | `excel-hygiene` | Preset bundle of safe defaults. See [Presets](#presets). |

### Scope

| Flag | Default | Description |
|------|---------|-------------|
| `--columns` | all string columns | Comma-separated columns to clean. |
| `--skip` | none | Comma-separated columns to skip even if they look like text. Useful for free-text notes columns you don't want touched. |

### Per-operation toggles

These override the active preset.

| Flag | Effect |
|------|--------|
| `--no-trim` | Disable leading/trailing whitespace strip |
| `--no-collapse` | Disable internal whitespace collapse |
| `--no-nfc` | Disable Unicode NFC normalization |
| `--nfkc` | Enable NFKC compatibility fold (lossy: `①` → `1`, `ﬁ` → `fi`) |
| `--no-smart-chars` | Disable smart-character folding (curly quotes, em/en-dash, NBSP, ellipsis) |
| `--no-zero-width` | Disable zero-width / invisible character strip |
| `--no-bom` | Disable leading BOM strip |
| `--no-control` | Disable control-character strip |
| `--no-line-endings` | Disable line-ending normalization |

### Case conversion

| Flag | Forms | Description |
|------|-------|-------------|
| `--case` | `upper`, `lower`, `title`, `sentence` | Apply this case to every selected column |
| `--case` | `mode:col[,mode:col]` | Per-column case (e.g., `--case title:name,upper:code`) |

Title case preserves all-caps tokens (`USA` stays `USA`) and lowercases mid-string particles (`of`, `and`, `the`, etc.).
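The title-case rule above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's implementation: the function name `title_case`, the contents of `PARTICLES`, and the reading of "mid-string" as "any word except the first" are all assumptions.

```python
# Hypothetical particle list; the tool's actual list is not given in this reference.
PARTICLES = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}


def title_case(text: str) -> str:
    """Illustrative title-casing that keeps acronyms and lowercases particles."""
    out = []
    for i, word in enumerate(text.split()):
        if len(word) > 1 and word.isupper():
            out.append(word)                    # preserve all-caps tokens such as USA
        elif i > 0 and word.lower() in PARTICLES:
            out.append(word.lower())            # lowercase particles after the first word
        else:
            out.append(word[:1].upper() + word[1:].lower())
    return " ".join(out)


print(title_case("bank of the USA"))  # Bank of the USA
```

One consequence of preserving all-caps tokens in this sketch is that a fully upper-case value such as `JOHN SMITH` passes through unchanged. The sketch also collapses runs of whitespace as a side effect of `split()`.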
### Audit and config

| Flag | Default | Description |
|------|---------|-------------|
| `--full-changelog` | `false` | Write every cell change to the audit CSV (default caps to first 1000). |
| `--config` | none | Load options from a saved JSON config file. |
| `--save-config` | none | Save the current options to a JSON config file. |

### File format / encoding

| Flag | Default | Description |
|------|---------|-------------|
| `--sheet` | `0` | Excel sheet name or 0-based index. |
| `--encoding` | auto-detect | Override auto-detected file encoding. |
| `--header-row` | auto-detect | 0-based row index for the header. |

## Presets

| Preset | What it does |
|---|---|
| `minimal` | Trim + collapse whitespace only. Nothing else. |
| `excel-hygiene` (default) | Trim, collapse, NFC, smart-character fold, zero-width strip, BOM strip, control strip, line-ending normalize. NFKC off. |
| `paranoid` | All of `excel-hygiene` plus NFKC compatibility fold (lossy). |

## Output Files

When `--apply` is set:

| File | Description |
|------|-------------|
| `{stem}_cleaned.csv` | Cleaned DataFrame |
| `{stem}_changes.csv` | Per-cell audit: `row`, `column`, `old`, `new`, `ops_applied` (capped to 1000 rows by default; use `--full-changelog` for all) |

A timestamped log is always written to `logs/text_clean_YYYYMMDD_HHMMSS.log`.
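To get a quick overview of a `{stem}_changes.csv` audit file, something like the following sketch may help. It assumes the file has a header row with the column names listed in the table above; the helper name `summarize_changes` and the example file name are hypothetical.

```python
import csv
from collections import Counter


def summarize_changes(lines):
    """Count audited cell changes per column.

    `lines` is any iterable of CSV text lines, such as an open file.
    Assumes a header row containing a `column` field.
    """
    reader = csv.DictReader(lines)
    return Counter(row["column"] for row in reader)


# Usage (the file name is a placeholder):
# with open("messy_changes.csv", newline="", encoding="utf-8") as fh:
#     for column, count in summarize_changes(fh).most_common():
#         print(column, count)
```

Because the audit is capped at 1000 rows by default, these counts can understate the real number of changes unless the run used `--full-changelog`.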
## Recipes

```bash
# Preview what would change with the safe defaults
python -m src.cli_text_clean messy.csv

# Apply the safe defaults
python -m src.cli_text_clean messy.csv --apply

# Just the basics: only trim and collapse, leave Unicode/quotes alone
python -m src.cli_text_clean messy.csv --preset minimal --apply

# Title-case the name column, upper-case the SKU column, leave others alone for case
python -m src.cli_text_clean people.csv --case title:name,upper:sku --apply

# Clean only specific columns
python -m src.cli_text_clean orders.csv --columns vendor,product --apply

# Skip a free-text notes column from cleaning
python -m src.cli_text_clean tickets.csv --skip notes --apply

# Save the current settings as a profile and reload it later
python -m src.cli_text_clean messy.csv --preset minimal --case upper --save-config my.json
python -m src.cli_text_clean other.csv --config my.json --apply
```

---

## Analyzer (upload-time scan)

```
python -m src.cli_analyze INPUT_FILE [OPTIONS]

  --sample-rows N   Cap on rows scanned (default 1000)
  --json            Print findings as a JSON array on stdout
  --strict          Exit non-zero on any warn/error finding
```

JSON output schema (one object per finding):

```json
{
  "id": "smart_punctuation_in_data",
  "severity": "warn",
  "confidence": "high",
  "fix_action": "fold_smart_punctuation",
  "pre_applied": false,
  "tool": "02_text_cleaner",
  "count": 17,
  "description": "17 cell(s) contain curly quotes…",
  "column": null,
  "samples": [{"row": 3, "column": "name", "value": "“Alice”"}]
}
```

- `severity`: `info` / `warn` / `error`. Only `error` blocks the GUI normalization gate.
- `confidence`: `high` (round-trip-safe, eligible for one-click auto-fix), `medium` (preview before applying), `low` (heuristic, opt-in only).
- `fix_action`: stable id naming the algorithm in `src/core/fixes.py` that resolves the finding. Empty string for informational-only findings.
- `pre_applied`: `true` for fixes already applied during the byte-level read pass (BOM strip, NUL strip, line-ending normalize, byte-level smart-quote fold, transcode-to-UTF-8 from UTF-16/32). The GUI gate treats these as already-resolved; the CLI emits them so callers can audit what changed during read.

The detector set covers smart punctuation, NBSP / Unicode whitespace, zero-width characters, dirty headers, whitespace padding, null-like sentinels, mojibake fingerprints (UTF-8-as-cp1252), mixed-case email columns, near-duplicate rows (case-and-padding stripped), leading-zero IDs (Excel hazard), mixed line endings, encoding decode failure (`encoding_decode_failed`), and U+FFFD presence in the loaded text (`encoding_uncertain`). New detectors plug in by appending one entry to `analyze.py` and one matching fix in `fixes.py`.
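A script consuming the `--json` output might triage findings along the lines described above. The sketch below relies only on the documented fields (`severity`, `confidence`, `fix_action`, `pre_applied`); the helper names `auto_fixable` and `has_blocking` are hypothetical and not part of DataTools.

```python
import json


def auto_fixable(findings):
    """Findings that are high-confidence, name a fix, and were not applied during read."""
    return [
        f for f in findings
        if f.get("confidence") == "high"
        and f.get("fix_action")                  # empty string means informational only
        and not f.get("pre_applied", False)
    ]


def has_blocking(findings):
    """True if any finding has severity `error`."""
    return any(f.get("severity") == "error" for f in findings)


# Usage, assuming the output was saved with:
#   python -m src.cli_analyze data.csv --json > findings.json
# with open("findings.json", encoding="utf-8") as fh:
#     findings = json.load(fh)
# print([f["id"] for f in auto_fixable(findings)], has_blocking(findings))
```

Filtering on `pre_applied` keeps the list limited to fixes that still need to run, since pre-applied findings are reported for auditing only.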