> 🌐 **Language:** English · [Español](CLI-REFERENCE.es.md)

# CLI Reference

Three CLI modules, one per Ready tool:

| Module | Command | Purpose |
|--------|---------|---------|
| `src.cli` | `python -m src.cli FILE` | Find Duplicates |
| `src.cli_text_clean` | `python -m src.cli_text_clean FILE` | Clean Text |
| `src.cli_analyze` | `python -m src.cli_analyze FILE` | Analyzer (read-only scan) |

Find Duplicates and Clean Text are **preview-only by default**; add `--apply` to write output. The Analyzer is always read-only.

---

# Find Duplicates

```
python -m src.cli INPUT_FILE [OPTIONS]
```

## Options

### Core

- `--apply`: write output files (default: preview).
- `-o, --output PATH`: output path (default `{input}_deduplicated.csv`).

### Column selection

- `-s, --subset COLS`: comma-separated columns to match on (default: auto-detect).
- `-k, --key COLS`: strong-key columns; each becomes an independent exact-match strategy (for example `fb_id`, `ein`, `sku`).

### Fuzzy matching

- `--fuzzy COLS`: comma-separated columns to fuzzy-match.
- `-a, --algorithm ALG`: `levenshtein` / `jaro_winkler` (default) / `token_set_ratio`.
- `-t, --threshold N`: similarity 0-100 (default 85).

### Normalization

- `--normalize COL:TYPE`: comma-separated `col:type` pairs. Types: `email`, `phone`, `name`, `address`, `string`.

| Type | Effect | Example |
|------|--------|---------|
| `email` | lowercase, strip Gmail dots, strip `+tag` | `John.Doe+x@gmail.com` → `johndoe@gmail.com` |
| `phone` | E.164 (extension preserved) | `(555) 123-4567 ext 100` → `+15551234567;ext=100` |
| `name` | strip titles, suffixes, and particles; case-fold | `Dr. Charles de Gaulle Jr.` → `charles gaulle` |
| `address` | USPS abbreviations, state name → 2-letter code, case-fold | `123 Main Street, California` → `123 main st ca` |
| `string` | trim, collapse whitespace, case-fold | ` HELLO WORLD ` → `hello world` |

### Survivor selection

- `--survivor RULE`: `first` (default) / `last` / `most-complete` / `most-recent`.
- `--date-column COL`: required for `most-recent`.
- `--merge`: fill blanks in the survivor from removed rows.

### Interactive review

- `--review`: prompt y/n/s per match group with a side-by-side diff.

### Configuration

- `--config PATH`: load all settings from JSON.
- `--save-config PATH`: save current settings to JSON.

### File handling

- `--sheet NAME|N`: Excel sheet name or 0-based index.
- `--encoding ENC`: override auto-detected encoding.
- `--header-row N`: 0-based header row.

## Recipes

```bash
# Basic auto-detect dedup: preview first, then write output
python -m src.cli customers.csv
python -m src.cli customers.csv --apply

# Fuzzy name match at 80%
python -m src.cli customers.csv --fuzzy name --threshold 80 --apply

# Multiple strong keys (OR logic)
python -m src.cli donors.csv --key fb_id,ein --apply

# Most-complete row + merge missing fields
python -m src.cli contacts.csv --survivor most-complete --merge --apply

# Most-recent + merge
python -m src.cli contacts.csv --survivor most-recent --date-column updated_at --merge --apply

# Interactive review
python -m src.cli customers.csv --review --apply

# Save / load profile
python -m src.cli customers.csv --fuzzy name --threshold 80 --save-config dedup.json
python -m src.cli new.csv --config dedup.json --apply

# Excel
python -m src.cli data.xlsx --sheet "Sales" --apply
```

## Algorithms

- **`jaro_winkler`** (default): best for short strings (names); weights early characters.
- **`levenshtein`**: edit-distance ratio; typos and transpositions.
- **`token_set_ratio`**: best for addresses; ignores word order.
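
To make the relationship between an edit-distance ratio and `--threshold` concrete, here is a minimal pure-Python sketch. It is a generic illustration, not the code in `src`: the function names are hypothetical, and the normalization (dividing by the longer string's length) is an assumption, so the tool's actual scores may differ.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Edit distance counting single-character inserts, deletes, substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete from a
                current[j - 1] + 1,      # insert into a
                previous[j - 1] + cost,  # substitute (or keep)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Similarity on a 0-100 scale (assumed normalization: longer length)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (longest - levenshtein_distance(a, b)) / longest


def is_match(a: str, b: str, threshold: float = 85.0) -> bool:
    """Two values match when their similarity reaches the threshold."""
    return similarity(a, b) >= threshold


print(similarity("jon smith", "john smith"))
print(is_match("jon smith", "john smith"))
print(is_match("jon smith", "jane smythe"))
```

Under this normalization a single missing letter in a ten-character name still clears the default threshold of 85, while lowering the threshold admits progressively looser pairs.
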
## Auto-detection

When neither `--subset` nor `--fuzzy` is given, columns are detected by name:

| Pattern | Algorithm | Threshold | Normalizer | Key |
|---------|-----------|-----------|------------|-----|
| Email | exact | 100% | email | strong |
| Phone | exact | 100% | phone | strong |
| Name | jaro_winkler | 85% | name | weak |
| Address | token_set_ratio | 80% | address | weak |

**Strategy rules**:

- Strong keys → standalone strategies, combined with OR.
- Weak keys → AND-paired with each strong key.
- No strong keys → weak keys are promoted to standalone.
- No patterns → exact match on all columns.

## Output files (with `--apply`)

| File | Contents |
|------|----------|
| `{stem}_deduplicated.csv` | Cleaned data |
| `{stem}_removed.csv` | Removed rows |
| `{stem}_match_groups.csv` | `_group_id`, `_is_survivor`, `_confidence`, `_matched_on`, `_original_row` + original columns |

Log: `logs/dedup_YYYYMMDD_HHMMSS.log`.

---

# Clean Text

```
python -m src.cli_text_clean INPUT_FILE [OPTIONS]
```

Character-level hygiene. See [TECHNICAL.md §10.2](TECHNICAL.md) for the spec.

## Options

### Core

- `--apply`: write output (default: preview).
- `-o, --output PATH`: output path (default `{input}_cleaned.csv`).
- `--preset NAME`: `minimal` / `excel-hygiene` (default) / `paranoid`.

### Scope

- `--columns COLS`: comma-separated columns to clean (default: all string columns).
- `--skip COLS`: exclude these columns.

### Per-op overrides (override the active preset)

- `--no-trim`, `--no-collapse`, `--no-nfc`, `--nfkc`, `--no-smart-chars`, `--no-zero-width`, `--no-bom`, `--no-control`, `--no-line-endings`.

### Case

- `--case MODE`: `upper` / `lower` / `title` / `sentence`. Or per-column: `--case title:name,upper:sku`.
- Title case preserves all-caps tokens (`USA`) and lowercases mid-string particles (`of`, `and`).

### Audit + config

- `--full-changelog`: write every change (default: capped at the first 1000).
- `--config PATH` / `--save-config PATH`.
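
The title-case rule listed under Case can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tool's implementation: `title_case` and `PARTICLES` are hypothetical names, and the particle set contains only the two words the rule mentions.

```python
# Assumption: only the particles named in the rule above; the real list may be longer.
PARTICLES = {"of", "and"}


def title_case(text: str) -> str:
    """Title-case text, keeping all-caps tokens and lowercasing mid-string particles."""
    result = []
    for index, word in enumerate(text.split(" ")):
        if len(word) > 1 and word.isupper():
            result.append(word)  # preserve acronyms such as USA
        elif index > 0 and word.lower() in PARTICLES:
            result.append(word.lower())  # particle that is not the first word
        else:
            result.append(word[:1].upper() + word[1:].lower())
    return " ".join(result)


print(title_case("bank of america"))
print(title_case("made in the USA"))
print(title_case("of mice and men"))
```

One consequence of preserving all-caps tokens is that a value written entirely in capitals passes through unchanged in this sketch; whether the real cleaner behaves that way is not stated here.
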
### File

- `--sheet`, `--encoding`, `--header-row`: same as Find Duplicates.

## Presets

| Preset | What it does |
|--------|--------------|
| `minimal` | Trim + collapse only. |
| `excel-hygiene` (default) | Trim, collapse, NFC, smart-char fold, zero-width strip, BOM strip, control strip, line-ending normalize. |
| `paranoid` | `excel-hygiene` + NFKC compatibility fold (lossy). |

## Recipes

```bash
# Safe defaults (preview, then apply)
python -m src.cli_text_clean messy.csv
python -m src.cli_text_clean messy.csv --apply

# Just trim + collapse, leave Unicode alone
python -m src.cli_text_clean messy.csv --preset minimal --apply

# Title-case names, upper-case SKUs
python -m src.cli_text_clean people.csv --case title:name,upper:sku --apply

# Clean only specific columns
python -m src.cli_text_clean orders.csv --columns vendor,product --apply

# Skip a free-text notes column
python -m src.cli_text_clean tickets.csv --skip notes --apply
```

## Output files (with `--apply`)

| File | Contents |
|------|----------|
| `{stem}_cleaned.csv` | Cleaned data |
| `{stem}_changes.csv` | `row`, `column`, `old`, `new`, `ops_applied` (capped at 1000 rows; `--full-changelog` removes the cap) |

Log: `logs/text_clean_YYYYMMDD_HHMMSS.log`.

---

# Analyzer

```
python -m src.cli_analyze INPUT_FILE [OPTIONS]
```

Read-only scan; surfaces every detector finding without modifying the file.

## Options

- `--sample-rows N`: cap on rows scanned (default 1000).
- `--json`: print findings as a JSON array on stdout.
- `--strict`: exit non-zero on any warn/error finding.

## JSON schema (one object per finding)

```json
{
  "id": "smart_punctuation_in_data",
  "severity": "warn",
  "confidence": "high",
  "fix_action": "fold_smart_punctuation",
  "pre_applied": false,
  "tool": "02_text_cleaner",
  "count": 17,
  "description": "17 cell(s) contain curly quotes…",
  "column": null,
  "samples": [{"row": 3, "column": "name", "value": "“Alice”"}]
}
```

## Field meanings

- `severity`: `info` / `warn` / `error`. Only `error` blocks the GUI gate.
- `confidence`: `high` (one-click), `medium` (preview), `low` (opt-in).
- `fix_action`: id of the algorithm in `src/core/fixes.py`. Empty for informational-only findings.
- `pre_applied`: `true` for fixes already applied during the byte-level read pass.

## Detectors

Smart punctuation, NBSP / Unicode whitespace, zero-width chars, dirty headers, whitespace padding, null-like sentinels, mojibake fingerprints, mixed-case email columns, inconsistent date formats, near-duplicate rows, leading-zero IDs, mixed line endings, encoding decode failure, U+FFFD presence.

To add a detector, append an entry in `analyze.py` and a matching fix in `fixes.py`. No other call sites change.
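
For scripting against `--json`, the sketch below shows one way to consume the findings array using only the fields documented above. The helper names are hypothetical, and `SAMPLE` is made-up data in the documented shape (its second entry is a placeholder, not a real detector id).

```python
import json
from collections import Counter

# Made-up sample in the documented shape; "placeholder_info_finding" is not a real id.
SAMPLE = """
[
  {"id": "smart_punctuation_in_data", "severity": "warn", "confidence": "high",
   "fix_action": "fold_smart_punctuation", "pre_applied": false, "count": 17},
  {"id": "placeholder_info_finding", "severity": "info", "confidence": "low",
   "fix_action": "", "pre_applied": false, "count": 2}
]
"""


def load_findings(raw: str) -> list:
    """Parse the JSON array that --json prints on stdout."""
    findings = json.loads(raw)
    if not isinstance(findings, list):
        raise ValueError("expected a JSON array of findings")
    return findings


def count_by_severity(findings: list) -> Counter:
    """Tally findings per severity level (info / warn / error)."""
    return Counter(finding["severity"] for finding in findings)


def pending_one_click_fixes(findings: list) -> list:
    """Fix ids that are high-confidence, non-empty, and not already pre-applied."""
    return sorted({
        finding["fix_action"]
        for finding in findings
        if finding.get("confidence") == "high"
        and finding.get("fix_action")
        and not finding.get("pre_applied")
    })


def would_fail_strict(findings: list) -> bool:
    """Mirror the documented --strict rule: any warn or error finding fails."""
    return any(finding["severity"] in {"warn", "error"} for finding in findings)


findings = load_findings(SAMPLE)
print(dict(count_by_severity(findings)))
print(pending_one_click_fixes(findings))
print(would_fail_strict(findings))
```

In practice the input would come from the analyzer's stdout rather than an embedded string, for example by capturing the output of `python -m src.cli_analyze FILE --json` and passing it to `load_findings`.
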