
> 🌐 **Language:** English · [Español](CLI-REFERENCE.es.md)

# CLI Reference

Three CLI modules, one per Ready tool:

| Module | Command | Purpose |
|--------|---------|---------|
| `src.cli` | `python -m src.cli FILE` | Find Duplicates |
| `src.cli_text_clean` | `python -m src.cli_text_clean FILE` | Clean Text |
| `src.cli_analyze` | `python -m src.cli_analyze FILE` | Analyzer (read-only scan) |

Find Duplicates and Clean Text are **preview-only by default**; add `--apply` to write output. Analyzer is always read-only.

---

# Find Duplicates

```
python -m src.cli INPUT_FILE [OPTIONS]
```

## Options

### Core
- `--apply` — write output files (default: preview).
- `-o, --output PATH` — output path (default `{input}_deduplicated.csv`).

### Column selection
- `-s, --subset COLS` — comma-separated columns to match on (default: auto-detect).
- `-k, --key COLS` — strong-key columns; each becomes an independent exact-match strategy (`fb_id`, `ein`, `sku`).

### Fuzzy matching
- `--fuzzy COLS` — comma-separated columns to fuzzy-match.
- `-a, --algorithm ALG` — `levenshtein` / `jaro_winkler` (default) / `token_set_ratio`.
- `-t, --threshold N` — similarity 0-100 (default 85).

### Normalization
- `--normalize COL:TYPE` — comma-separated `col:type` pairs. Types: `email`, `phone`, `name`, `address`, `string`.

| Type | Effect | Example |
|------|--------|---------|
| `email` | lowercase, strip Gmail dots, strip `+tag` | `John.Doe+x@gmail.com` → `johndoe@gmail.com` |
| `phone` | E.164 (+ ext preserved) | `(555) 123-4567 ext 100` → `+15551234567;ext=100` |
| `name` | strip titles + suffixes + particles, case-fold | `Dr. Charles de Gaulle Jr.` → `charles gaulle` |
| `address` | USPS abbrevs + state name → 2-letter, case-fold | `123 Main Street, California` → `123 main st ca` |
| `string` | trim + collapse + case-fold | `  HELLO   WORLD  ` → `hello world` |
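
As a rough illustration of the `email` and `string` rows above, here is a minimal sketch in plain Python. It is not the tool's actual code: the function names are hypothetical, and two details are assumptions (that `+tag` is stripped for every domain, and that `googlemail.com` is treated like `gmail.com`).

```python
import re


def normalize_string(value: str) -> str:
    """Trim, collapse whitespace runs to one space, and case-fold."""
    return re.sub(r"\s+", " ", value.strip()).casefold()


def normalize_email(value: str) -> str:
    """Lowercase, drop a +tag, and strip dots from Gmail local parts."""
    cleaned = value.strip().lower()
    local, _, domain = cleaned.rpartition("@")
    if not local:
        # No usable "@" in the value: return it lowercased and untouched otherwise.
        return cleaned
    local = local.split("+", 1)[0]  # assumption: +tag is stripped for all domains
    if domain in {"gmail.com", "googlemail.com"}:  # assumption: Gmail domain list
        local = local.replace(".", "")
    return f"{local}@{domain}"


print(normalize_email("John.Doe+x@gmail.com"))
print(normalize_string("  HELLO   WORLD  "))
```

Normalized values like these are what the matching step compares, so two rows that differ only in case, padding, or a Gmail alias end up with the same key.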

### Survivor selection
- `--survivor RULE` — `first` (default) / `last` / `most-complete` / `most-recent`.
- `--date-column COL` — required for `most-recent`.
- `--merge` — fill blanks in survivor from removed rows.
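
The survivor rules and `--merge` can be pictured with the sketch below. It works on plain dictionaries rather than the tool's real data structures, the function names are hypothetical, and it assumes dates are ISO-formatted strings so that string comparison orders them correctly.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


def _is_blank(value: Optional[str]) -> bool:
    return value is None or value == ""


def pick_survivor(group: List[Row], rule: str = "first",
                  date_column: Optional[str] = None) -> Row:
    """Return the row to keep from one match group."""
    if rule == "first":
        return group[0]
    if rule == "last":
        return group[-1]
    if rule == "most-complete":
        # max() keeps the earliest row when several rows tie.
        return max(group, key=lambda r: sum(not _is_blank(v) for v in r.values()))
    if rule == "most-recent":
        if date_column is None:
            raise ValueError("most-recent requires a date column")
        # Assumption: ISO date strings, so lexical order equals date order.
        return max(group, key=lambda r: r.get(date_column) or "")
    raise ValueError(f"unknown survivor rule: {rule}")


def merge_blanks(survivor: Row, group: List[Row]) -> Row:
    """Fill blank survivor fields from the other rows in the group."""
    merged = dict(survivor)
    for row in group:
        if row is survivor:
            continue
        for column, value in row.items():
            if _is_blank(merged.get(column)) and not _is_blank(value):
                merged[column] = value
    return merged
```

With `most-recent` plus `--merge`, the newest row wins but still inherits any field it left blank from the older rows.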

### Interactive review
- `--review` — prompt y/n/s per match group with side-by-side diff.

### Configuration
- `--config PATH` — load all settings from JSON.
- `--save-config PATH` — save current settings to JSON.

### File handling
- `--sheet NAME|N` — Excel sheet name or 0-based index.
- `--encoding ENC` — override auto-detected encoding.
- `--header-row N` — 0-based header row.

## Recipes

```bash
# Basic auto-detect dedup (preview first, then write)
python -m src.cli customers.csv
python -m src.cli customers.csv --apply

# Fuzzy name match at 80%
python -m src.cli customers.csv --fuzzy name --threshold 80 --apply

# Multiple strong keys (OR logic)
python -m src.cli donors.csv --key fb_id,ein --apply

# Most-complete row + merge missing fields
python -m src.cli contacts.csv --survivor most-complete --merge --apply

# Most-recent + merge
python -m src.cli contacts.csv --survivor most-recent --date-column updated_at --merge --apply

# Interactive review
python -m src.cli customers.csv --review --apply

# Save / load profile
python -m src.cli customers.csv --fuzzy name --threshold 80 --save-config dedup.json
python -m src.cli new.csv       --config dedup.json --apply

# Excel
python -m src.cli data.xlsx --sheet "Sales" --apply
```

## Algorithms

- **`jaro_winkler`** (default) — best for short strings (names); weights early chars.
- **`levenshtein`** — edit-distance ratio; typos and transpositions.
- **`token_set_ratio`** — best for addresses; ignores word order.
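
To make the 0-100 threshold concrete, the sketch below computes a Levenshtein-based similarity in pure Python. The scaling used here (`100 * (1 - distance / longest length)`) is an assumption for illustration; the tool's exact formula is not documented in this file.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Similarity on a 0-100 scale (assumed scaling, see lead-in)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


def is_match(a: str, b: str, threshold: float = 85) -> bool:
    return similarity(a, b) >= threshold
```

Under this scaling, `jonathan` and `jonathon` differ by one edit in eight characters and score 87.5, which clears the default threshold of 85.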

## Auto-detection

When neither `--subset` nor `--fuzzy` is given, columns are detected by name:

| Pattern | Algorithm | Threshold | Normalizer | Key |
|---------|-----------|-----------|------------|-----|
| Email | exact | 100% | email | strong |
| Phone | exact | 100% | phone | strong |
| Name | jaro_winkler | 85% | name | weak |
| Address | token_set_ratio | 80% | address | weak |

**Strategy rules**:

- Strong keys: each is a standalone strategy, OR-ed with the others.
- Weak keys: AND-paired with each strong key.
- No strong keys: weak keys are promoted to standalone strategies.
- No patterns detected: exact match on all columns.

## Output files (with `--apply`)

| File | Contents |
|------|----------|
| `{stem}_deduplicated.csv` | Cleaned data |
| `{stem}_removed.csv` | Removed rows |
| `{stem}_match_groups.csv` | `_group_id`, `_is_survivor`, `_confidence`, `_matched_on`, `_original_row` + originals |

Log: `logs/dedup_YYYYMMDD_HHMMSS.log`.
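
A wrapper script that needs to pick up these files can derive their names from the input path. The sketch below assumes the outputs are written next to the input file, which this reference does not state; the helper name is hypothetical.

```python
from pathlib import Path
from typing import Dict


def expected_outputs(input_file: str) -> Dict[str, Path]:
    """Map each output kind to its {stem}_<kind>.csv path."""
    source = Path(input_file)
    kinds = ("deduplicated", "removed", "match_groups")
    # Assumption: outputs land in the same directory as the input.
    return {kind: source.with_name(f"{source.stem}_{kind}.csv") for kind in kinds}


for kind, path in expected_outputs("data/customers.xlsx").items():
    print(kind, path)
```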

---

# Clean Text

```
python -m src.cli_text_clean INPUT_FILE [OPTIONS]
```

Character-level hygiene. See [TECHNICAL.md §10.2](TECHNICAL.md) for the spec.

## Options

### Core
- `--apply` — write output (default: preview).
- `-o, --output PATH` — output path (default `{input}_cleaned.csv`).
- `--preset NAME` — `minimal` / `excel-hygiene` (default) / `paranoid`.

### Scope
- `--columns COLS` — comma-separated columns to clean (default: all string columns).
- `--skip COLS` — exclude these columns.

### Per-op overrides (override the active preset)
- `--no-trim`, `--no-collapse`, `--no-nfc`, `--nfkc`, `--no-smart-chars`, `--no-zero-width`, `--no-bom`, `--no-control`, `--no-line-endings`.

### Case
- `--case MODE` — `upper` / `lower` / `title` / `sentence`. Or per-column: `--case title:name,upper:sku`.
- Title case preserves all-caps tokens (`USA`) and lowercases mid-string particles (`of`, `and`).
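
The title-case behaviour described above can be sketched as follows. The particle set contains only the two words named in the text; the tool's full list is not documented here, and the function name is hypothetical.

```python
PARTICLES = {"of", "and"}  # only the examples from the text; real list unknown


def title_case(text: str) -> str:
    """Title-case words, keep all-caps tokens, lowercase mid-string particles."""
    result = []
    for index, word in enumerate(text.split(" ")):
        if len(word) > 1 and word.isupper():
            result.append(word)                      # all-caps token such as USA
        elif index > 0 and word.lower() in PARTICLES:
            result.append(word.lower())              # particle, but not as first word
        else:
            result.append(word[:1].upper() + word[1:].lower())
    return " ".join(result)


print(title_case("united states of america"))
print(title_case("made in USA and canada"))
```

Note that in this sketch a particle typed in all caps (`AND`) is treated as an all-caps token and left alone; whether the tool does the same is not specified.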

### Audit + config
- `--full-changelog` — write every change (default caps to first 1000).
- `--config PATH` / `--save-config PATH`.

### File
- `--sheet`, `--encoding`, `--header-row` — same as Find Duplicates.

## Presets

| Preset | What it does |
|--------|--------------|
| `minimal` | Trim + collapse only. |
| `excel-hygiene` (default) | Trim, collapse, NFC, smart-char fold, zero-width strip, BOM strip, control strip, line-ending normalize. |
| `paranoid` | `excel-hygiene` + NFKC compatibility fold (lossy). |
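
The three presets can be approximated with the standard library, as in the sketch below. The order of operations, the smart-character table, and the zero-width character set are all assumptions made for illustration; TECHNICAL.md remains the authority on what each op actually does.

```python
import re
import unicodedata

# Assumed mappings; the tool's real tables may be larger.
SMART_TABLE = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en and em dashes
    "\u2026": "...",                # ellipsis
})
ZERO_WIDTH_TABLE = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060"))


def clean_cell(value: str, preset: str = "excel-hygiene") -> str:
    if preset not in {"minimal", "excel-hygiene", "paranoid"}:
        raise ValueError(f"unknown preset: {preset}")
    if preset != "minimal":
        value = value.replace("\ufeff", "")                      # BOM strip
        value = value.replace("\r\n", "\n").replace("\r", "\n")  # line endings
        value = value.translate(ZERO_WIDTH_TABLE)                # zero-width strip
        value = value.translate(SMART_TABLE)                     # smart-char fold
        # Control strip: drop category Cc except newline and tab.
        value = "".join(ch for ch in value
                        if ch in "\n\t" or unicodedata.category(ch) != "Cc")
        form = "NFKC" if preset == "paranoid" else "NFC"
        value = unicodedata.normalize(form, value)
    # Trim + collapse runs of spaces and tabs (applies to every preset).
    return re.sub(r"[ \t]+", " ", value.strip())
```

The difference between `excel-hygiene` and `paranoid` shows up on compatibility characters: NFC leaves the ligature `ﬁ` (U+FB01) alone, while NFKC rewrites it to the two letters `fi`, which is why the fold is described as lossy.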

## Recipes

```bash
# Safe defaults (preview, then apply)
python -m src.cli_text_clean messy.csv
python -m src.cli_text_clean messy.csv --apply

# Just trim + collapse, leave Unicode alone
python -m src.cli_text_clean messy.csv --preset minimal --apply

# Title-case names, upper-case SKUs
python -m src.cli_text_clean people.csv --case title:name,upper:sku --apply

# Clean only specific columns
python -m src.cli_text_clean orders.csv --columns vendor,product --apply

# Skip a free-text notes column
python -m src.cli_text_clean tickets.csv --skip notes --apply
```

## Output files (with `--apply`)

| File | Contents |
|------|----------|
| `{stem}_cleaned.csv` | Cleaned data |
| `{stem}_changes.csv` | `row`, `column`, `old`, `new`, `ops_applied` (capped at the first 1000 changes unless `--full-changelog` is given) |

Log: `logs/text_clean_YYYYMMDD_HHMMSS.log`.

---

# Analyzer

```
python -m src.cli_analyze INPUT_FILE [OPTIONS]
```

Read-only scan; surfaces every detector finding without modifying the file.

## Options
- `--sample-rows N` — cap on rows scanned (default 1000).
- `--json` — print findings as a JSON array on stdout.
- `--strict` — exit non-zero on any warn/error finding.

## JSON schema (one object per finding)

```json
{
  "id": "smart_punctuation_in_data",
  "severity": "warn",
  "confidence": "high",
  "fix_action": "fold_smart_punctuation",
  "pre_applied": false,
  "tool": "02_text_cleaner",
  "count": 17,
  "description": "17 cell(s) contain curly quotes…",
  "column": null,
  "samples": [{"row": 3, "column": "name", "value": "“Alice”"}]
}
```

## Field meanings
- `severity` — `info` / `warn` / `error`. Only `error` blocks the GUI gate.
- `confidence` — `high` (one-click), `medium` (preview), `low` (opt-in).
- `fix_action` — id of the algorithm in `src/core/fixes.py`. Empty for informational-only.
- `pre_applied` — `true` for fixes already applied during the byte-level read pass.
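
A script consuming `--json` output might use these fields as in the sketch below. The helper names are hypothetical, the second sample finding is invented for illustration, and the exit code `1` is an assumption (the option is only documented as exiting non-zero).

```python
import json
from typing import Dict, List

RAW = """
[
  {"id": "smart_punctuation_in_data", "severity": "warn", "confidence": "high",
   "fix_action": "fold_smart_punctuation", "pre_applied": false, "count": 17},
  {"id": "hypothetical_info_finding", "severity": "info", "confidence": "low",
   "fix_action": "", "pre_applied": false, "count": 2}
]
"""


def strict_exit_code(findings: List[Dict]) -> int:
    """Mirror --strict: non-zero when any finding is warn or error."""
    blocking = {"warn", "error"}
    return 1 if any(f["severity"] in blocking for f in findings) else 0


def one_click_fixes(findings: List[Dict]) -> List[str]:
    """Fix ids that are high-confidence, actionable, and not already applied."""
    return [
        f["fix_action"]
        for f in findings
        if f["confidence"] == "high" and f["fix_action"] and not f["pre_applied"]
    ]


findings = json.loads(RAW)
print(strict_exit_code(findings), one_click_fixes(findings))
```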

## Detectors

Smart punctuation, NBSP / Unicode whitespace, zero-width chars, dirty headers, whitespace padding, null-like sentinels, mojibake fingerprints, mixed-case email columns, inconsistent date formats, near-duplicate rows, leading-zero IDs, mixed line endings, encoding decode failure, U+FFFD presence.

To add a detector, append an entry in `analyze.py` and a matching fix in `fixes.py`. No other call sites change.
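
As a loose illustration of what a detector produces, the sketch below scans rows for zero-width characters and returns one finding shaped like the JSON schema above. Everything about it is hypothetical: the function name, the `id` and `fix_action` values, the 0-based row numbering, and the way a detector would be registered in `analyze.py`, none of which this file specifies.

```python
from typing import Dict, List, Optional

ZERO_WIDTH = "\u200b\u200c\u200d\u2060"  # assumed character set


def detect_zero_width(rows: List[Dict[str, object]],
                      max_samples: int = 5) -> Optional[Dict]:
    """Return one finding dict, or None when no cell is affected."""
    count = 0
    samples = []
    for row_number, row in enumerate(rows):  # assumption: 0-based row numbers
        for column, value in row.items():
            if isinstance(value, str) and any(ch in ZERO_WIDTH for ch in value):
                count += 1
                if len(samples) < max_samples:
                    samples.append({"row": row_number, "column": column, "value": value})
    if count == 0:
        return None
    return {
        "id": "zero_width_chars",           # hypothetical id
        "severity": "warn",
        "confidence": "high",
        "fix_action": "strip_zero_width",   # hypothetical fix id
        "pre_applied": False,
        "tool": "02_text_cleaner",
        "count": count,
        "description": f"{count} cell(s) contain zero-width characters",
        "column": None,
        "samples": samples,
    }
```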