datatools-dev/docs/CLI-REFERENCE.md
Michael db5ec084da docs+code: rename tool labels everywhere
Sweep follow-up to 93e43fc. Display labels now consistent across docs,
landing pages, CLI output, code comments, docstrings, and test prose.
Five parallel surfaces touched:

- docs (EN + ES): README, USER-GUIDE, CLI-REFERENCE, and 11 internal
  design/planning docs
- landing pages: index + bookkeeper/revops/shopify-pet
- src: CLI module docstrings, _TOOL_DISPLAY dicts in cli_analyze.py
  and gui/components/_legacy.py, core module headers, every tool
  page's module docstring
- tests: class/method/module docstrings and section-header comments
- test-cases READMEs

Page slugs (1_Deduplicator etc.), tool_id strings (01_deduplicator
etc.), Python class names (TestDeduplicatorWorkflow, FeatureFlag.*),
URL paths, anchor IDs, CSS classes, and asset filenames were left
intact since they're code identifiers / structural references.

All 2033 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 19:50:09 +00:00

> 🌐 **Language:** English · [Español](CLI-REFERENCE.es.md)
# CLI Reference
Three CLI modules, one per Ready tool:
| Module | Command | Purpose |
|--------|---------|---------|
| `src.cli` | `python -m src.cli FILE` | Find Duplicates |
| `src.cli_text_clean` | `python -m src.cli_text_clean FILE` | Clean Text |
| `src.cli_analyze` | `python -m src.cli_analyze FILE` | Analyzer (read-only scan) |
Every command is **preview-only by default** — add `--apply` to write output.
---
# Find Duplicates
```
python -m src.cli INPUT_FILE [OPTIONS]
```
## Options
### Core
- `--apply` — write output files (default: preview).
- `-o, --output PATH` — output path (default `{input}_deduplicated.csv`).
### Column selection
- `-s, --subset COLS` — comma-separated columns to match on (default: auto-detect).
- `-k, --key COLS` — strong-key columns; each becomes an independent exact-match strategy (`fb_id`, `ein`, `sku`).
### Fuzzy matching
- `--fuzzy COLS` — comma-separated columns to fuzzy-match.
- `-a, --algorithm ALG`: `levenshtein` / `jaro_winkler` (default) / `token_set_ratio`.
- `-t, --threshold N` — similarity 0-100 (default 85).
### Normalization
- `--normalize COL:TYPE` — comma-separated `col:type` pairs. Types: `email`, `phone`, `name`, `address`, `string`.
| Type | Effect | Example |
|------|--------|---------|
| `email` | lowercase, strip Gmail dots, strip `+tag` | `John.Doe+x@gmail.com` → `johndoe@gmail.com` |
| `phone` | E.164 (+ ext preserved) | `(555) 123-4567 ext 100` → `+15551234567;ext=100` |
| `name` | strip titles + suffixes + particles, case-fold | `Dr. Charles de Gaulle Jr.` → `charles gaulle` |
| `address` | USPS abbrevs + state name → 2-letter, case-fold | `123 Main Street, California` → `123 main st ca` |
| `string` | trim + collapse + case-fold | ` HELLO WORLD ` → `hello world` |
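
The `email` and `string` rows above can be sketched in a few lines of Python. This is an illustration of the documented effects, not the tool's code: the function names are hypothetical, and it assumes that `+tag` is stripped for every domain while dots are stripped only for `gmail.com`.

```python
import re


def normalize_email(value: str) -> str:
    """Lowercase, drop any +tag, and drop dots in Gmail local parts (assumed rules)."""
    local, _, domain = value.strip().lower().partition("@")
    local = local.split("+", 1)[0]        # strip +tag
    if domain == "gmail.com":             # assumption: only gmail.com ignores dots
        local = local.replace(".", "")
    return f"{local}@{domain}"


def normalize_string(value: str) -> str:
    """Trim, collapse runs of whitespace to one space, and case-fold."""
    return re.sub(r"\s+", " ", value.strip()).casefold()


print(normalize_email("John.Doe+x@gmail.com"))   # johndoe@gmail.com
print(normalize_string("  HELLO   WORLD  "))     # hello world
```

Normalizing before comparison is what lets an exact match treat `John.Doe+x@gmail.com` and `johndoe@gmail.com` as the same key.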
### Survivor selection
- `--survivor RULE`: `first` (default) / `last` / `most-complete` / `most-recent`.
- `--date-column COL` — required for `most-recent`.
- `--merge` — fill blanks in survivor from removed rows.
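
As a rough illustration of how `--survivor most-complete` and `--merge` interact, the sketch below picks the row with the most non-blank fields and then fills its blanks from the other rows in the group. The helper names are hypothetical, and the tie-break (earliest row wins) and the merge order (first non-blank value found) are assumptions; the real tool may differ.

```python
def is_blank(value) -> bool:
    """Treat None and whitespace-only strings as blank."""
    return value is None or str(value).strip() == ""


def pick_most_complete(group: list) -> int:
    """Index of the row with the most non-blank fields (ties go to the earliest row)."""
    counts = [sum(not is_blank(v) for v in row.values()) for row in group]
    return counts.index(max(counts))


def merge_into_survivor(group: list, survivor_idx: int) -> dict:
    """Fill blank survivor fields from the rows that would be removed."""
    merged = dict(group[survivor_idx])
    for i, row in enumerate(group):
        if i == survivor_idx:
            continue
        for col, val in row.items():
            if is_blank(merged.get(col)) and not is_blank(val):
                merged[col] = val
    return merged


group = [
    {"name": "Ana Ruiz", "email": "", "phone": "555-0100"},
    {"name": "Ana Ruiz", "email": "ana@example.com", "phone": ""},
    {"name": "Ana Ruiz", "email": "", "phone": ""},
]
idx = pick_most_complete(group)
survivor = merge_into_survivor(group, idx)
print(idx, survivor)
```

Without `--merge`, the survivor would keep its blank `email`; with it, the value is recovered from a row that is otherwise discarded.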
### Interactive review
- `--review` — prompt y/n/s per match group with side-by-side diff.
### Configuration
- `--config PATH` — load all settings from JSON.
- `--save-config PATH` — save current settings to JSON.
### File handling
- `--sheet NAME|N` — Excel sheet name or 0-based index.
- `--encoding ENC` — override auto-detected encoding.
- `--header-row N` — 0-based header row.
## Recipes
```bash
# Basic auto-detect dedup (preview only; append --apply to write output)
python -m src.cli customers.csv
# Fuzzy name match at 80%
python -m src.cli customers.csv --fuzzy name --threshold 80 --apply
# Multiple strong keys (OR logic)
python -m src.cli donors.csv --key fb_id,ein --apply
# Most-complete row + merge missing fields
python -m src.cli contacts.csv --survivor most-complete --merge --apply
# Most-recent + merge
python -m src.cli contacts.csv --survivor most-recent --date-column updated_at --merge --apply
# Interactive review
python -m src.cli customers.csv --review --apply
# Save / load profile
python -m src.cli customers.csv --fuzzy name --threshold 80 --save-config dedup.json
python -m src.cli new.csv --config dedup.json --apply
# Excel
python -m src.cli data.xlsx --sheet "Sales" --apply
```
## Algorithms
- **`jaro_winkler`** (default) — best for short strings (names); weights early chars.
- **`levenshtein`** — edit-distance ratio; typos and transpositions.
- **`token_set_ratio`** — best for addresses; ignores word order.
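
To make the `--threshold` scale concrete, here is a minimal, dependency-free sketch of an edit-distance similarity on a 0-100 scale, plus the order-insensitive idea behind token-based matching. It is not the library the tool uses (which this document does not name), and real `token_set_ratio` implementations are more involved; the rescaling formula and function names are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance rescaled to 0-100, where 100 means identical (assumed formula)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


def sorted_tokens(text: str) -> str:
    """Order-insensitive form: unique lowercase tokens, sorted."""
    return " ".join(sorted(set(text.lower().split())))


# One substitution in five characters scores about 80, below the default threshold of 85.
print(similarity("smith", "smyth") >= 85)
# Word order stops mattering once tokens are sorted.
print(sorted_tokens("Main St 123") == sorted_tokens("123 main st"))
```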
## Auto-detection
When neither `--subset` nor `--fuzzy` is given, columns are detected by name:
| Pattern | Algorithm | Threshold | Normalizer | Key |
|---------|-----------|-----------|------------|-----|
| Email | exact | 100% | email | strong |
| Phone | exact | 100% | phone | strong |
| Name | jaro_winkler | 85% | name | weak |
| Address | token_set_ratio | 80% | address | weak |
**Strategy rules**: strong keys → standalone OR; weak keys → AND-paired with each strong key; no strong keys → weak promoted to standalone; no patterns → exact match on all columns.
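
The detection table can be read as a lookup from a column-name pattern to matching settings. The sketch below copies the table values but assumes a simple case-insensitive substring test, which may not be how the tool matches names; `Rule`, `RULES`, and `detect` are hypothetical names.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    algorithm: str
    threshold: int
    normalizer: str
    key: str  # "strong" or "weak"


# Values copied from the table above; the substring keywords are an assumption.
RULES = {
    "email": Rule("exact", 100, "email", "strong"),
    "phone": Rule("exact", 100, "phone", "strong"),
    "name": Rule("jaro_winkler", 85, "name", "weak"),
    "address": Rule("token_set_ratio", 80, "address", "weak"),
}


def detect(columns: list) -> dict:
    """Map each recognized column to its rule; the first matching keyword wins."""
    found = {}
    for col in columns:
        lowered = col.lower()
        for keyword, rule in RULES.items():
            if keyword in lowered:
                found[col] = rule
                break
    return found


detected = detect(["Email", "Full_Name", "notes"])
strong = [c for c, r in detected.items() if r.key == "strong"]
weak = [c for c, r in detected.items() if r.key == "weak"]
print(strong, weak)
```

Columns that match no pattern (here `notes`) are simply left out; per the strategy rules, a file with no recognized columns falls back to an exact match on all columns.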
## Output files (with `--apply`)
| File | Contents |
|------|----------|
| `{stem}_deduplicated.csv` | Cleaned data |
| `{stem}_removed.csv` | Removed rows |
| `{stem}_match_groups.csv` | `_group_id`, `_is_survivor`, `_confidence`, `_matched_on`, `_original_row` + originals |
Log: `logs/dedup_YYYYMMDD_HHMMSS.log`.
---
# Clean Text
```
python -m src.cli_text_clean INPUT_FILE [OPTIONS]
```
Character-level hygiene. See [TECHNICAL.md §10.2](TECHNICAL.md) for the spec.
## Options
### Core
- `--apply` — write output (default: preview).
- `-o, --output PATH` — output path (default `{input}_cleaned.csv`).
- `--preset NAME`: `minimal` / `excel-hygiene` (default) / `paranoid`.
### Scope
- `--columns COLS` — comma-separated columns to clean (default: all string columns).
- `--skip COLS` — exclude these columns.
### Per-op overrides (override the active preset)
- `--no-trim`, `--no-collapse`, `--no-nfc`, `--nfkc`, `--no-smart-chars`, `--no-zero-width`, `--no-bom`, `--no-control`, `--no-line-endings`.
### Case
- `--case MODE`: `upper` / `lower` / `title` / `sentence`. Or per-column: `--case title:name,upper:sku`.
- Title case preserves all-caps tokens (`USA`) and lowercases mid-string particles (`of`, `and`).
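
The title-case behavior described above can be sketched as follows. The function name is hypothetical, the particle list contains only the two words named here (the real list is likely longer), and "mid-string" is assumed to mean any position except the first word.

```python
PARTICLES = {"of", "and"}  # only the two named above; the tool's list may be longer


def smart_title(text: str) -> str:
    """Title-case words, keep all-caps tokens, lowercase particles after the first word."""
    out = []
    for i, word in enumerate(text.split()):
        if i > 0 and word.lower() in PARTICLES:
            out.append(word.lower())                      # mid-string particle
        elif len(word) > 1 and word.isupper():
            out.append(word)                              # all-caps token such as USA
        else:
            out.append(word[:1].upper() + word[1:].lower())
    return " ".join(out)


print(smart_title("department of defense USA"))  # Department of Defense USA
print(smart_title("of mice and men"))            # Of Mice and Men
```

Note that this sketch also collapses whitespace because it splits on it; in the CLI, collapsing is a separate operation controlled by the preset.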
### Audit + config
- `--full-changelog` — write every change (default caps to first 1000).
- `--config PATH` / `--save-config PATH`.
### File
- `--sheet`, `--encoding`, `--header-row` — same as Find Duplicates.
## Presets
| Preset | What it does |
|--------|--------------|
| `minimal` | Trim + collapse only. |
| `excel-hygiene` (default) | Trim, collapse, NFC, smart-char fold, zero-width strip, BOM strip, control strip, line-ending normalize. |
| `paranoid` | `excel-hygiene` + NFKC compatibility fold (lossy). |
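
The difference between the presets is easiest to see on concrete strings. The sketch below approximates the listed operations with the standard library; it is not the tool's implementation. The order of operations, the exact character sets (which quotes count as smart characters, which code points count as zero-width), and the `clean` function itself are assumptions.

```python
import re
import unicodedata

# Assumed character sets; the tool's lists may be broader.
SMART_QUOTES = str.maketrans({"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"})
ZERO_WIDTH = re.compile("[\u200b\u200c\u200d\u2060]")
CONTROL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")  # keeps tab and newline


def clean(value: str, preset: str = "excel-hygiene") -> str:
    if preset not in {"minimal", "excel-hygiene", "paranoid"}:
        raise ValueError(f"unknown preset: {preset}")
    if preset != "minimal":
        value = value.replace("\ufeff", "")                      # BOM strip
        form = "NFKC" if preset == "paranoid" else "NFC"
        value = unicodedata.normalize(form, value)               # Unicode normalization
        value = value.translate(SMART_QUOTES)                    # smart-char fold
        value = ZERO_WIDTH.sub("", value)                        # zero-width strip
        value = value.replace("\r\n", "\n").replace("\r", "\n")  # line-ending normalize
        value = CONTROL.sub("", value)                           # control strip
    return re.sub(r"[ \t]+", " ", value.strip())                 # trim + collapse


print(clean("  \u201cAlice\u201d  "))            # "Alice"
print(clean("\ufb01le") == "\ufb01le")           # NFC keeps the fi ligature
print(clean("\ufb01le", "paranoid") == "file")   # NFKC folds it to plain letters (lossy)
```

The last two lines show why `paranoid` is described as lossy: NFKC replaces compatibility characters such as ligatures with their plain equivalents, and that cannot be undone.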
## Recipes
```bash
# Safe defaults (preview only; append --apply to write output)
python -m src.cli_text_clean messy.csv
# Just trim + collapse, leave Unicode alone
python -m src.cli_text_clean messy.csv --preset minimal --apply
# Title-case names, upper-case SKUs
python -m src.cli_text_clean people.csv --case title:name,upper:sku --apply
# Clean only specific columns
python -m src.cli_text_clean orders.csv --columns vendor,product --apply
# Skip a free-text notes column
python -m src.cli_text_clean tickets.csv --skip notes --apply
```
## Output files (with `--apply`)
| File | Contents |
|------|----------|
| `{stem}_cleaned.csv` | Cleaned data |
| `{stem}_changes.csv` | `row`, `column`, `old`, `new`, `ops_applied` (capped to 1000; `--full-changelog` removes cap) |
Log: `logs/text_clean_YYYYMMDD_HHMMSS.log`.
---
# Analyzer
```
python -m src.cli_analyze INPUT_FILE [OPTIONS]
```
Read-only scan; surfaces every detector finding without modifying the file.
## Options
- `--sample-rows N` — cap on rows scanned (default 1000).
- `--json` — print findings as a JSON array on stdout.
- `--strict` — exit non-zero on any warn/error finding.
## JSON schema (one object per finding)
```json
{
  "id": "smart_punctuation_in_data",
  "severity": "warn",
  "confidence": "high",
  "fix_action": "fold_smart_punctuation",
  "pre_applied": false,
  "tool": "02_text_cleaner",
  "count": 17,
  "description": "17 cell(s) contain curly quotes…",
  "column": null,
  "samples": [{"row": 3, "column": "name", "value": "“Alice”"}]
}
```
## Field meanings
- `severity`: `info` / `warn` / `error`. Only `error` blocks the GUI gate.
- `confidence`: `high` (one-click), `medium` (preview), `low` (opt-in).
- `fix_action`: id of the algorithm in `src/core/fixes.py`. Empty for informational-only findings.
- `pre_applied`: `true` for fixes already applied during the byte-level read pass.
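
A script that consumes `--json` output might group findings by the fields described above. The sketch below works on a JSON string so it runs without the CLI; `summarize` is a hypothetical helper, and the second sample finding (`example_error`) is made up purely to exercise the `error` path.

```python
import json


def summarize(json_text: str) -> dict:
    """Group finding ids by severity and list fixes that look safe to apply in one click."""
    findings = json.loads(json_text)
    by_severity = {"info": [], "warn": [], "error": []}
    for finding in findings:
        by_severity[finding["severity"]].append(finding["id"])
    one_click = [
        f["fix_action"]
        for f in findings
        if f["confidence"] == "high" and f["fix_action"] and not f["pre_applied"]
    ]
    return {
        "by_severity": by_severity,
        "one_click": one_click,
        "blocks_gui": bool(by_severity["error"]),  # only error blocks the GUI gate
    }


sample = json.dumps([
    {"id": "smart_punctuation_in_data", "severity": "warn", "confidence": "high",
     "fix_action": "fold_smart_punctuation", "pre_applied": False},
    # Made-up finding, not a real detector id:
    {"id": "example_error", "severity": "error", "confidence": "low",
     "fix_action": "", "pre_applied": False},
])
report = summarize(sample)
print(report)
```

In practice the input would be the stdout of `python -m src.cli_analyze FILE --json`.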
## Detectors
Smart punctuation, NBSP / Unicode whitespace, zero-width chars, dirty headers, whitespace padding, null-like sentinels, mojibake fingerprints, mixed-case email columns, inconsistent date formats, near-duplicate rows, leading-zero IDs, mixed line endings, encoding decode failure, U+FFFD presence.
To add a detector, append an entry in `analyze.py` and a matching fix in `fixes.py`. No other call sites change.
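
One common way to get the "no other call sites change" property is a registry that the scanner iterates over. The sketch below shows that pattern only; every name in it (`Detector`, `DETECTORS`, `FIXES`, `scan`, and the ids) is hypothetical and is not taken from `analyze.py` or `fixes.py`.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Detector:
    id: str
    severity: str
    fix_action: str
    count: Callable[[list], int]  # returns the number of offending cells


# Hypothetical stand-in for the fixes module: fix_action id -> function.
FIXES = {"strip_padding": str.strip}

# Hypothetical stand-in for the detector list.
DETECTORS = [
    Detector("whitespace_padding", "warn", "strip_padding",
             lambda cells: sum(c != c.strip() for c in cells)),
]


def scan(cells: list) -> list:
    """Run every registered detector; the loop never needs editing."""
    findings = []
    for det in DETECTORS:
        n = det.count(cells)
        if n:
            findings.append({"id": det.id, "severity": det.severity,
                             "fix_action": det.fix_action, "count": n})
    return findings


# Adding a detector is one registry entry plus its fix.
FIXES["blank_null_like"] = lambda s: "" if s.strip().lower() in {"n/a", "null"} else s
DETECTORS.append(Detector("null_like_sentinels", "info", "blank_null_like",
                          lambda cells: sum(c.strip().lower() in {"n/a", "null"} for c in cells)))

results = scan([" a", "b", "N/A"])
print([f["id"] for f in results])
```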