Sweep follow-up to 93e43fc. Display labels now consistent across docs,
landing pages, CLI output, code comments, docstrings, and test prose.

Five parallel surfaces touched:
- docs (EN + ES): README, USER-GUIDE, CLI-REFERENCE, and 11 internal
  design/planning docs
- landing pages: index + bookkeeper/revops/shopify-pet
- src: CLI module docstrings, _TOOL_DISPLAY dicts in cli_analyze.py
  and gui/components/_legacy.py, core module headers, every tool
  page's module docstring
- tests: class/method/module docstrings and section-header comments
- test-cases READMEs

Page slugs (1_Deduplicator etc.), tool_id strings (01_deduplicator
etc.), Python class names (TestDeduplicatorWorkflow, FeatureFlag.*),
URL paths, anchor IDs, CSS classes, and asset filenames were left
intact since they're code identifiers / structural references.

All 2033 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

> 🌐 **Language:** English · [Español](CLI-REFERENCE.es.md)

# CLI Reference

Three CLI modules, one per Ready tool:

| Module | Command | Purpose |
|--------|---------|---------|
| `src.cli` | `python -m src.cli FILE` | Find Duplicates |
| `src.cli_text_clean` | `python -m src.cli_text_clean FILE` | Clean Text |
| `src.cli_analyze` | `python -m src.cli_analyze FILE` | Analyzer (read-only scan) |

Every command is **preview-only by default** — add `--apply` to write output.

---

# Find Duplicates

```
python -m src.cli INPUT_FILE [OPTIONS]
```

## Options

### Core
- `--apply` — write output files (default: preview).
- `-o, --output PATH` — output path (default `{input}_deduplicated.csv`).

### Column selection
- `-s, --subset COLS` — comma-separated columns to match on (default: auto-detect).
- `-k, --key COLS` — strong-key columns; each becomes an independent exact-match strategy (`fb_id`, `ein`, `sku`).

### Fuzzy matching
- `--fuzzy COLS` — comma-separated columns to fuzzy-match.
- `-a, --algorithm ALG` — `levenshtein` / `jaro_winkler` (default) / `token_set_ratio`.
- `-t, --threshold N` — similarity 0-100 (default 85).

### Normalization
- `--normalize COL:TYPE` — comma-separated `col:type` pairs. Types: `email`, `phone`, `name`, `address`, `string`.

| Type | Effect | Example |
|------|--------|---------|
| `email` | lowercase, strip Gmail dots, strip `+tag` | `John.Doe+x@gmail.com` → `johndoe@gmail.com` |
| `phone` | E.164 (+ ext preserved) | `(555) 123-4567 ext 100` → `+15551234567;ext=100` |
| `name` | strip titles + suffixes + particles, case-fold | `Dr. Charles de Gaulle Jr.` → `charles gaulle` |
| `address` | USPS abbrevs + state name → 2-letter, case-fold | `123 Main Street, California` → `123 main st ca` |
| `string` | trim + collapse + case-fold | ` HELLO WORLD ` → `hello world` |

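As a rough illustration of the `email` and `string` rows in the table above, here is a minimal Python sketch. It is not the project's implementation: the function names `normalize_email` and `normalize_string` are hypothetical, and limiting dot-stripping to `gmail.com` addresses is an assumption drawn from the table's wording.

```python
def normalize_string(value: str) -> str:
    """Trim, collapse internal whitespace runs, and case-fold."""
    # str.split() with no arguments drops leading/trailing whitespace
    # and splits on any run of whitespace.
    return " ".join(value.split()).casefold()


def normalize_email(value: str) -> str:
    """Lowercase, drop a `+tag` suffix, and strip dots for Gmail addresses."""
    cleaned = value.strip().lower()
    local, sep, domain = cleaned.rpartition("@")
    if not sep:
        # Not an email-shaped value; fall back to the lowercased input.
        return cleaned
    local = local.split("+", 1)[0]
    if domain == "gmail.com":  # assumption: only gmail.com gets dot-stripping
        local = local.replace(".", "")
    return f"{local}@{domain}"
```

With these definitions, `normalize_email("John.Doe+x@gmail.com")` yields `johndoe@gmail.com`, matching the table's example, while a non-Gmail address keeps its dots.
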
### Survivor selection
- `--survivor RULE` — `first` (default) / `last` / `most-complete` / `most-recent`.
- `--date-column COL` — required for `most-recent`.
- `--merge` — fill blanks in survivor from removed rows.

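The combination of `most-complete` and `--merge` can be pictured with the sketch below. It is illustrative only: rows are modeled as plain dicts, the helper names (`is_blank`, `pick_most_complete`, `merge_blanks`) are hypothetical, and two behaviors are assumptions rather than documented facts: a "blank" is `None` or a whitespace-only string, and ties go to the earliest row.

```python
from typing import Optional

Row = dict[str, Optional[str]]


def is_blank(value: Optional[str]) -> bool:
    return value is None or value.strip() == ""


def pick_most_complete(group: list[Row]) -> int:
    """Return the index of the row with the most non-blank fields."""
    counts = [sum(not is_blank(v) for v in row.values()) for row in group]
    # list.index returns the first maximum, so ties keep the earliest row.
    return counts.index(max(counts))


def merge_blanks(group: list[Row], survivor_idx: int) -> Row:
    """Fill the survivor's blank fields from the removed rows, in order."""
    merged = dict(group[survivor_idx])
    for i, row in enumerate(group):
        if i == survivor_idx:
            continue
        for col, value in row.items():
            if is_blank(merged.get(col)) and not is_blank(value):
                merged[col] = value
    return merged


group = [
    {"name": "Ann Lee", "email": "", "phone": "555-0100"},
    {"name": "Ann Lee", "email": "ann@example.com", "phone": ""},
    {"name": "Ann Lee", "email": "", "phone": ""},
]
idx = pick_most_complete(group)
survivor = merge_blanks(group, idx)
```

Here the first row survives (it ties with the second on completeness) and picks up the email address from the second row.
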
### Interactive review
- `--review` — prompt y/n/s per match group with side-by-side diff.

### Configuration
- `--config PATH` — load all settings from JSON.
- `--save-config PATH` — save current settings to JSON.

### File handling
- `--sheet NAME|N` — Excel sheet name or 0-based index.
- `--encoding ENC` — override auto-detected encoding.
- `--header-row N` — 0-based header row.

## Recipes

```bash
# Basic auto-detect dedup (preview first, then write)
python -m src.cli customers.csv
python -m src.cli customers.csv --apply

# Fuzzy name match at 80%
python -m src.cli customers.csv --fuzzy name --threshold 80 --apply

# Multiple strong keys (OR logic)
python -m src.cli donors.csv --key fb_id,ein --apply

# Most-complete row + merge missing fields
python -m src.cli contacts.csv --survivor most-complete --merge --apply

# Most-recent + merge
python -m src.cli contacts.csv --survivor most-recent --date-column updated_at --merge --apply

# Interactive review
python -m src.cli customers.csv --review --apply

# Save / load profile
python -m src.cli customers.csv --fuzzy name --threshold 80 --save-config dedup.json
python -m src.cli new.csv --config dedup.json --apply

# Excel
python -m src.cli data.xlsx --sheet "Sales" --apply
```

## Algorithms

- **`jaro_winkler`** (default) — best for short strings (names); weights early chars.
- **`levenshtein`** — edit-distance ratio; typos and transpositions.
- **`token_set_ratio`** — best for addresses; ignores word order.

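To make the relationship between an edit-distance score and the 0-100 `--threshold` concrete, the sketch below computes a Levenshtein-based similarity in pure Python. It is an assumption-laden illustration, not the tool's scorer: the ratio `100 * (1 - distance / longer length)` is one common definition, the library the project actually uses may scale differently, and the names `levenshtein_distance`, `similarity`, and `is_match` are hypothetical.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Similarity on a 0-100 scale: 100 * (1 - distance / longer length)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein_distance(a, b) / longest)


def is_match(a: str, b: str, threshold: float = 85.0) -> bool:
    """Two values match when their similarity reaches the threshold."""
    return similarity(a, b) >= threshold
```

Under this definition `jonathan` and `jonathon` score 87.5 and pass the default threshold of 85, while `smith` and `smyth` fall short of it and would need a lower `--threshold`. Plain Levenshtein counts a swapped pair of adjacent characters as two edits, which is worth keeping in mind for short strings.
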
## Auto-detection

When neither `--subset` nor `--fuzzy` is given, columns are detected by name:

| Pattern | Algorithm | Threshold | Normalizer | Key |
|---------|-----------|-----------|------------|-----|
| Email | exact | 100% | email | strong |
| Phone | exact | 100% | phone | strong |
| Name | jaro_winkler | 85% | name | weak |
| Address | token_set_ratio | 80% | address | weak |

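The detection table above can be read as a lookup keyed on the column name. The sketch below encodes the table's values directly. The matching rule is an assumption (a column matches when its lowercased name contains the pattern word; the real matcher may be stricter or use other aliases), and `Rule`, `RULES`, and `detect` are hypothetical names.

```python
from typing import NamedTuple, Optional


class Rule(NamedTuple):
    algorithm: str
    threshold: int
    normalizer: str
    key: str  # "strong" or "weak"


# Values copied from the auto-detection table.
RULES = {
    "email": Rule("exact", 100, "email", "strong"),
    "phone": Rule("exact", 100, "phone", "strong"),
    "name": Rule("jaro_winkler", 85, "name", "weak"),
    "address": Rule("token_set_ratio", 80, "address", "weak"),
}


def detect(column: str) -> Optional[Rule]:
    """Assumed rule: substring match on the lowercased column name."""
    lowered = column.lower()
    for pattern, rule in RULES.items():
        if pattern in lowered:
            return rule
    return None


columns = ["customer_email", "full_name", "phone", "notes"]
detected = {col: detect(col) for col in columns}
strong = [c for c, r in detected.items() if r and r.key == "strong"]
weak = [c for c, r in detected.items() if r and r.key == "weak"]
```

For this column list, `customer_email` and `phone` come out as strong keys, `full_name` as a weak key, and `notes` is not matched by any pattern.
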
**Strategy rules**:
- Strong keys: each is a standalone strategy (OR).
- Weak keys: AND-paired with each strong key.
- No strong keys: weak keys are promoted to standalone strategies.
- No patterns detected: exact match on all columns.

## Output files (with `--apply`)

| File | Contents |
|------|----------|
| `{stem}_deduplicated.csv` | Cleaned data |
| `{stem}_removed.csv` | Removed rows |
| `{stem}_match_groups.csv` | `_group_id`, `_is_survivor`, `_confidence`, `_matched_on`, `_original_row` + originals |

Log: `logs/dedup_YYYYMMDD_HHMMSS.log`.

---

# Clean Text

```
python -m src.cli_text_clean INPUT_FILE [OPTIONS]
```

Character-level hygiene. See [TECHNICAL.md §10.2](TECHNICAL.md) for the spec.

## Options

### Core
- `--apply` — write output (default: preview).
- `-o, --output PATH` — output path (default `{input}_cleaned.csv`).
- `--preset NAME` — `minimal` / `excel-hygiene` (default) / `paranoid`.

### Scope
- `--columns COLS` — comma-separated columns to clean (default: all string columns).
- `--skip COLS` — exclude these columns.

### Per-op overrides (override the active preset)
- `--no-trim`, `--no-collapse`, `--no-nfc`, `--nfkc`, `--no-smart-chars`, `--no-zero-width`, `--no-bom`, `--no-control`, `--no-line-endings`.

### Case
- `--case MODE` — `upper` / `lower` / `title` / `sentence`. Or per-column: `--case title:name,upper:sku`.
- Title case preserves all-caps tokens (`USA`) and lowercases mid-string particles (`of`, `and`).

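The title-case rule above can be sketched in a few lines of Python. This is a guess at the behavior, not the tool's code: the particle set contains only the two examples given here (the real list is presumably longer), "mid-string" is taken to mean "any word except the first", and `title_case` is a hypothetical name.

```python
PARTICLES = {"of", "and"}  # only the documented examples; assumed incomplete


def title_case(text: str) -> str:
    """Title-case words, keeping all-caps tokens and lowercasing particles."""
    out = []
    for i, word in enumerate(text.split()):
        if len(word) > 1 and word.isupper():
            out.append(word)           # keep all-caps tokens such as USA
        elif i > 0 and word.lower() in PARTICLES:
            out.append(word.lower())   # particle that is not the first word
        else:
            out.append(word.capitalize())
    return " ".join(out)
```

For example, `title_case("bank of america USA")` gives `Bank of America USA`. One consequence of preserving all-caps tokens is that input written entirely in capitals passes through unchanged in this sketch.
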
### Audit + config
- `--full-changelog` — write every change (default: capped at the first 1000).
- `--config PATH` / `--save-config PATH`.

### File
- `--sheet`, `--encoding`, `--header-row` — same as Find Duplicates.

## Presets

| Preset | What it does |
|--------|--------------|
| `minimal` | Trim + collapse only. |
| `excel-hygiene` (default) | Trim, collapse, NFC, smart-char fold, zero-width strip, BOM strip, control strip, line-ending normalize. |
| `paranoid` | `excel-hygiene` + NFKC compatibility fold (lossy). |

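A minimal sketch of how the three presets could layer on each other, using only the standard library. The exact character sets (which smart characters are folded, which code points count as zero-width or control) and the order of operations are assumptions; the spec referenced in TECHNICAL.md is authoritative. The function name `clean` is hypothetical.

```python
import re
import unicodedata

# Assumed character sets, for illustration only.
SMART_CHARS = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en and em dashes
})
ZERO_WIDTH = re.compile("[\u200b\u200c\u200d\u2060]")
CONTROL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")  # keeps \t, \n, \r


def clean(value: str, preset: str = "excel-hygiene") -> str:
    if preset != "minimal":
        value = value.replace("\r\n", "\n").replace("\r", "\n")  # line endings
        value = value.replace("\ufeff", "")                      # BOM
        value = ZERO_WIDTH.sub("", value)
        value = CONTROL.sub("", value)
        value = value.translate(SMART_CHARS)
        # NFKC is the lossy compatibility fold added by `paranoid`.
        form = "NFKC" if preset == "paranoid" else "NFC"
        value = unicodedata.normalize(form, value)
    # Every preset trims and collapses runs of spaces and tabs.
    return re.sub(r"[ \t]+", " ", value).strip()
```

The lossy nature of `paranoid` shows up with compatibility characters: the single-character ligature `ﬁ` (U+FB01) survives NFC untouched but becomes the two letters `fi` under NFKC.
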
## Recipes

```bash
# Safe defaults (preview, then apply)
python -m src.cli_text_clean messy.csv
python -m src.cli_text_clean messy.csv --apply

# Just trim + collapse, leave Unicode alone
python -m src.cli_text_clean messy.csv --preset minimal --apply

# Title-case names, upper-case SKUs
python -m src.cli_text_clean people.csv --case title:name,upper:sku --apply

# Clean only specific columns
python -m src.cli_text_clean orders.csv --columns vendor,product --apply

# Skip a free-text notes column
python -m src.cli_text_clean tickets.csv --skip notes --apply
```

## Output files (with `--apply`)

| File | Contents |
|------|----------|
| `{stem}_cleaned.csv` | Cleaned data |
| `{stem}_changes.csv` | `row`, `column`, `old`, `new`, `ops_applied` (capped to 1000; `--full-changelog` removes cap) |

Log: `logs/text_clean_YYYYMMDD_HHMMSS.log`.

---

# Analyzer

```
python -m src.cli_analyze INPUT_FILE [OPTIONS]
```

Read-only scan; surfaces every detector finding without modifying the file.

## Options
- `--sample-rows N` — cap on rows scanned (default 1000).
- `--json` — print findings as a JSON array on stdout.
- `--strict` — exit non-zero on any warn/error finding.

## JSON schema (one object per finding)

```json
{
  "id": "smart_punctuation_in_data",
  "severity": "warn",
  "confidence": "high",
  "fix_action": "fold_smart_punctuation",
  "pre_applied": false,
  "tool": "02_text_cleaner",
  "count": 17,
  "description": "17 cell(s) contain curly quotes…",
  "column": null,
  "samples": [{"row": 3, "column": "name", "value": "“Alice”"}]
}
```

## Field meanings
- `severity` — `info` / `warn` / `error`. Only `error` blocks the GUI gate.
- `confidence` — `high` (one-click), `medium` (preview), `low` (opt-in).
- `fix_action` — id of the algorithm in `src/core/fixes.py`. Empty for informational-only.
- `pre_applied` — `true` for fixes already applied during the byte-level read pass.

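A script consuming the `--json` output might filter findings by `severity` along these lines. This is a sketch under assumptions: `blocking_findings` is a hypothetical helper, and the sample payload reuses the example object from the schema section rather than real analyzer output.

```python
import json


def blocking_findings(raw: str, strict: bool = False) -> list[dict]:
    """Return the findings that should fail a check.

    strict=False follows the GUI gate described above (only `error` blocks);
    strict=True follows `--strict` (any `warn` or `error` finding).
    """
    levels = {"warn", "error"} if strict else {"error"}
    return [f for f in json.loads(raw) if f.get("severity") in levels]


# Sample payload shaped like the documented schema (a JSON array of findings).
sample = json.dumps([{
    "id": "smart_punctuation_in_data",
    "severity": "warn",
    "confidence": "high",
    "fix_action": "fold_smart_punctuation",
    "pre_applied": False,
    "tool": "02_text_cleaner",
    "count": 17,
    "column": None,
    "samples": [{"row": 3, "column": "name", "value": "\u201cAlice\u201d"}],
}])

gate = blocking_findings(sample)
strict_gate = blocking_findings(sample, strict=True)
```

In practice `raw` would be the stdout of `python -m src.cli_analyze FILE --json`, captured for instance with `subprocess.run(..., capture_output=True, text=True)`. With the sample above, the default gate is empty and the strict gate contains the single `warn` finding.
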
## Detectors

Smart punctuation, NBSP / Unicode whitespace, zero-width chars, dirty headers, whitespace padding, null-like sentinels, mojibake fingerprints, mixed-case email columns, inconsistent date formats, near-duplicate rows, leading-zero IDs, mixed line endings, encoding decode failure, U+FFFD presence.

To add a detector, append an entry in `analyze.py` and a matching fix in `fixes.py`. No other call sites change.