🌐 Language: English · Español

CLI Reference

Three CLI modules, one per Ready tool:

| Module | Command | Purpose |
| --- | --- | --- |
| src.cli | python -m src.cli FILE | Find Duplicates |
| src.cli_text_clean | python -m src.cli_text_clean FILE | Clean Text |
| src.cli_analyze | python -m src.cli_analyze FILE | Analyzer (read-only scan) |

Find Duplicates and Clean Text are preview-only by default; add --apply to write output. The Analyzer is always read-only.


Find Duplicates

python -m src.cli INPUT_FILE [OPTIONS]

Options

Core

  • --apply — write output files (default: preview).
  • -o, --output PATH — output path (default {input}_deduplicated.csv).

Column selection

  • -s, --subset COLS — comma-separated columns to match on (default: auto-detect).
  • -k, --key COLS — strong-key columns; each becomes an independent exact-match strategy (e.g. fb_id, ein, sku).

Fuzzy matching

  • --fuzzy COLS — comma-separated columns to fuzzy-match.
  • -a, --algorithm ALG — levenshtein / jaro_winkler (default) / token_set_ratio.
  • -t, --threshold N — similarity 0-100 (default 85).

Normalization

  • --normalize COL:TYPE — comma-separated col:type pairs. Types: email, phone, name, address, string.
| Type | Effect | Example |
| --- | --- | --- |
| email | lowercase, strip Gmail dots, strip +tag | John.Doe+x@gmail.com → johndoe@gmail.com |
| phone | E.164 (+ ext preserved) | (555) 123-4567 ext 100 → +15551234567;ext=100 |
| name | strip titles + suffixes + particles, case-fold | Dr. Charles de Gaulle Jr. → charles gaulle |
| address | USPS abbrevs + state name → 2-letter, case-fold | 123 Main Street, California → 123 main st ca |
| string | trim + collapse + case-fold | HELLO WORLD → hello world |
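The email row of the table can be sketched as a small function. This is a minimal illustration, not the tool's actual normalizer: the function name `normalize_email` is hypothetical, and it assumes the +tag is stripped for every domain while dots are stripped only for gmail.com.

```python
def normalize_email(raw: str) -> str:
    """Lowercase, drop any +tag, and strip dots from Gmail local parts.

    Hypothetical helper; the domain handling is an assumption.
    """
    value = raw.strip().lower()
    local, sep, domain = value.rpartition("@")
    if not sep:
        # No "@" present: return the case-folded value unchanged.
        return value
    local = local.split("+", 1)[0]      # strip +tag
    if domain == "gmail.com":
        local = local.replace(".", "")  # Gmail ignores dots in the local part
    return f"{local}@{domain}"


print(normalize_email("John.Doe+x@gmail.com"))  # johndoe@gmail.com
```

Two addresses that normalize to the same string would then compare equal under an exact match, which is what lets email act as a strong key.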

Survivor selection

  • --survivor RULE — first (default) / last / most-complete / most-recent.
  • --date-column COL — required for most-recent.
  • --merge — fill blanks in survivor from removed rows.
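The survivor rules and --merge can be pictured with the sketch below. It is an assumption-laden illustration rather than the tool's code: `pick_survivor`, `merge_into_survivor`, and `is_blank` are hypothetical names, ties are broken in favor of the earliest row, and dates are assumed to be ISO-8601 strings so that they sort lexicographically.

```python
def is_blank(value) -> bool:
    return value is None or str(value).strip() == ""


def pick_survivor(group, rule="first", date_column=None) -> int:
    """Return the index of the surviving row in a match group (list of dicts)."""
    indexes = range(len(group))
    if rule == "first":
        return 0
    if rule == "last":
        return len(group) - 1
    if rule == "most-complete":
        # Most non-blank fields wins; ties go to the earliest row (assumption).
        return max(indexes, key=lambda i: (
            sum(not is_blank(v) for v in group[i].values()), -i))
    if rule == "most-recent":
        if date_column is None:
            raise ValueError("most-recent requires a date column")
        # ISO-8601 strings compare correctly as plain strings (assumption).
        return max(indexes, key=lambda i: (group[i][date_column], -i))
    raise ValueError(f"unknown rule: {rule}")


def merge_into_survivor(group, survivor_index: int) -> dict:
    """Fill blank survivor fields from the removed rows, in row order."""
    merged = dict(group[survivor_index])
    for i, row in enumerate(group):
        if i == survivor_index:
            continue
        for column, value in row.items():
            if is_blank(merged.get(column)) and not is_blank(value):
                merged[column] = value
    return merged
```

With --merge, the survivor keeps its own values and only its blank cells are filled, so no populated field is overwritten in this sketch.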

Interactive review

  • --review — prompt y/n/s per match group with side-by-side diff.

Configuration

  • --config PATH — load all settings from JSON.
  • --save-config PATH — save current settings to JSON.

File handling

  • --sheet NAME|N — Excel sheet name or 0-based index.
  • --encoding ENC — override auto-detected encoding.
  • --header-row N — 0-based header row.

Recipes

# Basic auto-detect dedup
python -m src.cli customers.csv [--apply]

# Fuzzy name match at 80%
python -m src.cli customers.csv --fuzzy name --threshold 80 --apply

# Multiple strong keys (OR logic)
python -m src.cli donors.csv --key fb_id,ein --apply

# Most-complete row + merge missing fields
python -m src.cli contacts.csv --survivor most-complete --merge --apply

# Most-recent + merge
python -m src.cli contacts.csv --survivor most-recent --date-column updated_at --merge --apply

# Interactive review
python -m src.cli customers.csv --review --apply

# Save / load profile
python -m src.cli customers.csv --fuzzy name --threshold 80 --save-config dedup.json
python -m src.cli new.csv       --config dedup.json --apply

# Excel
python -m src.cli data.xlsx --sheet "Sales" --apply

Algorithms

  • jaro_winkler (default) — best for short strings (names); weights early chars.
  • levenshtein — edit-distance ratio; typos and transpositions.
  • token_set_ratio — best for addresses; ignores word order.
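To make the 0-100 threshold concrete, here is a pure-Python sketch of an edit-distance similarity. It uses one common definition (distance normalized by the longer string); the library behind the tool may compute its ratio differently, and the names `levenshtein`, `similarity`, and `is_match` are illustrative only.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Similarity on a 0-100 scale, normalized by the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


def is_match(a: str, b: str, threshold: float = 85) -> bool:
    return similarity(a, b) >= threshold


print(similarity("jonathan", "jonathon"))  # 87.5
```

Under this definition a single typo in an eight-character name clears the default threshold of 85, while the same typo in a four-character name scores 75 and does not.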

Auto-detection

When neither --subset nor --fuzzy is given, columns are detected by name:

| Pattern | Algorithm | Threshold | Normalizer | Key |
| --- | --- | --- | --- | --- |
| Email | exact | 100% | email | strong |
| Phone | exact | 100% | phone | strong |
| Name | jaro_winkler | 85% | name | weak |
| Address | token_set_ratio | 80% | address | weak |

Strategy rules: strong keys → standalone OR; weak keys → AND-paired with each strong key; no strong keys → weak promoted to standalone; no patterns → exact match on all columns.
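One literal reading of the strategy rules is sketched below. Each strategy is a tuple of columns that must all match (AND), and a pair of rows is a duplicate if any strategy matches (OR). The function name `build_strategies` and the tuple representation are assumptions made for illustration; the real implementation may combine keys differently.

```python
def build_strategies(strong, weak, all_columns):
    """Return a list of column tuples; columns in a tuple are AND-ed,
    and the tuples themselves are OR-ed. Hypothetical representation."""
    if strong:
        # Strong keys stand alone.
        strategies = [(key,) for key in strong]
        # Each weak key is AND-paired with each strong key.
        strategies += [(s, w) for s in strong for w in weak]
        return strategies
    if weak:
        # No strong keys: weak keys are promoted to standalone strategies.
        return [(key,) for key in weak]
    # No recognized patterns: exact match on all columns.
    return [tuple(all_columns)]


print(build_strategies(["email"], ["name"], ["email", "name", "city"]))
# [('email',), ('email', 'name')]
```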

Output files (with --apply)

| File | Contents |
| --- | --- |
| {stem}_deduplicated.csv | Cleaned data |
| {stem}_removed.csv | Removed rows |
| {stem}_match_groups.csv | _group_id, _is_survivor, _confidence, _matched_on, _original_row + originals |
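For auditing a run, the match-groups file can be regrouped with the standard csv module. The sketch below reads an inline sample instead of a real file. The column names come from the table above, but the sample rows and the textual encoding of `_is_survivor` (accepted here as true/1/yes) are assumptions.

```python
import csv
import io
from collections import defaultdict

# Made-up sample in the documented column layout.
SAMPLE = (
    "_group_id,_is_survivor,_confidence,_matched_on,_original_row,email\n"
    "1,True,100,email,0,a@x.com\n"
    "1,False,100,email,4,a@x.com\n"
    "2,True,88,name,2,b@x.com\n"
    "2,False,88,name,7,b@y.com\n"
)


def load_match_groups(handle):
    """Group the rows of a match-groups CSV by _group_id."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle):
        groups[row["_group_id"]].append(row)
    return dict(groups)


def removed_rows(groups):
    """Original row numbers of every non-survivor, per group."""
    truthy = {"true", "1", "yes"}  # assumed encodings of _is_survivor
    return {
        group_id: [r["_original_row"] for r in rows
                   if r["_is_survivor"].strip().lower() not in truthy]
        for group_id, rows in groups.items()
    }


groups = load_match_groups(io.StringIO(SAMPLE))
print(removed_rows(groups))  # {'1': ['4'], '2': ['7']}
```

To read an actual output file, pass `open(path, newline="", encoding="utf-8")` in place of the `io.StringIO` object.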

Log: logs/dedup_YYYYMMDD_HHMMSS.log.


Clean Text

python -m src.cli_text_clean INPUT_FILE [OPTIONS]

Character-level hygiene. See TECHNICAL.md §10.2 for the spec.

Options

Core

  • --apply — write output (default: preview).
  • -o, --output PATH — output path (default {input}_cleaned.csv).
  • --preset NAME — minimal / excel-hygiene (default) / paranoid.

Scope

  • --columns COLS — comma-separated columns to clean (default: all string columns).
  • --skip COLS — exclude these columns.

Per-op overrides (override the active preset)

  • --no-trim, --no-collapse, --no-nfc, --nfkc, --no-smart-chars, --no-zero-width, --no-bom, --no-control, --no-line-endings.

Case

  • --case MODE — upper / lower / title / sentence. Or per-column: --case title:name,upper:sku.
  • Title case preserves all-caps tokens (USA) and lowercases mid-string particles (of, and).
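The title-case behavior described above can be approximated as follows. This is a sketch under assumptions: the particle set contains only the two examples the text names (of, and), the first word is always capitalized, and `title_case` is a hypothetical name.

```python
PARTICLES = {"of", "and"}  # only the examples named above; the real list is unknown


def title_case(text: str) -> str:
    words = text.split(" ")
    result = []
    for index, word in enumerate(words):
        if len(word) > 1 and word.isupper():
            result.append(word)                # keep all-caps tokens such as USA
        elif index > 0 and word.lower() in PARTICLES:
            result.append(word.lower())        # lowercase mid-string particles
        else:
            result.append(word[:1].upper() + word[1:].lower())
    return " ".join(result)


print(title_case("bank of USA"))  # Bank of USA
```

One consequence of preserving all-caps tokens is that a value typed entirely in capitals passes through this sketch unchanged.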

Audit + config

  • --full-changelog — write every change (default: capped at the first 1000).
  • --config PATH / --save-config PATH.

File

  • --sheet, --encoding, --header-row — same as Find Duplicates.

Presets

| Preset | What it does |
| --- | --- |
| minimal | Trim + collapse only. |
| excel-hygiene (default) | Trim, collapse, NFC, smart-char fold, zero-width strip, BOM strip, control strip, line-ending normalize. |
| paranoid | excel-hygiene + NFKC compatibility fold (lossy). |
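The operations listed for excel-hygiene map onto a handful of standard-library calls. The sketch below is illustrative only: the order of operations, the exact characters folded or stripped, and the name `clean_cell` are assumptions, and the authoritative spec is the one referenced in TECHNICAL.md.

```python
import re
import unicodedata

# Assumed smart-character folds; the real table may be larger.
SMART_CHARS = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en and em dashes
    "\u2026": "...",                # ellipsis
})
# Assumed zero-width set: ZWSP, ZWNJ, ZWJ, word joiner.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060"))


def clean_cell(text: str, nfkc: bool = False) -> str:
    text = text.replace("\ufeff", "")                      # BOM strip
    text = text.translate(ZERO_WIDTH)                      # zero-width strip
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # line-ending normalize
    text = "".join(ch for ch in text                       # control strip
                   if ch in "\n\t" or unicodedata.category(ch) != "Cc")
    text = unicodedata.normalize("NFKC" if nfkc else "NFC", text)
    text = text.translate(SMART_CHARS)                     # smart-char fold
    text = re.sub(r"[ \t]+", " ", text)                    # collapse
    return text.strip()                                    # trim


print(clean_cell("\ufeff  \u201cHi\u201d\u200b   there\r\n"))  # "Hi" there
```

Passing `nfkc=True` mirrors the paranoid preset: compatibility characters such as the ﬁ ligature are rewritten to plain letters, which is why that fold is described as lossy.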

Recipes

# Safe defaults (preview, then apply)
python -m src.cli_text_clean messy.csv [--apply]

# Just trim + collapse, leave Unicode alone
python -m src.cli_text_clean messy.csv --preset minimal --apply

# Title-case names, upper-case SKUs
python -m src.cli_text_clean people.csv --case title:name,upper:sku --apply

# Clean only specific columns
python -m src.cli_text_clean orders.csv --columns vendor,product --apply

# Skip a free-text notes column
python -m src.cli_text_clean tickets.csv --skip notes --apply

Output files (with --apply)

| File | Contents |
| --- | --- |
| {stem}_cleaned.csv | Cleaned data |
| {stem}_changes.csv | row, column, old, new, ops_applied (capped to 1000; --full-changelog removes cap) |

Log: logs/text_clean_YYYYMMDD_HHMMSS.log.


Analyzer

python -m src.cli_analyze INPUT_FILE [OPTIONS]

Read-only scan; surfaces every detector finding without modifying the file.

Options

  • --sample-rows N — cap on rows scanned (default 1000).
  • --json — print findings as a JSON array on stdout.
  • --strict — exit non-zero on any warn/error finding.

JSON schema (one object per finding)

{
  "id": "smart_punctuation_in_data",
  "severity": "warn",
  "confidence": "high",
  "fix_action": "fold_smart_punctuation",
  "pre_applied": false,
  "tool": "02_text_cleaner",
  "count": 17,
  "description": "17 cell(s) contain curly quotes…",
  "column": null,
  "samples": [{"row": 3, "column": "name", "value": "“Alice”"}]
}

Field meanings

  • severity — info / warn / error. Only error blocks the GUI gate.
  • confidence — high (one-click), medium (preview), low (opt-in).
  • fix_action — id of the algorithm in src/core/fixes.py. Empty for informational-only.
  • pre_applied — true for fixes already applied during the byte-level read pass.
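A script that consumes the --json output might look like the sketch below. It assumes the JSON array has already been captured as a string (for example from the stdout of `python -m src.cli_analyze FILE --json`). The function names are hypothetical, and the exit-code logic only mirrors the documented --strict behavior rather than reproducing the tool's code.

```python
import json


def summarize(findings_json: str, strict: bool = False):
    """Group finding ids by severity and compute a --strict style exit code."""
    findings = json.loads(findings_json)
    by_severity = {}
    for finding in findings:
        by_severity.setdefault(finding["severity"], []).append(finding["id"])
    # Mirrors --strict: non-zero on any warn or error finding.
    blocking = strict and any(s in by_severity for s in ("warn", "error"))
    return by_severity, (1 if blocking else 0)


def one_click_fixes(findings_json: str):
    """fix_action ids that are high-confidence and not already applied."""
    return [
        f["fix_action"]
        for f in json.loads(findings_json)
        if f["confidence"] == "high" and f["fix_action"] and not f["pre_applied"]
    ]
```

Filtering on `pre_applied` avoids listing fixes that the byte-level read pass has already performed.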

Detectors

Smart punctuation, NBSP / Unicode whitespace, zero-width chars, dirty headers, whitespace padding, null-like sentinels, mojibake fingerprints, mixed-case email columns, inconsistent date formats, near-duplicate rows, leading-zero IDs, mixed line endings, encoding decode failure, U+FFFD presence.

To add a detector, append an entry in analyze.py and a matching fix in fixes.py. No other call sites change.
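The detector-plus-fix pairing can be pictured with the sketch below. Everything structural here is hypothetical: the `DETECTORS` list, the `FIXES` dict, the function signatures, and the ids are invented for illustration and are not the actual layout of analyze.py or fixes.py. Only the finding fields follow the JSON schema shown earlier.

```python
ZERO_WIDTH = "\u200b\u200c\u200d\u2060"  # assumed character set


def detect_zero_width(rows):
    """rows: list of dicts. Return one finding dict, or None if clean."""
    samples = [
        {"row": index, "column": column, "value": value}
        for index, row in enumerate(rows)
        for column, value in row.items()
        if isinstance(value, str) and any(ch in ZERO_WIDTH for ch in value)
    ]
    if not samples:
        return None
    return {
        "id": "zero_width_chars",            # hypothetical id
        "severity": "warn",
        "confidence": "high",
        "fix_action": "strip_zero_width",    # must match a key in FIXES
        "pre_applied": False,
        "tool": "02_text_cleaner",
        "count": len(samples),
        "description": f"{len(samples)} cell(s) contain zero-width characters",
        "column": None,
        "samples": samples[:5],
    }


def strip_zero_width(value: str) -> str:
    return value.translate(dict.fromkeys(map(ord, ZERO_WIDTH)))


# Hypothetical registries: one entry on each side per detector.
DETECTORS = [detect_zero_width]
FIXES = {"strip_zero_width": strip_zero_width}
```

Keying the fix by the same string the finding carries in `fix_action` is what would let a caller apply a fix without knowing which detector produced it.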