Michael db5ec084da docs+code: rename tool labels everywhere

Sweep follow-up to 93e43fc. Display labels now consistent across docs,
landing pages, CLI output, code comments, docstrings, and test prose.
Five parallel surfaces touched:

- docs (EN + ES): README, USER-GUIDE, CLI-REFERENCE, and 11 internal
  design/planning docs
- landing pages: index + bookkeeper/revops/shopify-pet
- src: CLI module docstrings, _TOOL_DISPLAY dicts in cli_analyze.py
  and gui/components/_legacy.py, core module headers, every tool
  page's module docstring
- tests: class/method/module docstrings and section-header comments
- test-cases READMEs

Page slugs (1_Deduplicator etc.), tool_id strings (01_deduplicator
etc.), Python class names (TestDeduplicatorWorkflow, FeatureFlag.*),
URL paths, anchor IDs, CSS classes, and asset filenames were left
intact since they're code identifiers / structural references.

All 2033 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-16 19:50:09 +00:00

🌐 Language: English · Español

CLI Reference

Three CLI modules, one per Ready tool:

| Module | Command | Purpose |
| --- | --- | --- |
| `src.cli` | `python -m src.cli FILE` | Find Duplicates |
| `src.cli_text_clean` | `python -m src.cli_text_clean FILE` | Clean Text |
| `src.cli_analyze` | `python -m src.cli_analyze FILE` | Analyzer (read-only scan) |

Every command is preview-only by default — add --apply to write output.

Find Duplicates

python -m src.cli INPUT_FILE [OPTIONS]

Options

Core

--apply — write output files (default: preview).
-o, --output PATH — output path (default {input}_deduplicated.csv).

Column selection

-s, --subset COLS — comma-separated columns to match on (default: auto-detect).
-k, --key COLS — strong-key columns; each becomes an independent exact-match strategy (fb_id, ein, sku).

Fuzzy matching

--fuzzy COLS — comma-separated columns to fuzzy-match.
-a, --algorithm ALG — levenshtein / jaro_winkler (default) / token_set_ratio.
-t, --threshold N — similarity 0-100 (default 85).

Normalization

--normalize COL:TYPE — comma-separated col:type pairs. Types: email, phone, name, address, string.

| Type | Effect | Example |
| --- | --- | --- |
| `email` | lowercase, strip Gmail dots, strip `+tag` | `John.Doe+x@gmail.com` → `johndoe@gmail.com` |
| `phone` | E.164 (+ ext preserved) | `(555) 123-4567 ext 100` → `+15551234567;ext=100` |
| `name` | strip titles + suffixes + particles, case-fold | `Dr. Charles de Gaulle Jr.` → `charles gaulle` |
| `address` | USPS abbrevs + state name → 2-letter, case-fold | `123 Main Street, California` → `123 main st ca` |
| `string` | trim + collapse + case-fold | `HELLO WORLD` → `hello world` |
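
As a rough illustration of the `email` and `string` rows above, here is a minimal Python sketch. The function names are hypothetical and are not taken from the project's source, and treating only `gmail.com` as a Gmail domain is an assumption.

```python
import re


def normalize_string(value: str) -> str:
    """Trim, collapse internal whitespace runs to one space, and case-fold."""
    return re.sub(r"\s+", " ", value.strip()).casefold()


def normalize_email(value: str) -> str:
    """Lowercase, drop a +tag, and strip dots from Gmail local parts."""
    value = value.strip().lower()
    if "@" not in value:
        return value  # not an address; leave it lowercased
    local, _, domain = value.rpartition("@")
    local = local.split("+", 1)[0]  # strip the +tag, if any
    if domain == "gmail.com":  # assumption: only this domain counts as Gmail
        local = local.replace(".", "")
    return f"{local}@{domain}"
```

With these definitions, `normalize_email("John.Doe+x@gmail.com")` yields `johndoe@gmail.com`, matching the example in the table.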

Survivor selection

--survivor RULE — first (default) / last / most-complete / most-recent.
--date-column COL — required for most-recent.
--merge — fill blanks in survivor from removed rows.
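
The interaction between `--survivor most-complete` and `--merge` can be sketched as follows. This is an illustrative sketch only: the helper names are hypothetical, rows are modelled as plain dicts, and breaking ties in favour of the earliest row is an assumption the text does not confirm.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


def is_blank(value: Optional[str]) -> bool:
    """Treat None, empty, and whitespace-only values as blank."""
    return value is None or value.strip() == ""


def pick_most_complete(group: List[Row]) -> int:
    """Index of the row with the most non-blank fields (earliest row wins ties)."""
    counts = [sum(not is_blank(v) for v in row.values()) for row in group]
    return counts.index(max(counts))


def merge_into_survivor(group: List[Row], survivor_idx: int) -> Row:
    """Fill blank survivor fields from the removed rows, in row order."""
    merged = dict(group[survivor_idx])
    for i, row in enumerate(group):
        if i == survivor_idx:
            continue
        for col, val in row.items():
            if is_blank(merged.get(col)) and not is_blank(val):
                merged[col] = val
    return merged
```

The survivor's own values are never overwritten; only blanks are filled.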

Interactive review

--review — prompt y/n/s per match group with side-by-side diff.

Configuration

--config PATH — load all settings from JSON.
--save-config PATH — save current settings to JSON.

File handling

--sheet NAME|N — Excel sheet name or 0-based index.
--encoding ENC — override auto-detected encoding.
--header-row N — 0-based header row.

Recipes

# Basic auto-detect dedup
python -m src.cli customers.csv [--apply]

# Fuzzy name match at 80%
python -m src.cli customers.csv --fuzzy name --threshold 80 --apply

# Multiple strong keys (OR logic)
python -m src.cli donors.csv --key fb_id,ein --apply

# Most-complete row + merge missing fields
python -m src.cli contacts.csv --survivor most-complete --merge --apply

# Most-recent + merge
python -m src.cli contacts.csv --survivor most-recent --date-column updated_at --merge --apply

# Interactive review
python -m src.cli customers.csv --review --apply

# Save / load profile
python -m src.cli customers.csv --fuzzy name --threshold 80 --save-config dedup.json
python -m src.cli new.csv       --config dedup.json --apply

# Excel
python -m src.cli data.xlsx --sheet "Sales" --apply

Algorithms

jaro_winkler (default) — best for short strings (names); weights early chars.
levenshtein — edit-distance ratio; typos and transpositions.
token_set_ratio — best for addresses; ignores word order.
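
To make the threshold concrete, the sketch below computes a Levenshtein-based similarity on a 0-100 scale in pure Python. The normalization used here (one minus distance over the longer length) is an assumption; the tool's exact ratio formula is not stated in this reference, and the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: insertions, deletions, substitutions cost 1 each."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Similarity in 0-100; assumed normalization, not the tool's confirmed one."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


def is_match(a: str, b: str, threshold: float = 85) -> bool:
    """Compare against a threshold on the same 0-100 scale as --threshold."""
    return similarity(a, b) >= threshold
```

Under this formula a single substitution in an eight-character name still clears the default threshold of 85, while a three-edit difference in a seven-character string does not.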

Auto-detection

When neither --subset nor --fuzzy is given, columns are detected by name:

| Pattern | Algorithm | Threshold | Normalizer | Key |
| --- | --- | --- | --- | --- |
| Email | exact | 100% | email | strong |
| Phone | exact | 100% | phone | strong |
| Name | jaro_winkler | 85% | name | weak |
| Address | token_set_ratio | 80% | address | weak |

Strategy rules:

strong keys → standalone OR
weak keys → AND-paired with each strong key
no strong keys → weak promoted to standalone
no patterns → exact match on all columns
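
One literal reading of the strategy rules can be sketched as below. Each strategy is modelled as a tuple of columns that must all match, and two rows count as duplicates if any strategy matches. The function name and this data model are assumptions made for illustration; the actual implementation may differ, for instance in how thresholds apply to paired strategies.

```python
from typing import List, Sequence, Tuple

Strategy = Tuple[str, ...]


def build_strategies(
    strong: Sequence[str], weak: Sequence[str], all_columns: Sequence[str]
) -> List[Strategy]:
    """Turn detected strong/weak key columns into OR-combined match strategies."""
    if not strong and not weak:
        # No recognised patterns: one exact-match strategy over every column.
        return [tuple(all_columns)]
    if not strong:
        # No strong keys: each weak key is promoted to a standalone strategy.
        return [(w,) for w in weak]
    # Strong keys stand alone; each weak key is AND-paired with each strong key.
    strategies: List[Strategy] = [(s,) for s in strong]
    strategies += [(s, w) for s in strong for w in weak]
    return strategies
```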

Output files (with `--apply`)

| File | Contents |
| --- | --- |
| `{stem}_deduplicated.csv` | Cleaned data |
| `{stem}_removed.csv` | Removed rows |
| `{stem}_match_groups.csv` | `_group_id`, `_is_survivor`, `_confidence`, `_matched_on`, `_original_row` + originals |

Log: logs/dedup_YYYYMMDD_HHMMSS.log.
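
For downstream auditing, the match-groups file can be read with the standard `csv` module. The sketch below only relies on the `_group_id` column named above; the function name is hypothetical, and it takes CSV text rather than a path so it can be tried without running the tool.

```python
import csv
import io
from collections import defaultdict
from typing import Dict


def group_sizes(csv_text: str) -> Dict[str, int]:
    """Count how many rows fall into each match group, keyed by _group_id."""
    sizes: Dict[str, int] = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        sizes[row["_group_id"]] += 1
    return dict(sizes)
```

In practice the text would come from something like `Path("customers_match_groups.csv").read_text()`, where the file name follows the `{stem}_match_groups.csv` pattern.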

Clean Text

python -m src.cli_text_clean INPUT_FILE [OPTIONS]

Character-level hygiene. See TECHNICAL.md §10.2 for the spec.

Options

Core

--apply — write output (default: preview).
-o, --output PATH — output path (default {input}_cleaned.csv).
--preset NAME — minimal / excel-hygiene (default) / paranoid.

Scope

--columns COLS — comma-separated columns to clean (default: all string columns).
--skip COLS — exclude these columns.

Per-op overrides (override the active preset)

--no-trim, --no-collapse, --no-nfc, --nfkc, --no-smart-chars, --no-zero-width, --no-bom, --no-control, --no-line-endings.

Case

--case MODE — upper / lower / title / sentence. Or per-column: --case title:name,upper:sku.
Title case preserves all-caps tokens (USA) and lowercases mid-string particles (of, and).
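
The title-case behaviour described above might look roughly like this. It is a sketch under assumptions: the particle set contains only the two words the text names, the all-caps rule is checked before the particle rule, and the first word is always capitalized. The function name is hypothetical.

```python
PARTICLES = {"of", "and"}  # only the particles named in the text; the real list may be longer


def smart_title(text: str) -> str:
    """Title-case words, keeping all-caps tokens and lowercasing mid-string particles."""
    out = []
    for i, word in enumerate(text.split()):
        if len(word) > 1 and word.isupper():
            out.append(word)  # preserve acronyms such as USA
        elif i > 0 and word.lower() in PARTICLES:
            out.append(word.lower())  # particle that is not the first word
        else:
            out.append(word.capitalize())
    return " ".join(out)
```

Note that `text.split()` also collapses whitespace runs, which a real implementation may prefer to leave to the trim and collapse operations.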

Audit + config

--full-changelog — write every change (default caps to first 1000).
--config PATH / --save-config PATH.

File

--sheet, --encoding, --header-row — same as Find Duplicates.

Presets

| Preset | What it does |
| --- | --- |
| `minimal` | Trim + collapse only. |
| `excel-hygiene` (default) | Trim, collapse, NFC, smart-char fold, zero-width strip, BOM strip, control strip, line-ending normalize. |
| `paranoid` | `excel-hygiene` + NFKC compatibility fold (lossy). |
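
The three presets can be approximated in a few lines with the standard library. This is a sketch, not the spec in TECHNICAL.md: the smart-character map, the zero-width set, the order of operations, and the function names are all assumptions.

```python
import re
import unicodedata

# Assumed subsets; the real tool may fold or strip more characters.
SMART_CHARS = {"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"',
               "\u2013": "-", "\u2014": "-"}
ZERO_WIDTH = "\u200b\u200c\u200d\u2060"


def clean_minimal(text: str) -> str:
    """Collapse runs of spaces/tabs to one space, then trim."""
    return re.sub(r"[ \t]+", " ", text).strip()


def clean_excel_hygiene(text: str) -> str:
    """BOM strip, zero-width strip, smart-char fold, line endings, control strip, NFC."""
    text = text.replace("\ufeff", "")
    text = text.translate({ord(c): None for c in ZERO_WIDTH})
    text = text.translate(str.maketrans(SMART_CHARS))
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = "".join(ch for ch in text
                   if ch in "\n\t" or unicodedata.category(ch) != "Cc")
    text = unicodedata.normalize("NFC", text)
    return clean_minimal(text)


def clean_paranoid(text: str) -> str:
    """excel-hygiene plus the lossy NFKC compatibility fold."""
    return clean_minimal(unicodedata.normalize("NFKC", clean_excel_hygiene(text)))
```

The NFKC step is what makes `paranoid` lossy: for example, the single-character ligature U+FB01 becomes the two letters `fi`, and the original cannot be recovered afterwards.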

Recipes

# Safe defaults (preview, then apply)
python -m src.cli_text_clean messy.csv [--apply]

# Just trim + collapse, leave Unicode alone
python -m src.cli_text_clean messy.csv --preset minimal --apply

# Title-case names, upper-case SKUs
python -m src.cli_text_clean people.csv --case title:name,upper:sku --apply

# Clean only specific columns
python -m src.cli_text_clean orders.csv --columns vendor,product --apply

# Skip a free-text notes column
python -m src.cli_text_clean tickets.csv --skip notes --apply

Output files (with `--apply`)

| File | Contents |
| --- | --- |
| `{stem}_cleaned.csv` | Cleaned data |
| `{stem}_changes.csv` | `row`, `column`, `old`, `new`, `ops_applied` (capped to 1000; `--full-changelog` removes cap) |

Log: logs/text_clean_YYYYMMDD_HHMMSS.log.

Analyzer

python -m src.cli_analyze INPUT_FILE [OPTIONS]

Read-only scan; surfaces every detector finding without modifying the file.

Options

--sample-rows N — cap on rows scanned (default 1000).
--json — print findings as a JSON array on stdout.
--strict — exit non-zero on any warn/error finding.

JSON schema (one object per finding)

{
  "id": "smart_punctuation_in_data",
  "severity": "warn",
  "confidence": "high",
  "fix_action": "fold_smart_punctuation",
  "pre_applied": false,
  "tool": "02_text_cleaner",
  "count": 17,
  "description": "17 cell(s) contain curly quotes…",
  "column": null,
  "samples": [{"row": 3, "column": "name", "value": "“Alice”"}]
}

Field meanings

severity — info / warn / error. Only error blocks the GUI gate.
confidence — high (one-click), medium (preview), low (opt-in).
fix_action — id of the algorithm in src/core/fixes.py. Empty for informational-only.
pre_applied — true for fixes already applied during the byte-level read pass.
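
A script consuming `--json` output could use the fields above along these lines. The sketch assumes the findings arrive as a JSON array on stdin or in a string, and it uses exit code 1 for the strict case, whereas the reference only promises "non-zero". Both function names are hypothetical.

```python
import json
from typing import List

BLOCKING = {"warn", "error"}  # severities that --strict treats as failures


def strict_exit_code(findings_json: str) -> int:
    """Mirror --strict: non-zero when any finding is warn or error (1 is assumed)."""
    findings = json.loads(findings_json)
    return 1 if any(f["severity"] in BLOCKING for f in findings) else 0


def one_click_fixes(findings_json: str) -> List[str]:
    """fix_action ids that are high-confidence, non-empty, and not already applied."""
    return [
        f["fix_action"]
        for f in json.loads(findings_json)
        if f["confidence"] == "high" and f["fix_action"] and not f["pre_applied"]
    ]
```

Findings with an empty `fix_action` are skipped because they are informational only, and `pre_applied` findings are skipped because their fix already ran during the read pass.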

Detectors

Smart punctuation, NBSP / Unicode whitespace, zero-width chars, dirty headers, whitespace padding, null-like sentinels, mojibake fingerprints, mixed-case email columns, inconsistent date formats, near-duplicate rows, leading-zero IDs, mixed line endings, encoding decode failure, U+FFFD presence.

To add a detector, append an entry in analyze.py and a matching fix in fixes.py. No other call sites change.
