Sweep follow-up to 93e43fc. Display labels now consistent across docs,
landing pages, CLI output, code comments, docstrings, and test prose.
Five parallel surfaces touched:
- docs (EN + ES): README, USER-GUIDE, CLI-REFERENCE, and 11 internal
design/planning docs
- landing pages: index + bookkeeper/revops/shopify-pet
- src: CLI module docstrings, _TOOL_DISPLAY dicts in cli_analyze.py
and gui/components/_legacy.py, core module headers, every tool
page's module docstring
- tests: class/method/module docstrings and section-header comments
- test-cases READMEs
Page slugs (1_Deduplicator etc.), tool_id strings (01_deduplicator
etc.), Python class names (TestDeduplicatorWorkflow, FeatureFlag.*),
URL paths, anchor IDs, CSS classes, and asset filenames were left
intact since they're code identifiers / structural references.
All 2033 tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
🌐 Language: English · Español
CLI Reference
Three CLI modules, one per Ready tool:
| Module | Command | Purpose |
|---|---|---|
| `src.cli` | `python -m src.cli FILE` | Find Duplicates |
| `src.cli_text_clean` | `python -m src.cli_text_clean FILE` | Clean Text |
| `src.cli_analyze` | `python -m src.cli_analyze FILE` | Analyzer (read-only scan) |
Every command is preview-only by default — add --apply to write output.
Find Duplicates
python -m src.cli INPUT_FILE [OPTIONS]
Options
Core
- `--apply`: write output files (default: preview).
- `-o, --output PATH`: output path (default `{input}_deduplicated.csv`).
Column selection
- `-s, --subset COLS`: comma-separated columns to match on (default: auto-detect).
- `-k, --key COLS`: strong-key columns; each becomes an independent exact-match strategy (`fb_id`, `ein`, `sku`).
Fuzzy matching
- `--fuzzy COLS`: comma-separated columns to fuzzy-match.
- `-a, --algorithm ALG`: `levenshtein` / `jaro_winkler` (default) / `token_set_ratio`.
- `-t, --threshold N`: similarity 0-100 (default 85).
Normalization
- `--normalize COL:TYPE`: comma-separated `col:type` pairs. Types: `email`, `phone`, `name`, `address`, `string`.
| Type | Effect | Example |
|---|---|---|
| `email` | lowercase, strip Gmail dots, strip +tag | John.Doe+x@gmail.com → johndoe@gmail.com |
| `phone` | E.164 (+ ext preserved) | (555) 123-4567 ext 100 → +15551234567;ext=100 |
| `name` | strip titles + suffixes + particles, case-fold | Dr. Charles de Gaulle Jr. → charles gaulle |
| `address` | USPS abbrevs + state name → 2-letter, case-fold | 123 Main Street, California → 123 main st ca |
| `string` | trim + collapse + case-fold | HELLO WORLD → hello world |
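The `email` and `string` rows can be illustrated with a minimal sketch. The function names here are hypothetical and this is not the tool's implementation; it assumes that the `+tag` is stripped for every domain while dots are stripped only for `gmail.com`, which is one reading of the table.

```python
import re


def normalize_email(value: str) -> str:
    """Lowercase, drop any +tag, and strip dots from Gmail local parts."""
    local, sep, domain = value.strip().lower().partition("@")
    if not sep:
        # Not an email address: return the lowercased text unchanged.
        return local
    local = local.split("+", 1)[0]  # assumption: +tag is dropped for all domains
    if domain == "gmail.com":       # assumption: dot-stripping is Gmail-only
        local = local.replace(".", "")
    return f"{local}@{domain}"


def normalize_string(value: str) -> str:
    """Trim, collapse runs of whitespace to one space, and case-fold."""
    return re.sub(r"\s+", " ", value.strip()).casefold()


print(normalize_email("John.Doe+x@gmail.com"))  # johndoe@gmail.com
print(normalize_string("  HELLO   WORLD "))     # hello world
```

Comparing normalized values rather than raw cells is what lets an exact-match strategy treat `John.Doe+x@gmail.com` and `johndoe@gmail.com` as the same key.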
Survivor selection
- `--survivor RULE`: `first` (default) / `last` / `most-complete` / `most-recent`.
- `--date-column COL`: required for `most-recent`.
- `--merge`: fill blanks in survivor from removed rows.
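As a rough illustration of the survivor rules and `--merge`, the sketch below works on one match group represented as a list of dicts. The helper names are hypothetical, ties go to the earliest row, and dates are assumed to be ISO-8601 strings so that string comparison orders them correctly; the tool itself may behave differently on all three points.

```python
from typing import Dict, List, Optional

Row = Dict[str, str]


def filled(row: Row) -> int:
    """Count the non-blank cells in a row."""
    return sum(1 for value in row.values() if value.strip())


def pick_survivor(group: List[Row], rule: str = "first",
                  date_column: Optional[str] = None) -> int:
    """Return the index of the surviving row within one match group."""
    if rule == "first":
        return 0
    if rule == "last":
        return len(group) - 1
    if rule == "most-complete":
        return max(range(len(group)), key=lambda i: filled(group[i]))
    if rule == "most-recent":
        if date_column is None:
            raise ValueError("most-recent requires a date column")
        # Assumption: ISO-8601 date strings, which sort chronologically.
        return max(range(len(group)), key=lambda i: group[i][date_column])
    raise ValueError(f"unknown rule: {rule}")


def merge_blanks(group: List[Row], survivor: int) -> Row:
    """Fill blank survivor cells from the rows that would be removed."""
    merged = dict(group[survivor])
    for index, row in enumerate(group):
        if index == survivor:
            continue
        for column, value in row.items():
            if not merged.get(column, "").strip() and value.strip():
                merged[column] = value
    return merged


group = [
    {"name": "Ann Lee", "email": "", "phone": "", "city": "Austin",
     "updated_at": "2024-01-05"},
    {"name": "Ann Lee", "email": "ann@example.com", "phone": "555-0100",
     "city": "", "updated_at": "2024-03-01"},
]
winner = pick_survivor(group, "most-complete")
print(winner, merge_blanks(group, winner)["city"])  # 1 Austin
```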
Interactive review
- `--review`: prompt y/n/s per match group with side-by-side diff.
Configuration
- `--config PATH`: load all settings from JSON.
- `--save-config PATH`: save current settings to JSON.
File handling
- `--sheet NAME|N`: Excel sheet name or 0-based index.
- `--encoding ENC`: override auto-detected encoding.
- `--header-row N`: 0-based header row.
Recipes
# Basic auto-detect dedup
python -m src.cli customers.csv [--apply]
# Fuzzy name match at 80%
python -m src.cli customers.csv --fuzzy name --threshold 80 --apply
# Multiple strong keys (OR logic)
python -m src.cli donors.csv --key fb_id,ein --apply
# Most-complete row + merge missing fields
python -m src.cli contacts.csv --survivor most-complete --merge --apply
# Most-recent + merge
python -m src.cli contacts.csv --survivor most-recent --date-column updated_at --merge --apply
# Interactive review
python -m src.cli customers.csv --review --apply
# Save / load profile
python -m src.cli customers.csv --fuzzy name --threshold 80 --save-config dedup.json
python -m src.cli new.csv --config dedup.json --apply
# Excel
python -m src.cli data.xlsx --sheet "Sales" --apply
Algorithms
- `jaro_winkler` (default): best for short strings (names); weights early chars.
- `levenshtein`: edit-distance ratio; typos and transpositions.
- `token_set_ratio`: best for addresses; ignores word order.
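To make the "edit-distance ratio" idea concrete, here is a self-contained sketch of Levenshtein distance scaled to the 0-100 range that `--threshold` uses. The scaling formula (one minus distance over the longer length) is an assumption for illustration; the library the tool actually uses may compute its ratio differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Scale edit distance to 0-100 (assumed formula, for illustration only)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


THRESHOLD = 85  # the documented default for --threshold
for left, right in [("jon smith", "john smith"), ("kitten", "sitting")]:
    score = similarity(left, right)
    print(left, "|", right, "|", round(score, 1), "| match:", score >= THRESHOLD)
```

Under this scaling a single missing letter in a ten-character name clears the default threshold, while `kitten` / `sitting` (three edits over seven characters) does not.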
Auto-detection
When neither --subset nor --fuzzy is given, columns are detected by name:
| Pattern | Algorithm | Threshold | Normalizer | Key |
|---|---|---|---|---|
| Email | exact | 100% | email | strong |
| Phone | exact | 100% | phone | strong |
| Name | jaro_winkler | 85% | name | weak |
| Address | token_set_ratio | 80% | address | weak |
Strategy rules: strong keys → standalone OR; weak keys → AND-paired with each strong key; no strong keys → weak promoted to standalone; no patterns → exact match on all columns.
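The strategy rules can be read as a small combinator over the detected columns. The sketch below is one literal reading of that sentence, not the tool's code: `build_strategies` is a hypothetical name, each returned tuple is a set of columns that must all match (AND), and the list as a whole is OR-ed.

```python
from typing import List, Sequence, Tuple

Strategy = Tuple[str, ...]  # columns that must all match (AND)


def build_strategies(strong: Sequence[str], weak: Sequence[str],
                     all_columns: Sequence[str]) -> List[Strategy]:
    """Return the OR-ed list of AND-groups implied by the detected keys."""
    if not strong and not weak:
        # No patterns detected: exact match on every column.
        return [tuple(all_columns)]
    if not strong:
        # No strong keys: weak keys are promoted to standalone strategies.
        return [(column,) for column in weak]
    strategies: List[Strategy] = [(column,) for column in strong]
    # Each weak key is AND-paired with each strong key.
    strategies += [(s, w) for s in strong for w in weak]
    return strategies


columns = ["email", "phone", "name", "notes"]
print(build_strategies(["email", "phone"], ["name"], columns))
print(build_strategies([], ["name"], columns))
print(build_strategies([], [], columns))
```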
Output files (with --apply)
| File | Contents |
|---|---|
| `{stem}_deduplicated.csv` | Cleaned data |
| `{stem}_removed.csv` | Removed rows |
| `{stem}_match_groups.csv` | `_group_id`, `_is_survivor`, `_confidence`, `_matched_on`, `_original_row` + originals |
Log: logs/dedup_YYYYMMDD_HHMMSS.log.
Clean Text
python -m src.cli_text_clean INPUT_FILE [OPTIONS]
Character-level hygiene. See TECHNICAL.md §10.2 for the spec.
Options
Core
- `--apply`: write output (default: preview).
- `-o, --output PATH`: output path (default `{input}_cleaned.csv`).
- `--preset NAME`: `minimal` / `excel-hygiene` (default) / `paranoid`.
Scope
- `--columns COLS`: comma-separated columns to clean (default: all string columns).
- `--skip COLS`: exclude these columns.
Per-op overrides (override the active preset)
- `--no-trim`, `--no-collapse`, `--no-nfc`, `--nfkc`, `--no-smart-chars`, `--no-zero-width`, `--no-bom`, `--no-control`, `--no-line-endings`.
Case
- `--case MODE`: `upper` / `lower` / `title` / `sentence`. Or per-column: `--case title:name,upper:sku`.
- Title case preserves all-caps tokens (`USA`) and lowercases mid-string particles (`of`, `and`).
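The title-case behaviour described above can be sketched in a few lines. `smart_title` and `PARTICLES` are hypothetical names, and the particle set contains only the two words the text mentions; the tool's real list is not given here.

```python
PARTICLES = {"of", "and"}  # assumption: only the particles named in the text


def smart_title(text: str) -> str:
    """Title-case text, keeping all-caps tokens and lowercasing mid-string particles."""
    result = []
    for index, word in enumerate(text.split(" ")):
        if len(word) > 1 and word.isupper():
            result.append(word)            # all-caps token such as USA stays as-is
        elif index > 0 and word.lower() in PARTICLES:
            result.append(word.lower())    # particle that is not the first word
        else:
            result.append(word.capitalize())
    return " ".join(result)


print(smart_title("department of health and safety USA"))
print(smart_title("of mice and men"))
```

A leading particle is still capitalized because only mid-string particles are lowercased.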
Audit + config
- `--full-changelog`: write every change (default caps to first 1000).
- `--config PATH` / `--save-config PATH`.
File
- `--sheet`, `--encoding`, `--header-row`: same as Find Duplicates.
Presets
| Preset | What it does |
|---|---|
| `minimal` | Trim + collapse only. |
| `excel-hygiene` (default) | Trim, collapse, NFC, smart-char fold, zero-width strip, BOM strip, control strip, line-ending normalize. |
| `paranoid` | `excel-hygiene` + NFKC compatibility fold (lossy). |
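The difference between the NFC step in `excel-hygiene` and the lossy NFKC fold in `paranoid` is easiest to see on concrete characters. This sketch uses only the standard-library `unicodedata` module; `trim_collapse` is a hypothetical helper standing in for the trim and collapse operations, not the tool's own code.

```python
import re
import unicodedata


def trim_collapse(value: str) -> str:
    """Strip the ends and collapse runs of spaces or tabs to one space."""
    return re.sub(r"[ \t]+", " ", value.strip())


decomposed = "Cafe\u0301"       # "e" followed by a combining acute accent
compat = "\ufb01le \u2460"      # "fi" ligature, then a circled digit one

# NFC composes the accent into a single code point and changes nothing else.
print(unicodedata.normalize("NFC", decomposed) == "Caf\u00e9")  # True
print(unicodedata.normalize("NFC", compat) == compat)           # True

# NFKC also folds compatibility characters; the original form is not recoverable.
print(unicodedata.normalize("NFKC", compat))                    # file 1

print(trim_collapse("  too   many\tspaces "))                   # too many spaces
```

Because NFKC rewrites characters such as ligatures and circled digits into plain equivalents, it makes sense that it is opt-in rather than part of the default preset.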
Recipes
# Safe defaults (preview, then apply)
python -m src.cli_text_clean messy.csv [--apply]
# Just trim + collapse, leave Unicode alone
python -m src.cli_text_clean messy.csv --preset minimal --apply
# Title-case names, upper-case SKUs
python -m src.cli_text_clean people.csv --case title:name,upper:sku --apply
# Clean only specific columns
python -m src.cli_text_clean orders.csv --columns vendor,product --apply
# Skip a free-text notes column
python -m src.cli_text_clean tickets.csv --skip notes --apply
Output files (with --apply)
| File | Contents |
|---|---|
| `{stem}_cleaned.csv` | Cleaned data |
| `{stem}_changes.csv` | `row`, `column`, `old`, `new`, `ops_applied` (capped to 1000; `--full-changelog` removes cap) |
Log: logs/text_clean_YYYYMMDD_HHMMSS.log.
Analyzer
python -m src.cli_analyze INPUT_FILE [OPTIONS]
Read-only scan; surfaces every detector finding without modifying the file.
Options
- `--sample-rows N`: cap on rows scanned (default 1000).
- `--json`: print findings as a JSON array on stdout.
- `--strict`: exit non-zero on any warn/error finding.
JSON schema (one object per finding)
{
  "id": "smart_punctuation_in_data",
  "severity": "warn",
  "confidence": "high",
  "fix_action": "fold_smart_punctuation",
  "pre_applied": false,
  "tool": "02_text_cleaner",
  "count": 17,
  "description": "17 cell(s) contain curly quotes…",
  "column": null,
  "samples": [{"row": 3, "column": "name", "value": "“Alice”"}]
}
Field meanings
- `severity`: `info` / `warn` / `error`. Only `error` blocks the GUI gate.
- `confidence`: `high` (one-click), `medium` (preview), `low` (opt-in).
- `fix_action`: id of the algorithm in `src/core/fixes.py`. Empty for informational-only.
- `pre_applied`: `true` for fixes already applied during the byte-level read pass.
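A downstream script might consume the `--json` output along these lines. This is a sketch under assumptions: the sample array is hand-written to match the schema above, the second finding's `id` is made up for illustration, and an empty `fix_action` is assumed to be an empty string.

```python
import json
from collections import Counter

# Hand-written sample in the documented shape; the second id is hypothetical.
SAMPLE = """[
  {"id": "smart_punctuation_in_data", "severity": "warn", "confidence": "high",
   "fix_action": "fold_smart_punctuation", "pre_applied": false,
   "tool": "02_text_cleaner", "count": 17, "description": "example",
   "column": null, "samples": []},
  {"id": "example_info_finding", "severity": "info", "confidence": "low",
   "fix_action": "", "pre_applied": false,
   "tool": "02_text_cleaner", "count": 1, "description": "example",
   "column": null, "samples": []}
]"""


def summarize(raw: str):
    """Count findings by severity and list fixes that look one-click applicable."""
    findings = json.loads(raw)
    by_severity = Counter(f["severity"] for f in findings)
    # Mirrors the documented --strict rule: any warn or error finding fails.
    strict_failed = any(f["severity"] in {"warn", "error"} for f in findings)
    one_click = [f["fix_action"] for f in findings
                 if f["confidence"] == "high"
                 and f["fix_action"]
                 and not f["pre_applied"]]
    return by_severity, strict_failed, one_click


by_severity, strict_failed, one_click = summarize(SAMPLE)
print(dict(by_severity), strict_failed, one_click)
```

In practice the string would come from the stdout of `python -m src.cli_analyze FILE --json` rather than a literal.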
Detectors
Smart punctuation, NBSP / Unicode whitespace, zero-width chars, dirty headers, whitespace padding, null-like sentinels, mojibake fingerprints, mixed-case email columns, inconsistent date formats, near-duplicate rows, leading-zero IDs, mixed line endings, encoding decode failure, U+FFFD presence.
Add a detector: append entry in analyze.py + matching fix in fixes.py. No other call sites change.