Refactors all 10 docs (README, USER-GUIDE, CLI-REFERENCE, REQUIREMENTS, TECHNICAL, DEVELOPER, BUSINESS, DECISIONS, RECOVERY, docs/README) from prose-heavy to bullet-heavy + table-heavy. Same information density, significantly less reading load.

Net: 2600 → 1652 lines (~37% reduction) while adding the new content that landed since v1.6:
- Format Standardizer (3rd Ready tool)
- 199-row buyer corpus
- src/core/errors.py structured hierarchy + ensure_dataframe / ensure_choice / wrap_file_read|write / format_for_user helpers
- src/core/_constants.py shared USPS/state lookup tables
- Cross-tool audit fixes (NaN matching, removed_df schema, validation, enum-bounds checks, forward-compat config)
- Per-domain error_policy across format standardizers
- Inconsistent-date-format detector
- Excel header-row auto-detection + write_file delimiter param

Per-doc changes:
- README.md (175 → 71): 9-tool table at top, status column, 3 CLI entry points listed, dropped repeated marketing prose.
- docs/README.md (38 → 27): pure index with buyer-facing vs creator-only split + version footer.
- USER-GUIDE.md (208 → 118): tool table replaces script descriptions, troubleshooting compressed to bullets, gate explanation tightened.
- CLI-REFERENCE.md (451 → 235): collapsed flag tables, removed redundant intro text, kept full recipes section.
- REQUIREMENTS.md (146 → 129): 18 numbered sections (was 17), added §18 Error Handling, formatting tightened to single-line entries.
- TECHNICAL.md (570 → 350): collapsed §3 build pipeline tables, merged redundant §3.5-3.7 OS sections, added §7 (Error handling) + §11.3 (Format Standardizer spec) + §11.4-11.7 (analyzer / gate / Review page / repair_bytes promoted from §10.2.x sub-numbering).
- DEVELOPER.md (285 → 161): module map table replaces per-file prose, extension recipes condensed, new §Errors covers when to use each hierarchy class.
- BUSINESS.md (278 → 225): collapsed prose to tables (use cases, competitive landscape, costs, risks); honest-status updated.
- DECISIONS.md (269 → 189): scoring rubric + GUI matrix preserved, decision log compressed to single-line entries, added v1.6 entries (Format Standardizer Ready, errors module).
- RECOVERY.md (180 → 147): rebuild steps as numbered + tabular, external dependencies as one table, recovery priorities tightened.

No information removed; redundancy compressed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CLI Reference
Three CLI modules, one per Ready tool:
| Module | Command | Purpose |
|---|---|---|
| `src.cli` | `python -m src.cli FILE` | Deduplicator |
| `src.cli_text_clean` | `python -m src.cli_text_clean FILE` | Text Cleaner |
| `src.cli_analyze` | `python -m src.cli_analyze FILE` | Analyzer (read-only scan) |
Every command is preview-only by default — add --apply to write output.
Deduplicator
python -m src.cli INPUT_FILE [OPTIONS]
Options
Core
- `--apply`: write output files (default: preview).
- `-o, --output PATH`: output path (default `{input}_deduplicated.csv`).
Column selection
- `-s, --subset COLS`: comma-separated columns to match on (default: auto-detect).
- `-k, --key COLS`: strong-key columns; each becomes an independent exact-match strategy (`fb_id`, `ein`, `sku`).
Fuzzy matching
- `--fuzzy COLS`: comma-separated columns to fuzzy-match.
- `-a, --algorithm ALG`: `levenshtein` / `jaro_winkler` (default) / `token_set_ratio`.
- `-t, --threshold N`: similarity 0-100 (default 85).
Normalization
- `--normalize COL:TYPE`: comma-separated `col:type` pairs. Types: `email`, `phone`, `name`, `address`, `string`.
| Type | Effect | Example |
|---|---|---|
| `email` | lowercase, strip Gmail dots, strip +tag | `John.Doe+x@gmail.com` → `johndoe@gmail.com` |
| `phone` | E.164 (+ ext preserved) | `(555) 123-4567 ext 100` → `+15551234567;ext=100` |
| `name` | strip titles + suffixes + particles, case-fold | `Dr. Charles de Gaulle Jr.` → `charles gaulle` |
| `address` | USPS abbrevs + state name → 2-letter, case-fold | `123 Main Street, California` → `123 main st ca` |
| `string` | trim + collapse + case-fold | `HELLO WORLD` → `hello world` |
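To make the table concrete, here is a minimal sketch of what the `email` and `string` normalizers might look like. The function names are hypothetical and this is not the project's implementation; it assumes that only `gmail.com` addresses get dot-stripping and that the `+tag` is removed for every domain.

```python
import re


def normalize_string(value: str) -> str:
    """Trim, collapse whitespace runs to one space, and case-fold."""
    return re.sub(r"\s+", " ", value.strip()).casefold()


def normalize_email(value: str) -> str:
    """Lowercase, drop any +tag, and drop dots in Gmail local parts."""
    cleaned = value.strip().lower()
    local, sep, domain = cleaned.rpartition("@")
    if not sep:
        return cleaned  # no "@": not an address, only case and padding change
    local = local.split("+", 1)[0]  # strip the +tag
    if domain == "gmail.com":  # assumption: only this domain ignores dots
        local = local.replace(".", "")
    return f"{local}@{domain}"


print(normalize_email("John.Doe+x@gmail.com"))  # johndoe@gmail.com
print(normalize_string("  HELLO   WORLD "))     # hello world
```

Normalizing before comparison is what lets an exact-match strategy treat both spellings of an address as the same key.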
Survivor selection
- `--survivor RULE`: `first` (default) / `last` / `most-complete` / `most-recent`.
- `--date-column COL`: required for `most-recent`.
- `--merge`: fill blanks in survivor from removed rows.
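The survivor rules and `--merge` can be sketched on plain dictionaries as follows. This is an illustration only: `pick_survivor` and `merge_blanks` are hypothetical names, ties are assumed to go to the earliest row, and dates are assumed to be ISO 8601 strings.

```python
from datetime import datetime


def is_blank(value) -> bool:
    return value is None or str(value).strip() == ""


def pick_survivor(rows, rule="first", date_column=None):
    """Return the index of the surviving row within one match group."""
    if rule == "first":
        return 0
    if rule == "last":
        return len(rows) - 1
    if rule == "most-complete":
        # max() returns the first maximal item, so ties keep the earliest row
        return max(range(len(rows)),
                   key=lambda i: sum(not is_blank(v) for v in rows[i].values()))
    if rule == "most-recent":
        if date_column is None:
            raise ValueError("most-recent requires a date column")

        def stamp(i):
            raw = rows[i].get(date_column)
            # assumption: blank dates rank as the oldest possible value
            return datetime.min if is_blank(raw) else datetime.fromisoformat(raw)

        return max(range(len(rows)), key=stamp)
    raise ValueError(f"unknown survivor rule: {rule}")


def merge_blanks(survivor, removed):
    """Fill blank survivor fields from removed rows, first non-blank wins."""
    merged = dict(survivor)
    for row in removed:
        for column, value in row.items():
            if is_blank(merged.get(column)) and not is_blank(value):
                merged[column] = value
    return merged


group = [
    {"name": "Ann Lee", "email": "", "updated_at": "2024-01-05"},
    {"name": "Ann Lee", "email": "ann@example.com", "updated_at": "2023-06-01"},
]
keep = pick_survivor(group, "most-recent", "updated_at")
others = [row for i, row in enumerate(group) if i != keep]
print(merge_blanks(group[keep], others))
```

In this example the newer row survives and picks up the email address from the removed row, which is the combination the `--survivor most-recent --merge` recipe relies on.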
Interactive review
- `--review`: prompt y/n/s per match group with side-by-side diff.
Configuration
- `--config PATH`: load all settings from JSON.
- `--save-config PATH`: save current settings to JSON.
File handling
- `--sheet NAME|N`: Excel sheet name or 0-based index.
- `--encoding ENC`: override auto-detected encoding.
- `--header-row N`: 0-based header row.
Recipes
# Basic auto-detect dedup
python -m src.cli customers.csv [--apply]
# Fuzzy name match at 80%
python -m src.cli customers.csv --fuzzy name --threshold 80 --apply
# Multiple strong keys (OR logic)
python -m src.cli donors.csv --key fb_id,ein --apply
# Most-complete row + merge missing fields
python -m src.cli contacts.csv --survivor most-complete --merge --apply
# Most-recent + merge
python -m src.cli contacts.csv --survivor most-recent --date-column updated_at --merge --apply
# Interactive review
python -m src.cli customers.csv --review --apply
# Save / load profile
python -m src.cli customers.csv --fuzzy name --threshold 80 --save-config dedup.json
python -m src.cli new.csv --config dedup.json --apply
# Excel
python -m src.cli data.xlsx --sheet "Sales" --apply
Algorithms
- `jaro_winkler` (default): best for short strings (names); weights early chars.
- `levenshtein`: edit-distance ratio; typos and transpositions.
- `token_set_ratio`: best for addresses; ignores word order.
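As a rough illustration of how an edit-distance score maps onto the 0-100 threshold scale, the sketch below computes a plain Levenshtein distance and rescales it by the longer string's length. The scaling formula is an assumption; the library the tool actually uses may define its ratio differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Similarity on a 0-100 scale; assumed scaling, not the tool's exact one."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


score = similarity("jon smith", "john smith")
print(score >= 85)  # True: one edit over ten characters clears the default threshold
```

Note that plain Levenshtein counts a swapped pair of adjacent characters as two edits, so short strings with a transposition lose more score than the same typo would in a long string.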
Auto-detection
When neither `--subset` nor `--fuzzy` is given, columns are detected by name:
| Pattern | Algorithm | Threshold | Normalizer | Key |
|---|---|---|---|---|
| Email | exact | 100% | email | strong |
| Phone | exact | 100% | phone | strong |
| Name | jaro_winkler | 85% | name | weak |
| Address | token_set_ratio | 80% | address | weak |
Strategy rules: strong keys → standalone OR; weak keys → AND-paired with each strong key; no strong keys → weak promoted to standalone; no patterns → exact match on all columns.
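One possible reading of the strategy rules, sketched as a function that returns a list of strategies, where each strategy is a tuple of columns that must all match (AND) and any matching strategy links two rows (OR). `build_strategies` is a hypothetical name, and the way weak keys pair with strong keys is an interpretation of the sentence above rather than confirmed behavior.

```python
def build_strategies(strong, weak, all_columns):
    """Return a list of column tuples; columns in a tuple are AND-ed,
    and the tuples themselves are OR-ed."""
    if not strong and not weak:
        # no recognized patterns: one exact match across every column
        return [tuple(all_columns)]
    if not strong:
        # no strong keys: each weak key is promoted to a standalone strategy
        return [(column,) for column in weak]
    strategies = [(column,) for column in strong]  # strong keys stand alone
    # assumed reading: each weak key is paired with each strong key
    strategies += [(s, w) for s in strong for w in weak]
    return strategies


print(build_strategies(["email", "phone"], ["name"], ["email", "phone", "name"]))
print(build_strategies([], ["name", "address"], ["name", "address"]))
print(build_strategies([], [], ["sku", "qty"]))
```

The three calls cover the mixed case, the weak-only case, and the fallback to an all-columns exact match.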
Output files (with --apply)
| File | Contents |
|---|---|
| `{stem}_deduplicated.csv` | Cleaned data |
| `{stem}_removed.csv` | Removed rows |
| `{stem}_match_groups.csv` | `_group_id`, `_is_survivor`, `_confidence`, `_matched_on`, `_original_row` + originals |
Log: logs/dedup_YYYYMMDD_HHMMSS.log.
Text Cleaner
python -m src.cli_text_clean INPUT_FILE [OPTIONS]
Character-level hygiene. See TECHNICAL.md §10.2 for the spec.
Options
Core
- `--apply`: write output (default: preview).
- `-o, --output PATH`: output path (default `{input}_cleaned.csv`).
- `--preset NAME`: `minimal` / `excel-hygiene` (default) / `paranoid`.
Scope
- `--columns COLS`: comma-separated columns to clean (default: all string columns).
- `--skip COLS`: exclude these columns.
Per-op overrides (override the active preset)
- `--no-trim`, `--no-collapse`, `--no-nfc`, `--nfkc`, `--no-smart-chars`, `--no-zero-width`, `--no-bom`, `--no-control`, `--no-line-endings`.
Case
- `--case MODE`: `upper` / `lower` / `title` / `sentence`. Or per-column: `--case title:name,upper:sku`.
- Title case preserves all-caps tokens (`USA`) and lowercases mid-string particles (`of`, `and`).
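The title-case behavior described above could look roughly like this. It is a sketch under stated assumptions: the particle set contains only the two words named in the text (a real list is likely longer), tokens are split on single spaces, and `title_case` is a hypothetical name.

```python
PARTICLES = {"of", "and"}  # only the examples given in the text


def title_case(text: str) -> str:
    words = text.split(" ")
    result = []
    for index, word in enumerate(words):
        if len(word) > 1 and word.isupper():
            # keep all-caps tokens such as USA untouched
            # (side effect: fully upper-case input is left as-is)
            result.append(word)
        elif index > 0 and word.lower() in PARTICLES:
            # particles stay lowercase unless they start the string
            result.append(word.lower())
        else:
            result.append(word[:1].upper() + word[1:].lower())
    return " ".join(result)


print(title_case("bank of america USA"))  # Bank of America USA
print(title_case("of mice and men"))      # Of Mice and Men
```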
Audit + config
- `--full-changelog`: write every change (default caps to first 1000).
- `--config PATH` / `--save-config PATH`.
File
- `--sheet`, `--encoding`, `--header-row`: same as Deduplicator.
Presets
| Preset | What it does |
|---|---|
| `minimal` | Trim + collapse only. |
| `excel-hygiene` (default) | Trim, collapse, NFC, smart-char fold, zero-width strip, BOM strip, control strip, line-ending normalize. |
| `paranoid` | `excel-hygiene` + NFKC compatibility fold (lossy). |
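The difference between the three presets can be sketched with the standard library alone. This is not the tool's code: `clean_cell` is a hypothetical name, the order of operations is assumed, and the smart-character and zero-width tables are deliberately small samples.

```python
import re
import unicodedata

# sample mappings only; the real tables are presumably larger
SMART_TABLE = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2026": "...", # en dash, ellipsis
})
ZERO_WIDTH_TABLE = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060"))


def clean_cell(value: str, preset: str = "excel-hygiene") -> str:
    if preset not in {"minimal", "excel-hygiene", "paranoid"}:
        raise ValueError(f"unknown preset: {preset}")
    if preset != "minimal":
        value = value.replace("\ufeff", "")                      # BOM strip
        value = value.translate(ZERO_WIDTH_TABLE)                # zero-width strip
        value = value.replace("\r\n", "\n").replace("\r", "\n")  # line endings
        value = "".join(ch for ch in value                       # control strip
                        if ch in "\n\t" or unicodedata.category(ch) != "Cc")
        value = value.translate(SMART_TABLE)                     # smart-char fold
        form = "NFKC" if preset == "paranoid" else "NFC"
        value = unicodedata.normalize(form, value)
    # trim + collapse (the only steps in the minimal preset)
    return re.sub(r"[ \t]+", " ", value.strip())


print(clean_cell("\ufeff  \u201cHello\u201d\u200b   world \r\n"))  # "Hello" world
print(clean_cell("\ufb01le", "paranoid"))                          # file
```

The second call shows why NFKC is described as lossy: the single ligature character U+FB01 becomes the two letters `fi`, and the original cannot be recovered from the output.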
Recipes
# Safe defaults (preview, then apply)
python -m src.cli_text_clean messy.csv [--apply]
# Just trim + collapse, leave Unicode alone
python -m src.cli_text_clean messy.csv --preset minimal --apply
# Title-case names, upper-case SKUs
python -m src.cli_text_clean people.csv --case title:name,upper:sku --apply
# Clean only specific columns
python -m src.cli_text_clean orders.csv --columns vendor,product --apply
# Skip a free-text notes column
python -m src.cli_text_clean tickets.csv --skip notes --apply
Output files (with --apply)
| File | Contents |
|---|---|
| `{stem}_cleaned.csv` | Cleaned data |
| `{stem}_changes.csv` | `row`, `column`, `old`, `new`, `ops_applied` (capped to 1000; `--full-changelog` removes cap) |
Log: logs/text_clean_YYYYMMDD_HHMMSS.log.
Analyzer
python -m src.cli_analyze INPUT_FILE [OPTIONS]
Read-only scan; surfaces every detector finding without modifying the file.
Options
- `--sample-rows N`: cap on rows scanned (default 1000).
- `--json`: print findings as a JSON array on stdout.
- `--strict`: exit non-zero on any warn/error finding.
JSON schema (one object per finding)
{
"id": "smart_punctuation_in_data",
"severity": "warn",
"confidence": "high",
"fix_action": "fold_smart_punctuation",
"pre_applied": false,
"tool": "02_text_cleaner",
"count": 17,
"description": "17 cell(s) contain curly quotes…",
"column": null,
"samples": [{"row": 3, "column": "name", "value": "“Alice”"}]
}
Field meanings
- `severity`: `info` / `warn` / `error`. Only `error` blocks the GUI gate.
- `confidence`: `high` (one-click), `medium` (preview), `low` (opt-in).
- `fix_action`: id of the algorithm in `src/core/fixes.py`. Empty for informational-only.
- `pre_applied`: `true` for fixes already applied during the byte-level read pass.
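A script consuming the `--json` output might apply the severity rules above like this. The helper names are hypothetical, the sample payload is a trimmed copy of the schema example, and the exit code of `1` is an assumption (the text only says "non-zero").

```python
import json


def summarize(findings):
    """Count findings per severity level."""
    counts = {"info": 0, "warn": 0, "error": 0}
    for finding in findings:
        counts[finding["severity"]] += 1
    return counts


def strict_exit_code(findings):
    """Mirror --strict: non-zero when any warn or error finding exists."""
    return 1 if any(f["severity"] in {"warn", "error"} for f in findings) else 0


def blocks_gui_gate(findings):
    """Only error-level findings block the GUI gate."""
    return any(f["severity"] == "error" for f in findings)


payload = """[{"id": "smart_punctuation_in_data", "severity": "warn",
               "confidence": "high", "fix_action": "fold_smart_punctuation",
               "pre_applied": false, "count": 17}]"""
findings = json.loads(payload)
print(summarize(findings), strict_exit_code(findings), blocks_gui_gate(findings))
```

In practice the payload would come from the stdout of `python -m src.cli_analyze FILE --json` rather than from an inline string.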
Detectors
Smart punctuation, NBSP / Unicode whitespace, zero-width chars, dirty headers, whitespace padding, null-like sentinels, mojibake fingerprints, mixed-case email columns, inconsistent date formats, near-duplicate rows, leading-zero IDs, mixed line endings, encoding decode failure, U+FFFD presence.
Add a detector: append an entry in `analyze.py` and a matching fix in `fixes.py`. No other call sites change.