
Michael abb720997e docs: tight, scannable rewrite — every item earns its place

Refactors all 10 docs (README, USER-GUIDE, CLI-REFERENCE, REQUIREMENTS,
TECHNICAL, DEVELOPER, BUSINESS, DECISIONS, RECOVERY, docs/README) from
prose-heavy to bullet-heavy + table-heavy. Same information density,
significantly less reading load.

Net: 2600 → 1652 lines (~37% reduction) WHILE adding the new content
that landed since v1.6:

- Format Standardizer (3rd Ready tool)
- 199-row buyer corpus
- src/core/errors.py structured hierarchy + ensure_dataframe /
  ensure_choice / wrap_file_read|write / format_for_user helpers
- src/core/_constants.py shared USPS/state lookup tables
- Cross-tool audit fixes (NaN matching, removed_df schema, validation,
  enum-bounds checks, forward-compat config)
- Per-domain error_policy across format standardizers
- Inconsistent-date-format detector
- Excel header-row auto-detection + write_file delimiter param

Per-doc changes:

- README.md (175 → 71): 9-tool table at top, status column, 3 CLI
  entry points listed, dropped repeated marketing prose.
- docs/README.md (38 → 27): pure index — buyer-facing vs creator-only
  split + version footer.
- USER-GUIDE.md (208 → 118): tool table replaces script descriptions,
  troubleshooting compressed to bullets, gate explanation tightened.
- CLI-REFERENCE.md (451 → 235): collapsed flag tables, removed
  redundant intro text, kept full recipes section.
- REQUIREMENTS.md (146 → 129): 18 numbered sections (was 17), added
  §18 Error Handling, formatting tightened to single-line entries.
- TECHNICAL.md (570 → 350): collapsed §3 build pipeline tables, merged
  redundant §3.5-3.7 OS sections, added §7 (Error handling) +
  §11.3 (Format Standardizer spec) + §11.4-11.7 (analyzer / gate /
  Review page / repair_bytes promoted from §10.2.x sub-numbering).
- DEVELOPER.md (285 → 161): module map table replaces per-file prose,
  extension recipes condensed, new §Errors covers when to use each
  hierarchy class.
- BUSINESS.md (278 → 225): collapsed prose to tables (use cases,
  competitive landscape, costs, risks); honest-status updated.
- DECISIONS.md (269 → 189): scoring rubric + GUI matrix preserved,
  decision log compressed to single-line entries, added v1.6 entries
  (Format Standardizer Ready, errors module).
- RECOVERY.md (180 → 147): rebuild steps as numbered + tabular,
  external dependencies as one table, recovery priorities tightened.

No information removed; redundancy compressed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-01 02:49:29 +00:00


CLI Reference

Three CLI modules, one per Ready tool:

| Module | Command | Purpose |
| --- | --- | --- |
| `src.cli` | `python -m src.cli FILE` | Deduplicator |
| `src.cli_text_clean` | `python -m src.cli_text_clean FILE` | Text Cleaner |
| `src.cli_analyze` | `python -m src.cli_analyze FILE` | Analyzer (read-only scan) |

The Deduplicator and Text Cleaner are preview-only by default; add --apply to write output. The Analyzer is read-only and never writes.

Deduplicator

python -m src.cli INPUT_FILE [OPTIONS]

Options

Core

--apply — write output files (default: preview).
-o, --output PATH — output path (default {input}_deduplicated.csv).

Column selection

-s, --subset COLS — comma-separated columns to match on (default: auto-detect).
-k, --key COLS — strong-key columns; each becomes an independent exact-match strategy (fb_id, ein, sku).
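
The OR logic behind multiple strong keys can be pictured with a small union-find sketch: two rows land in the same group if they share a value in any one key column. The function name `group_by_any_key`, the choice to skip blank values, and the transitive grouping are assumptions made for illustration, not the tool's confirmed behavior.

```python
def group_by_any_key(rows, keys):
    """Group row indices that share a non-blank value in ANY key column."""
    parent = list(range(len(rows)))

    def find(i):
        # Walk up to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for key in keys:
        first_seen = {}  # value -> index of the first row carrying it
        for i, row in enumerate(rows):
            value = row.get(key)
            if not value:  # assumption: blanks never match each other
                continue
            if value in first_seen:
                parent[find(i)] = find(first_seen[value])
            else:
                first_seen[value] = i

    groups = {}
    for i in range(len(rows)):
        groups.setdefault(find(i), []).append(i)
    # Only groups with more than one row are duplicates.
    return [members for members in groups.values() if len(members) > 1]


rows = [
    {"fb_id": "1", "ein": "A"},
    {"fb_id": "1", "ein": ""},
    {"fb_id": "2", "ein": "A"},
    {"fb_id": "3", "ein": "B"},
]
print(group_by_any_key(rows, ["fb_id", "ein"]))
```

Here rows 0 and 1 share `fb_id`, rows 0 and 2 share `ein`, so all three fall into one group even though rows 1 and 2 have no key in common.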

Fuzzy matching

--fuzzy COLS — comma-separated columns to fuzzy-match.
-a, --algorithm ALG — levenshtein / jaro_winkler (default) / token_set_ratio.
-t, --threshold N — similarity 0-100 (default 85).

Normalization

--normalize COL:TYPE — comma-separated col:type pairs. Types: email, phone, name, address, string.

| Type | Effect | Example |
| --- | --- | --- |
| `email` | lowercase, strip Gmail dots, strip `+tag` | `John.Doe+x@gmail.com` → `johndoe@gmail.com` |
| `phone` | E.164 (+ ext preserved) | `(555) 123-4567 ext 100` → `+15551234567;ext=100` |
| `name` | strip titles + suffixes + particles, case-fold | `Dr. Charles de Gaulle Jr.` → `charles gaulle` |
| `address` | USPS abbrevs + state name → 2-letter, case-fold | `123 Main Street, California` → `123 main st ca` |
| `string` | trim + collapse + case-fold | `HELLO WORLD` → `hello world` |
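
To make the `email` and `string` rows concrete, here is a minimal sketch of what such normalizers might look like. The function names are hypothetical, and limiting dot-stripping to the `gmail.com` domain is an assumption drawn from the table's wording; the real normalizers may differ.

```python
import re


def normalize_string(value: str) -> str:
    """Trim, collapse runs of whitespace to one space, and case-fold."""
    return re.sub(r"\s+", " ", value.strip()).casefold()


def normalize_email(value: str) -> str:
    """Lowercase, drop a +tag, and drop dots in Gmail local parts."""
    cleaned = value.strip().casefold()
    local, sep, domain = cleaned.partition("@")
    if not sep:
        return cleaned  # not an address; leave it merely case-folded
    local = local.split("+", 1)[0]  # strip "+tag"
    if domain == "gmail.com":  # assumption: only this domain ignores dots
        local = local.replace(".", "")
    return f"{local}@{domain}"


print(normalize_email("John.Doe+x@gmail.com"))
print(normalize_string("  HELLO   WORLD "))
```

Comparing the normalized values rather than the raw ones is what lets an exact-match strategy treat `John.Doe+x@gmail.com` and `johndoe@gmail.com` as the same key.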

Survivor selection

--survivor RULE — first (default) / last / most-complete / most-recent.
--date-column COL — required for most-recent.
--merge — fill blanks in survivor from removed rows.
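
The interaction between `--survivor most-complete` and `--merge` can be sketched as follows. This is an illustration on plain dicts; the helper names, the definition of "blank", and the tie-break (first row wins) are assumptions rather than documented behavior.

```python
def is_blank(value) -> bool:
    """Treat None and whitespace-only strings as blank."""
    return value is None or str(value).strip() == ""


def pick_most_complete(rows):
    """Return the row with the most non-blank fields (first wins on ties)."""
    return max(rows, key=lambda row: sum(not is_blank(v) for v in row.values()))


def merge_group(rows):
    """Pick a survivor, then fill its blanks from the other rows in order."""
    survivor = dict(pick_most_complete(rows))
    for row in rows:
        for column, value in row.items():
            if is_blank(survivor.get(column)) and not is_blank(value):
                survivor[column] = value
    return survivor


group = [
    {"name": "Ann", "email": "", "phone": "555"},
    {"name": "Ann", "email": "a@x.com", "phone": ""},
    {"name": "Ann", "email": "", "phone": ""},
]
print(merge_group(group))
```

Without the merge step the survivor would keep its blank `email`; with it, the value is recovered from a row that would otherwise be discarded.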

Interactive review

--review — prompt y/n/s per match group with side-by-side diff.

Configuration

--config PATH — load all settings from JSON.
--save-config PATH — save current settings to JSON.

File handling

--sheet NAME|N — Excel sheet name or 0-based index.
--encoding ENC — override auto-detected encoding.
--header-row N — 0-based header row.

Recipes

# Basic auto-detect dedup
python -m src.cli customers.csv [--apply]

# Fuzzy name match at 80%
python -m src.cli customers.csv --fuzzy name --threshold 80 --apply

# Multiple strong keys (OR logic)
python -m src.cli donors.csv --key fb_id,ein --apply

# Most-complete row + merge missing fields
python -m src.cli contacts.csv --survivor most-complete --merge --apply

# Most-recent + merge
python -m src.cli contacts.csv --survivor most-recent --date-column updated_at --merge --apply

# Interactive review
python -m src.cli customers.csv --review --apply

# Save / load profile
python -m src.cli customers.csv --fuzzy name --threshold 80 --save-config dedup.json
python -m src.cli new.csv       --config dedup.json --apply

# Excel
python -m src.cli data.xlsx --sheet "Sales" --apply

Algorithms

jaro_winkler (default) — best for short strings (names); weights early chars.
levenshtein — edit-distance ratio; typos and transpositions.
token_set_ratio — best for addresses; ignores word order.
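
For intuition about how a 0-100 similarity relates to edit distance, the sketch below implements plain Levenshtein distance and one common way to turn it into a percentage. The scaling formula and the simplified `token_set_key` helper are assumptions; the tool's own scoring (and its Jaro-Winkler implementation, not shown here) may differ.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Scale the distance to 0-100 by the longer string's length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


def token_set_key(text: str) -> str:
    """Order-insensitive key: unique case-folded tokens, sorted."""
    return " ".join(sorted(set(text.casefold().split())))


print(levenshtein("kitten", "sitting"))
print(similarity("jon", "john"))
print(token_set_key("Main St 123") == token_set_key("123 main st"))
```

Note that plain Levenshtein counts a transposition such as `ab` → `ba` as two edits, so short strings with swapped letters score lower than they would under a transposition-aware metric.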

Auto-detection

When neither --subset nor --fuzzy is given, columns are detected by name:

| Pattern | Algorithm | Threshold | Normalizer | Key |
| --- | --- | --- | --- | --- |
| Email | exact | 100% | email | strong |
| Phone | exact | 100% | phone | strong |
| Name | jaro_winkler | 85% | name | weak |
| Address | token_set_ratio | 80% | address | weak |

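Strategy rules: each strong key runs as a standalone strategy, and strategies are combined with OR. Each weak key is AND-paired with each strong key. If there are no strong keys, weak keys are promoted to standalone strategies. If no pattern matches any column name, the fallback is an exact match on all columns.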
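
One literal reading of the strategy rules can be written down as a small function. Each strategy is a tuple of columns that must all match (AND); a pair of rows is a duplicate if any strategy matches (OR). The name `build_strategies` and this tuple representation are hypothetical, and the actual planner may combine keys differently.

```python
def build_strategies(strong, weak, all_columns):
    """Return a list of column tuples; columns in a tuple are AND-ed,
    and the tuples themselves are OR-ed."""
    if not strong and not weak:
        # No recognized patterns: exact match on every column.
        return [tuple(all_columns)]
    if not strong:
        # No strong keys: weak keys are promoted to standalone strategies.
        return [(w,) for w in weak]
    strategies = [(s,) for s in strong]                      # standalone strong keys
    strategies += [(s, w) for s in strong for w in weak]     # strong AND weak pairs
    return strategies


columns = ["email", "name", "address", "notes"]
print(build_strategies(["email"], ["name"], columns))
print(build_strategies([], ["name", "address"], columns))
print(build_strategies([], [], columns))
```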

Output files (with `--apply`)

| File | Contents |
| --- | --- |
| `{stem}_deduplicated.csv` | Cleaned data |
| `{stem}_removed.csv` | Removed rows |
| `{stem}_match_groups.csv` | `_group_id`, `_is_survivor`, `_confidence`, `_matched_on`, `_original_row` + originals |

Log: logs/dedup_YYYYMMDD_HHMMSS.log.

Text Cleaner

python -m src.cli_text_clean INPUT_FILE [OPTIONS]

Character-level hygiene. See TECHNICAL.md §10.2 for the spec.

Options

Core

--apply — write output (default: preview).
-o, --output PATH — output path (default {input}_cleaned.csv).
--preset NAME — minimal / excel-hygiene (default) / paranoid.

Scope

--columns COLS — comma-separated columns to clean (default: all string columns).
--skip COLS — exclude these columns.

Per-op overrides (override the active preset)

--no-trim, --no-collapse, --no-nfc, --nfkc, --no-smart-chars, --no-zero-width, --no-bom, --no-control, --no-line-endings.

Case

--case MODE — upper / lower / title / sentence. Or per-column: --case title:name,upper:sku.
Title case preserves all-caps tokens (USA) and lowercases mid-string particles (of, and).
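
The title-case behavior described above can be approximated in a few lines. This is a sketch under assumptions: the particle set contains only the two examples the text names (`of`, `and`), "mid-string" is taken to mean any position except the first word, and `smart_title` is a hypothetical name.

```python
PARTICLES = {"of", "and"}  # only the examples given in the text; the real list is likely longer


def smart_title(text: str) -> str:
    """Title-case words, keep all-caps tokens, lowercase mid-string particles."""
    result = []
    for index, word in enumerate(text.split()):
        if len(word) > 1 and word.isupper():
            result.append(word)                # preserve acronyms such as USA
        elif index > 0 and word.lower() in PARTICLES:
            result.append(word.lower())        # particles stay lowercase unless first
        else:
            result.append(word.capitalize())
    return " ".join(result)


print(smart_title("bank of america USA"))
print(smart_title("of mice and men"))
```

A limitation of this simple rule is that input written entirely in capitals is left unchanged, because every token looks like an acronym.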

Audit + config

--full-changelog — write every change (default caps to first 1000).
--config PATH / --save-config PATH.

File

--sheet, --encoding, --header-row — same as Deduplicator.

Presets

| Preset | What it does |
| --- | --- |
| `minimal` | Trim + collapse only. |
| `excel-hygiene` (default) | Trim, collapse, NFC, smart-char fold, zero-width strip, BOM strip, control strip, line-ending normalize. |
| `paranoid` | `excel-hygiene` + NFKC compatibility fold (lossy). |
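
The operations listed for `excel-hygiene` map fairly directly onto the Python standard library. The sketch below is an approximation for a single cell value: the order of operations, the exact character sets, and the `clean_cell` name are assumptions, not the tool's specification. Passing `nfkc=True` mimics the lossy compatibility fold that `paranoid` adds.

```python
import re
import unicodedata

# Curly quotes and en/em dashes folded to ASCII (illustrative subset).
SMART_CHARS = str.maketrans({
    "\u201c": '"', "\u201d": '"',
    "\u2018": "'", "\u2019": "'",
    "\u2013": "-", "\u2014": "-",
})
ZERO_WIDTH = re.compile("[\u200b\u200c\u200d\u2060]")
CONTROL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")  # keeps tab and newline


def clean_cell(value: str, nfkc: bool = False) -> str:
    value = value.replace("\ufeff", "")                         # BOM strip
    value = value.replace("\r\n", "\n").replace("\r", "\n")     # line endings
    value = unicodedata.normalize("NFKC" if nfkc else "NFC", value)
    value = value.translate(SMART_CHARS)                        # smart-char fold
    value = ZERO_WIDTH.sub("", value)                           # zero-width strip
    value = CONTROL.sub("", value)                              # control strip
    value = re.sub(r"[ \t]+", " ", value)                       # collapse
    return value.strip()                                        # trim


print(clean_cell("\ufeff  \u201cAlice\u201d\u200b  Smith "))
print(clean_cell("\ufb01le", nfkc=True))
```

NFC leaves the `ﬁ` ligature (U+FB01) intact, while NFKC rewrites it to the two letters `fi`; that kind of irreversible substitution is why the compatibility fold is described as lossy.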

Recipes

# Safe defaults (preview, then apply)
python -m src.cli_text_clean messy.csv [--apply]

# Just trim + collapse, leave Unicode alone
python -m src.cli_text_clean messy.csv --preset minimal --apply

# Title-case names, upper-case SKUs
python -m src.cli_text_clean people.csv --case title:name,upper:sku --apply

# Clean only specific columns
python -m src.cli_text_clean orders.csv --columns vendor,product --apply

# Skip a free-text notes column
python -m src.cli_text_clean tickets.csv --skip notes --apply

Output files (with `--apply`)

| File | Contents |
| --- | --- |
| `{stem}_cleaned.csv` | Cleaned data |
| `{stem}_changes.csv` | `row`, `column`, `old`, `new`, `ops_applied` (capped at 1000; `--full-changelog` removes cap) |

Log: logs/text_clean_YYYYMMDD_HHMMSS.log.

Analyzer

python -m src.cli_analyze INPUT_FILE [OPTIONS]

Read-only scan; surfaces every detector finding without modifying the file.

Options

--sample-rows N — cap on rows scanned (default 1000).
--json — print findings as a JSON array on stdout.
--strict — exit non-zero on any warn/error finding.

JSON schema (one object per finding)

{
  "id": "smart_punctuation_in_data",
  "severity": "warn",
  "confidence": "high",
  "fix_action": "fold_smart_punctuation",
  "pre_applied": false,
  "tool": "02_text_cleaner",
  "count": 17,
  "description": "17 cell(s) contain curly quotes…",
  "column": null,
  "samples": [{"row": 3, "column": "name", "value": "“Alice”"}]
}

Field meanings

severity: info / warn / error. Only error blocks the GUI gate.
confidence: high (one-click), medium (preview), low (opt-in).
fix_action: id of the algorithm in src/core/fixes.py. Empty for informational-only findings.
pre_applied: true for fixes already applied during the byte-level read pass.
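
A script that consumes the `--json` output might look like the sketch below. It parses an array shaped like the schema above, totals counts per severity, and mirrors the `--strict` idea of failing on any warn or error finding. The helper names and the choice of exit code 1 are assumptions; the source only says the exit is non-zero.

```python
import json

BLOCKING = {"warn", "error"}  # severities that --strict treats as failures


def summarize(findings_json: str) -> dict:
    """Total the `count` field per severity."""
    totals = {}
    for finding in json.loads(findings_json):
        severity = finding["severity"]
        totals[severity] = totals.get(severity, 0) + finding["count"]
    return totals


def strict_exit_code(findings_json: str) -> int:
    """Return 1 if any finding is warn or error, else 0 (1 is an assumed value)."""
    findings = json.loads(findings_json)
    return 1 if any(f["severity"] in BLOCKING for f in findings) else 0


sample = json.dumps([
    {
        "id": "smart_punctuation_in_data",
        "severity": "warn",
        "confidence": "high",
        "fix_action": "fold_smart_punctuation",
        "pre_applied": False,
        "tool": "02_text_cleaner",
        "count": 17,
        "column": None,
        "samples": [{"row": 3, "column": "name", "value": "\u201cAlice\u201d"}],
    }
])
print(summarize(sample))
print(strict_exit_code(sample))
```

In practice the JSON string would come from the stdout of `python -m src.cli_analyze FILE --json` rather than from an inline sample.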

Detectors

Smart punctuation, NBSP / Unicode whitespace, zero-width chars, dirty headers, whitespace padding, null-like sentinels, mojibake fingerprints, mixed-case email columns, inconsistent date formats, near-duplicate rows, leading-zero IDs, mixed line endings, encoding decode failure, U+FFFD presence.
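
As an illustration of how simple one of these detectors can be, the sketch below checks raw bytes for mixed line endings. It is not taken from analyze.py; the function names and the returned dictionary shape are hypothetical.

```python
def line_ending_counts(data: bytes) -> dict:
    """Count CRLF, lone LF, and lone CR sequences in raw file bytes."""
    crlf = data.count(b"\r\n")
    return {
        "crlf": crlf,
        "lf": data.count(b"\n") - crlf,  # LF not preceded by CR
        "cr": data.count(b"\r") - crlf,  # CR not followed by LF
    }


def has_mixed_line_endings(data: bytes) -> bool:
    """True when more than one line-ending style appears."""
    counts = line_ending_counts(data)
    return sum(1 for n in counts.values() if n > 0) > 1


print(line_ending_counts(b"a\r\nb\nc"))
print(has_mixed_line_endings(b"a\r\nb\nc"))
print(has_mixed_line_endings(b"a\nb\n"))
```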

To add a detector, append an entry in analyze.py and a matching fix in fixes.py. No other call sites change.
