datatools-dev/docs/CLI-REFERENCE.md
Michael 82d7fef21e feat(gate): CSV-normalization gate with confidence-tiered findings
Adds a Review & Normalize page that sits between upload and every tool
page. The analyzer now tags each finding with confidence (high/medium/low)
and a fix_action; the gate auto-applies high-confidence fixes, surfaces
medium/low ones for user review, and blocks tool pages on error-level
findings until resolved or waived.

Core (src/core/):
  - analyze.py: Finding gains confidence, fix_action, pre_applied; new
    detectors for encoding_uncertain, encoding_decode_failed; new top-
    level encoding_override parameter.
  - fixes.py: registry of fix algorithms keyed by fix_action id.
  - normalize.py: auto_fix(), apply_decisions(), is_normalized(), and
    the NormalizationResult / Decision dataclasses the gate consumes.
  - io.py: detect_encoding tries strict UTF-8 first; repair_bytes now
    transcodes UTF-16/32 to UTF-8 before NUL-strip (fixes UTF-16 corruption)
    and normalizes line endings (fixes bare-CR parser crash); empty file
    handled gracefully instead of EmptyDataError traceback.

GUI (src/gui/):
  - pages/0_Review.py: gate page with per-finding decision controls,
    encoding override picker (16 codepages + custom), and Advanced output
    options (encoding, delimiter, line terminator) on the download.
  - components.py: require_normalization_gate() helper.
  - pages/1-9: gate guard wired on every tool page.

Test corpora:
  - test-cases/encodings-corpus/: 31 encoded CSV fixtures + 9 reference
    UTF-8 files + manifest, synced from Business/DataTools.
  - test-cases/text-cleaner-corpus/test_data/17: synced malformed input
    (unquoted $1,500.00) for the unquoted-delimiter detector.

Tests (94 new):
  - test_normalize.py (48): finding fields, fix registry, auto_fix scope,
    decision paths, gate idempotency, output-options helper.
  - test_encodings_corpus.py (90, 16 xfailed): parametric detection +
    decode + analyzer-no-crash sweep against the manifest.
  - test_analyze.py: encoding override + encoding_uncertain detectors.
  - test_corpus.py: pre-parse repair in the strict reader.

run_tests.py: new aliases --tool normalize, --tool encodings, --tool gate;
encodings corpus added to --fixtures category.

Docs: USER-GUIDE §3.3 covers the gate workflow, encoding override, and
output options; TECHNICAL §10.2.1-10.2.4 documents the analyzer schema,
gate API, Review page, and pre-parse repair pipeline; CLI-REFERENCE adds
the analyzer JSON schema with the new fields; README links to all of it.

Suite: 765 passed, 17 xfailed (was 458 passed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 20:35:27 +00:00

CLI Reference

Complete command-line reference for the DataTools bundle.

DataTools ships separate CLI modules so each script can be invoked independently:

| Module | Command | Purpose |
| --- | --- | --- |
| src.cli | python -m src.cli INPUT_FILE [OPTIONS] | Deduplicator (script 01) |
| src.cli_text_clean | python -m src.cli_text_clean INPUT_FILE [OPTIONS] | Text cleaner (script 02) |

The deduplicator reference comes first, followed by the Text Cleaner CLI section. The upload-time analyzer (python -m src.cli_analyze) is documented at the end, under Analyzer (upload-time scan).

Deduplicator

python -m src.cli INPUT_FILE [OPTIONS]

Arguments

| Argument | Required | Description |
| --- | --- | --- |
| INPUT_FILE | Yes | Path to the CSV, delimited text, or Excel file to deduplicate |

Options

Core

| Flag | Short | Default | Description |
| --- | --- | --- | --- |
| --apply | | false | Write output files. Without this flag, only a preview is shown. |
| --output | -o | {input}_deduplicated.csv | Output file path. |

Column Selection

| Flag | Short | Default | Description |
| --- | --- | --- | --- |
| --subset | -s | auto-detect | Comma-separated columns to match on. When omitted, columns are auto-detected by name pattern (email, phone, name, address). |
| --key | -k | none | Comma-separated strong-key columns. Each becomes an independent exact-match strategy. Use for identifiers like fb_id, ein, sku. |

Fuzzy Matching

| Flag | Short | Default | Description |
| --- | --- | --- | --- |
| --fuzzy | | none | Comma-separated columns to fuzzy-match. Other columns in the strategy use exact matching. |
| --algorithm | -a | jaro_winkler | Fuzzy algorithm: levenshtein, jaro_winkler, or token_set_ratio. |
| --threshold | -t | 85 | Similarity threshold 0-100. Lower values find more matches but increase false positives. |

Normalization

| Flag | Short | Default | Description |
| --- | --- | --- | --- |
| --normalize | | auto-detect | Column normalizers as col:type pairs, comma-separated. Types: email, phone, name, address, string. |

Normalizer details:

| Type | What it does | Example |
| --- | --- | --- |
| email | Lowercase, strip Gmail dots, strip +tag suffixes | John.Doe+tag@gmail.com → johndoe@gmail.com |
| phone | Parse to E.164 format; fallback: digits only | (555) 123-4567 → +15551234567 |
| name | Strip titles (Dr., Mr.) and suffixes (Jr., PhD), case-fold | Dr. John Smith Jr. → john smith |
| address | USPS abbreviations (Street→St, Avenue→Ave), case-fold | 123 Main Street, Suite 4 → 123 main st ste 4 |
| string | Trim, collapse whitespace, case-fold | HELLO WORLD → hello world |
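
As an illustration of the email row above, here is a minimal sketch of such a normalizer. The function name normalize_email is hypothetical, and the choices to strip +tag suffixes for every domain and to strip dots only for gmail.com are assumptions drawn from the table wording, not from the DataTools source.

```python
def normalize_email(value: str) -> str:
    """Lowercase, drop a +tag suffix, and drop dots in Gmail local parts.

    Hypothetical helper: a sketch of the behavior described in the table,
    not the actual DataTools implementation.
    """
    email = value.strip().lower()
    local, sep, domain = email.rpartition("@")
    if not sep:
        # No "@" present: treat as a non-address and only lowercase it.
        return email
    local = local.split("+", 1)[0]      # strip the +tag suffix (assumed for all domains)
    if domain == "gmail.com":
        local = local.replace(".", "")  # strip dots only for Gmail (assumption)
    return f"{local}@{domain}"


print(normalize_email("John.Doe+tag@gmail.com"))  # johndoe@gmail.com
```

Normalizing before comparison is what lets two spellings of the same mailbox fall into the same exact-match bucket.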

Survivor Selection

| Flag | Short | Default | Description |
| --- | --- | --- | --- |
| --survivor | | first | Which row to keep per duplicate group. |
| --date-column | | none | Date column for the most-recent rule. |
| --merge | | false | Fill missing fields in the surviving row from removed duplicates. |

Survivor rules:

| Rule | Behavior |
| --- | --- |
| first | Keep the first row encountered (lowest row number) |
| last | Keep the last row encountered (highest row number) |
| most-complete | Keep the row with the fewest blank/empty cells |
| most-recent | Keep the row with the latest date (requires --date-column) |
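
The four rules can be sketched over a duplicate group held as a list of dicts. The function name pick_survivor, the tie-break toward the earliest row, and the assumption that dates are ISO-8601 strings are all illustrative choices, not documented DataTools behavior.

```python
from datetime import datetime


def pick_survivor(rows, rule="first", date_column=None):
    """Return the index of the row to keep within one duplicate group.

    Hypothetical sketch: ties go to the earliest row, and dates are
    assumed to be ISO-8601 strings that are present on every row.
    """
    def filled(row):
        # Count cells that are neither None nor the empty string.
        return sum(1 for value in row.values() if value not in (None, ""))

    indexes = range(len(rows))
    if rule == "first":
        return 0
    if rule == "last":
        return len(rows) - 1
    if rule == "most-complete":
        return max(indexes, key=lambda i: (filled(rows[i]), -i))
    if rule == "most-recent":
        if date_column is None:
            raise ValueError("most-recent requires a date column")
        return max(
            indexes,
            key=lambda i: (datetime.fromisoformat(rows[i][date_column]), -i),
        )
    raise ValueError(f"unknown survivor rule: {rule}")
```

The `-i` term in each key makes `max` prefer the lower row number when two rows are equally complete or equally recent.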

Interactive Review

| Flag | Short | Default | Description |
| --- | --- | --- | --- |
| --review | | false | Interactively review each match group. For each group, choose: merge (y), keep both (n), or skip remaining (s). |

Configuration

| Flag | Short | Default | Description |
| --- | --- | --- | --- |
| --config | | none | Load all settings from a saved JSON config file. |
| --save-config | | none | Save current settings to a JSON config file for reuse. |

File Handling

| Flag | Short | Default | Description |
| --- | --- | --- | --- |
| --sheet | | first sheet | Excel sheet name or 0-based index. Ignored for CSV files. |
| --encoding | | auto-detect | Override auto-detected file encoding (e.g., utf-8, windows-1252). |
| --header-row | | auto-detect | 0-based row index for the header row. |

Recipes

1. Basic Dedup (Auto-Detect)

Let the engine detect email, phone, name, and address columns automatically.

# Preview
python -m src.cli customers.csv

# Apply
python -m src.cli customers.csv --apply

The engine scans column names for patterns like email, phone, name, address and builds strategies automatically. Strong keys (email, phone) become standalone strategies; weak keys (name, address) are paired with strong keys.

2. Fuzzy Name Matching

Match rows where names are similar but not identical — catches typos, nickname variations, and formatting differences.

# Fuzzy-match on the "name" column at 80% similarity
python -m src.cli customers.csv --fuzzy name --threshold 80 --apply

# Fuzzy-match on multiple columns
python -m src.cli customers.csv --fuzzy name,address --threshold 85 --apply

# Use Levenshtein distance instead of Jaro-Winkler
python -m src.cli customers.csv --fuzzy name --algorithm levenshtein --threshold 80 --apply

Algorithm comparison:

  • jaro_winkler (default) — best for short strings like names; weights early characters more heavily
  • levenshtein — edit-distance ratio; works well for typos and transpositions
  • token_set_ratio — best for addresses and long strings; ignores word order
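
To make the 0-100 threshold concrete, here is a small pure-Python sketch of an edit-distance ratio. The normalization used (1 minus distance divided by the longer length) is one common convention and an assumption here; the ratio DataTools actually computes for levenshtein may differ.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Similarity on a 0-100 scale (assumed normalization, see lead-in)."""
    if not a and not b:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / max(len(a), len(b)))
```

Under this convention "jon smith" and "john smith" differ by one insertion over ten characters, so they score roughly 90 and would pass a threshold of 85 but not 95.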

3. Custom Strong Keys

Use specific identifier columns to find exact duplicates.

# Deduplicate by Facebook ID
python -m src.cli donors.csv --key fb_id --apply

# Multiple strong keys (each is independent — matched with OR)
python -m src.cli donors.csv --key fb_id,ein --apply

Strong keys are OR'd: a match on fb_id alone OR ein alone marks rows as duplicates.
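
The OR semantics can be sketched with a union-find over row indexes: rows that share a value on any one key end up in the same group. The function name group_by_strong_keys is hypothetical, and whether DataTools chains matches transitively (A matches B on fb_id, B matches C on ein, so A, B, C form one group) is an assumption of this sketch.

```python
from collections import defaultdict


def group_by_strong_keys(rows, keys):
    """Group row indexes that share a non-empty value on ANY of the keys."""
    parent = list(range(len(rows)))

    def find(i):
        # Follow parent links to the group root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[max(root_a, root_b)] = min(root_a, root_b)

    for key in keys:
        first_seen = {}
        for index, row in enumerate(rows):
            value = row.get(key)
            if not value:
                continue  # blank keys never match each other
            if value in first_seen:
                union(first_seen[value], index)
            else:
                first_seen[value] = index

    groups = defaultdict(list)
    for index in range(len(rows)):
        groups[find(index)].append(index)
    return [members for members in groups.values() if len(members) > 1]
```

Each key is scanned independently, which mirrors the "each is independent" wording: no single row pair has to agree on both fb_id and ein.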

4. Merge Mode

Keep the most complete row and fill any remaining blanks from the duplicates.

# Most complete row + merge missing fields
python -m src.cli contacts.csv --survivor most-complete --merge --apply

# Keep most recent row and merge
python -m src.cli contacts.csv --survivor most-recent --date-column updated_at --merge --apply

How merge works: The survivor row keeps all its non-empty fields. For any blank/null fields, the engine fills from the removed rows (in row order). The result is a single row with maximum data retention.
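
A minimal sketch of the fill step described above, with rows as plain dicts. The name merge_group is hypothetical, and treating only None and the empty string as blank is an assumption.

```python
def merge_group(survivor, removed):
    """Fill blank survivor fields from removed rows, in row order.

    Hypothetical helper: "blank" here means None or "" (an assumption).
    The survivor dict is copied, not modified in place.
    """
    merged = dict(survivor)
    for column in list(merged):
        if merged[column] not in (None, ""):
            continue  # the survivor's own value always wins
        for row in removed:
            candidate = row.get(column)
            if candidate not in (None, ""):
                merged[column] = candidate  # first non-blank donor wins
                break
    return merged
```

Because donors are scanned in row order and the loop stops at the first non-blank value, the result is deterministic for a given input order.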

5. Multi-Column Subset

Match on a specific combination of columns rather than auto-detecting.

# Exact match on email + phone only
python -m src.cli customers.csv --subset email,phone --apply

# Mix exact and fuzzy within a subset
python -m src.cli customers.csv --subset email,name --fuzzy name --threshold 85 --apply

When using --subset, all listed columns must match (AND logic) for a pair to be considered duplicates.

6. Save and Load Config Profiles

Save your settings for repeatable runs on similar files.

# Save settings to a file
python -m src.cli customers.csv --fuzzy name --threshold 80 --merge \
    --survivor most-complete --save-config customer_dedup.json

# Load saved settings
python -m src.cli new_customers.csv --config customer_dedup.json --apply

Config files are JSON. Example:

{
  "strategies": [],
  "survivor_rule": "most_complete",
  "merge": true,
  "default_algorithm": "jaro_winkler",
  "default_threshold": 80.0,
  "fuzzy_columns": ["name"]
}

7. Interactive Review

Step through each match group and decide whether to merge.

python -m src.cli customers.csv --review --apply

For each group, the CLI lists the matched rows and prompts:

============================================================
Match Group 1 — Confidence: 92.3%
Matched on: name, phone
============================================================

  Row 1:
    name: John Smith
    email: john@example.com
    phone: (555) 123-4567

  Row 2:
    name: Jon Smith
    email:
    phone: 555-123-4567

  [y] Merge  [n] Keep both  [s] Skip remaining:
  • y — accept the match; merge/remove duplicate
  • n — reject the match; keep both rows
  • s — skip all remaining groups (keep both for all)

8. Excel Files and Multi-Sheet

Work with Excel files directly — no CSV conversion needed.

# Deduplicate first sheet (default)
python -m src.cli data.xlsx --apply

# Specify sheet by name
python -m src.cli data.xlsx --sheet "Sales Data" --apply

# Specify sheet by index (0-based)
python -m src.cli data.xlsx --sheet 1 --apply

Output is CSV by default. To write Excel output, pass an .xlsx path to -o:

python -m src.cli data.xlsx -o cleaned.xlsx --apply

Auto-Detection Details

When no --subset or --fuzzy flags are provided, the engine scans column names and builds strategies:

| Column pattern | Detection regex | Algorithm | Threshold | Normalizer | Key type |
| --- | --- | --- | --- | --- | --- |
| Email | `e[-_]?mail` | exact | 100% | email | strong |
| Phone | `phone\|telephone\|mobile\|cell` | exact | 100% | phone | strong |
| Name | `^(name\|full_name\|customer_name\|...)$` | jaro_winkler | 85% | name | weak |
| Address | `address\|street\|addr` | token_set_ratio | 80% | address | weak |

Strategy building rules:

  • Strong keys → standalone OR strategies (email match alone is enough)
  • Weak keys → paired with each strong key via AND (name match requires email or phone match too)
  • No strong keys found → weak keys promoted to standalone
  • No patterns matched → exact match on all columns (equivalent to drop_duplicates)
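
The rules above can be sketched as a small function that maps column names to strategies, where each strategy is a list of columns that must all match. The names PATTERNS, classify, and build_strategies are hypothetical; case-insensitive matching and first-pattern-wins ordering are assumptions; and the name pattern uses only the three alternatives visible in the table, since the source elides the rest.

```python
import re

# (pattern, key type) in priority order. The name alternatives are limited to
# the three shown in the table; the real list is elided in the source.
PATTERNS = [
    (re.compile(r"e[-_]?mail", re.IGNORECASE), "strong"),
    (re.compile(r"phone|telephone|mobile|cell", re.IGNORECASE), "strong"),
    (re.compile(r"^(name|full_name|customer_name)$", re.IGNORECASE), "weak"),
    (re.compile(r"address|street|addr", re.IGNORECASE), "weak"),
]


def classify(column):
    """Return "strong", "weak", or None for a column name."""
    for pattern, key_type in PATTERNS:
        if pattern.search(column):
            return key_type
    return None


def build_strategies(columns):
    """Return a list of strategies; each is a list of AND-ed columns."""
    strong = [c for c in columns if classify(c) == "strong"]
    weak = [c for c in columns if classify(c) == "weak"]
    if strong:
        # Strong keys stand alone; each weak key is paired with each strong key.
        return [[s] for s in strong] + [[w, s] for w in weak for s in strong]
    if weak:
        return [[w] for w in weak]  # promoted to standalone
    return [list(columns)]          # exact match on all columns
```

Checking the email pattern first means a column such as email_address is classified as a strong email key rather than a weak address key.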

Output Files

When --apply is set, three files are written:

| File | Description |
| --- | --- |
| {stem}_deduplicated.csv | Cleaned DataFrame with duplicates removed |
| {stem}_removed.csv | Rows that were removed |
| {stem}_match_groups.csv | Audit trail with _group_id, _is_survivor, _confidence, _matched_on, _original_row, plus all original columns |
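
As a quick way to inspect the audit trail, the following sketch counts rows per match group using only the standard library. It relies solely on the documented _group_id column; the function name group_sizes and the sample data (including the True/False spelling of _is_survivor) are invented for illustration.

```python
import csv
import io
from collections import Counter


def group_sizes(csv_text):
    """Count audit rows per _group_id in match-groups CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["_group_id"] for row in reader)


# Invented sample; real files also carry the other audit and original columns.
sample = (
    "_group_id,_is_survivor,name\n"
    "1,True,John Smith\n"
    "1,False,Jon Smith\n"
    "2,True,Ann Lee\n"
    "2,False,Anne Lee\n"
    "2,False,Ann  Lee\n"
)
print(group_sizes(sample).most_common(1))  # [('2', 3)]
```

For a file on disk, the text can be read with `open(path, newline="", encoding="utf-8").read()` and passed to the same function.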

Logging

Every run writes a timestamped log to logs/dedup_YYYYMMDD_HHMMSS.log with full debug-level details: strategies used, pair comparisons, survivor decisions, and merge actions.


Text Cleaner CLI

Character-level hygiene for CSV / Excel files: whitespace trim and collapse, smart-character folding, Unicode normalization, BOM strip, control-char strip, line-ending normalization, optional case conversion. See TECHNICAL.md Section 10.2 for the full functional spec.

python -m src.cli_text_clean INPUT_FILE [OPTIONS]

Arguments

| Argument | Required | Description |
| --- | --- | --- |
| INPUT_FILE | Yes | Path to the CSV, TSV, or Excel file to clean |

Options

Core

| Flag | Short | Default | Description |
| --- | --- | --- | --- |
| --apply | | false | Write output files. Without this flag, only a preview is shown. |
| --output | -o | {input}_cleaned.csv | Output file path. |
| --preset | | excel-hygiene | Preset bundle of safe defaults. See Presets. |

Scope

| Flag | Default | Description |
| --- | --- | --- |
| --columns | all string columns | Comma-separated columns to clean. |
| --skip | none | Comma-separated columns to skip even if they look like text. Useful for free-text notes columns you don't want touched. |

Per-operation toggles

These override the active preset.

| Flag | Effect |
| --- | --- |
| --no-trim | Disable leading/trailing whitespace strip |
| --no-collapse | Disable internal whitespace collapse |
| --no-nfc | Disable Unicode NFC normalization |
| --nfkc | Enable NFKC compatibility fold (lossy: for example, ① → 1, ﬁ → fi) |
| --no-smart-chars | Disable smart-character folding (curly quotes, em/en-dash, NBSP, ellipsis) |
| --no-zero-width | Disable zero-width / invisible character strip |
| --no-bom | Disable leading BOM strip |
| --no-control | Disable control-character strip |
| --no-line-endings | Disable line-ending normalization |

Case conversion

| Flag | Forms | Description |
| --- | --- | --- |
| --case | upper, lower, title, sentence | Apply this case to every selected column |
| --case | mode:col[,mode:col] | Per-column case (e.g., --case title:name,upper:code) |

Title case preserves all-caps tokens (USA stays USA) and lowercases mid-string particles (of, and, the, etc.).
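
A minimal sketch of that title-case rule follows. The names PARTICLES and title_case are hypothetical, the particle set contains only the three words named above (the source's "etc." is not expanded), and splitting on single spaces is a simplifying assumption.

```python
# Only the particles named in the text; the full list is not given in the source.
PARTICLES = {"of", "and", "the"}


def title_case(text: str) -> str:
    """Title-case text, keeping all-caps tokens and lowercasing mid-string particles."""
    result = []
    for index, word in enumerate(text.split(" ")):
        if index > 0 and word.lower() in PARTICLES:
            result.append(word.lower())        # mid-string particle
        elif len(word) > 1 and word.isupper():
            result.append(word)                # all-caps token such as USA
        else:
            result.append(word[:1].upper() + word[1:].lower())
    return " ".join(result)


print(title_case("the university of texas and USA"))
# The University of Texas and USA
```

The particle check runs before the all-caps check so that a shouted "OF" in mid-string is still lowercased; input that is entirely upper case is left as-is by this sketch, which may or may not match the tool.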

Audit and config

| Flag | Default | Description |
| --- | --- | --- |
| --full-changelog | false | Write every cell change to the audit CSV (default caps to first 1000). |
| --config | none | Load options from a saved JSON config file. |
| --save-config | none | Save the current options to a JSON config file. |

File format / encoding

| Flag | Default | Description |
| --- | --- | --- |
| --sheet | 0 | Excel sheet name or 0-based index. |
| --encoding | auto-detect | Override auto-detected file encoding. |
| --header-row | auto-detect | 0-based row index for the header. |

Presets

| Preset | What it does |
| --- | --- |
| minimal | Trim + collapse whitespace only. Nothing else. |
| excel-hygiene (default) | Trim, collapse, NFC, smart-character fold, zero-width strip, BOM strip, control strip, line-ending normalize. NFKC off. |
| paranoid | All of excel-hygiene plus NFKC compatibility fold (lossy). |
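
To show roughly what several of these operations do to a single cell, here is a sketch built on the standard library. The function name clean_cell, the replacement chosen for each smart character (for instance, both dashes becoming a plain hyphen), the list of zero-width code points, and the order of operations are all assumptions; BOM, control-character, and line-ending handling are left out.

```python
import re
import unicodedata

# Assumed replacements for the "smart" characters listed in the options table.
SMART_CHARS = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en dash, em dash
    "\u00a0": " ",                  # no-break space
    "\u2026": "...",                # ellipsis
})
# Assumed set of zero-width / invisible code points; mapped to None to delete them.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))


def clean_cell(text: str, nfkc: bool = False) -> str:
    """Apply normalization, invisible-character strip, smart fold, collapse, trim."""
    text = unicodedata.normalize("NFKC" if nfkc else "NFC", text)
    text = text.translate(ZERO_WIDTH)
    text = text.translate(SMART_CHARS)
    text = re.sub(r"[ \t]+", " ", text)  # collapse runs of spaces and tabs
    return text.strip()


print(clean_cell("  \u201cHello\u201d\u00a0 world\u200b "))  # "Hello" world
```

Passing `nfkc=True` corresponds to the lossy compatibility fold: the ligature U+FB01 becomes the two letters "fi", which cannot be reversed afterwards.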

Output Files

When --apply is set:

| File | Description |
| --- | --- |
| {stem}_cleaned.csv | Cleaned DataFrame |
| {stem}_changes.csv | Per-cell audit: row, column, old, new, ops_applied (capped to 1000 rows by default; use --full-changelog for all) |

A timestamped log is always written to logs/text_clean_YYYYMMDD_HHMMSS.log.

Recipes

# Preview what would change with the safe defaults
python -m src.cli_text_clean messy.csv

# Apply the safe defaults
python -m src.cli_text_clean messy.csv --apply

# Just the basics — only trim and collapse, leave Unicode/quotes alone
python -m src.cli_text_clean messy.csv --preset minimal --apply

# Title-case the name column, upper-case the SKU column, leave others alone for case
python -m src.cli_text_clean people.csv --case title:name,upper:sku --apply

# Clean only specific columns
python -m src.cli_text_clean orders.csv --columns vendor,product --apply

# Skip a free-text notes column from cleaning
python -m src.cli_text_clean tickets.csv --skip notes --apply

# Save the current settings as a profile and reload it later
python -m src.cli_text_clean messy.csv --preset minimal --case upper --save-config my.json
python -m src.cli_text_clean other.csv --config my.json --apply

Analyzer (upload-time scan)

python -m src.cli_analyze INPUT_FILE [OPTIONS]

  --sample-rows N       Cap on rows scanned (default 1000)
  --json                Print findings as a JSON array on stdout
  --strict              Exit non-zero on any warn/error finding

JSON output schema (one object per finding):

{
  "id": "smart_punctuation_in_data",
  "severity": "warn",
  "confidence": "high",
  "fix_action": "fold_smart_punctuation",
  "pre_applied": false,
  "tool": "02_text_cleaner",
  "count": 17,
  "description": "17 cell(s) contain curly quotes…",
  "column": null,
  "samples": [{"row": 3, "column": "name", "value": "“Alice”"}]
}

  • severity: info / warn / error. Only error blocks the GUI normalization gate.
  • confidence: high (round-trip-safe, eligible for one-click auto-fix), medium (preview before applying), low (heuristic, opt-in only).
  • fix_action: stable id naming the algorithm in src/core/fixes.py that resolves the finding. Empty string for informational-only findings.
  • pre_applied: true for fixes already applied during the byte-level read pass (BOM strip, NUL strip, line-ending normalize, byte-level smart-quote fold, transcode-to-UTF-8 from UTF-16/32). The GUI gate treats these as already-resolved; the CLI emits them so callers can audit what changed during read.

The detector set covers smart punctuation, NBSP / Unicode whitespace, zero-width characters, dirty headers, whitespace padding, null-like sentinels, mojibake fingerprints (UTF-8-as-cp1252), mixed-case email columns, near-duplicate rows (case-and-padding stripped), leading-zero IDs (Excel hazard), mixed line endings, encoding decode failure (encoding_decode_failed), and U+FFFD presence in the loaded text (encoding_uncertain). New detectors plug in by appending one entry to analyze.py and one matching fix in fixes.py.
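
A caller consuming the --json output might sort findings into buckets along the lines of the field descriptions above. This is a sketch under stated assumptions: the function name triage and the bucket names are hypothetical, the sample array is invented (only the ids smart_punctuation_in_data and encoding_decode_failed come from this document), and the real gate logic lives in the DataTools source, not here.

```python
import json


def triage(findings):
    """Sort analyzer findings into illustrative buckets by field values."""
    result = {"already_applied": [], "auto_fix": [], "needs_review": [], "blocking": []}
    for finding in findings:
        if finding.get("pre_applied"):
            result["already_applied"].append(finding["id"])
            continue  # fixed during the read pass; nothing left to decide
        if finding.get("severity") == "error":
            result["blocking"].append(finding["id"])
        if not finding.get("fix_action"):
            continue  # informational only, no fix to offer
        if finding.get("confidence") == "high":
            result["auto_fix"].append(finding["id"])
        else:
            result["needs_review"].append(finding["id"])
    return result


# Invented sample resembling the schema; "example_read_pass_fix" and
# "example_fix" are placeholder names, not real detector or fix ids.
raw = json.dumps([
    {"id": "smart_punctuation_in_data", "severity": "warn", "confidence": "high",
     "fix_action": "fold_smart_punctuation", "pre_applied": False},
    {"id": "encoding_decode_failed", "severity": "error", "confidence": "low",
     "fix_action": "", "pre_applied": False},
    {"id": "example_read_pass_fix", "severity": "info", "confidence": "high",
     "fix_action": "example_fix", "pre_applied": True},
])
print(triage(json.loads(raw)))
```

In a script, the same function could be fed the parsed stdout of the analyzer run with --json.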