Michael 82d7fef21e feat(gate): CSV-normalization gate with confidence-tiered findings

Adds a Review & Normalize page that sits between upload and every tool
page. The analyzer now tags each finding with confidence (high/medium/low)
and a fix_action; the gate auto-applies high-confidence fixes, surfaces
medium/low ones for user review, and blocks tool pages on error-level
findings until resolved or waived.

Core (src/core/):
  - analyze.py: Finding gains confidence, fix_action, pre_applied; new
    detectors for encoding_uncertain, encoding_decode_failed; new top-
    level encoding_override parameter.
  - fixes.py: registry of fix algorithms keyed by fix_action id.
  - normalize.py: auto_fix(), apply_decisions(), is_normalized(), and
    the NormalizationResult / Decision dataclasses the gate consumes.
  - io.py: detect_encoding tries strict UTF-8 first; repair_bytes now
    transcodes UTF-16/32 to UTF-8 before NUL-strip (fixes UTF-16 corruption)
    and normalizes line endings (fixes bare-CR parser crash); empty file
    handled gracefully instead of EmptyDataError traceback.

GUI (src/gui/):
  - pages/0_Review.py: gate page with per-finding decision controls,
    encoding override picker (16 codepages + custom), and Advanced output
    options (encoding, delimiter, line terminator) on the download.
  - components.py: require_normalization_gate() helper.
  - pages/1-9: gate guard wired on every tool page.

Test corpora:
  - test-cases/encodings-corpus/: 31 encoded CSV fixtures + 9 reference
    UTF-8 files + manifest, synced from Business/DataTools.
  - test-cases/text-cleaner-corpus/test_data/17: synced malformed input
    (unquoted $1,500.00) for the unquoted-delimiter detector.

Tests (94 new):
  - test_normalize.py (48): finding fields, fix registry, auto_fix scope,
    decision paths, gate idempotency, output-options helper.
  - test_encodings_corpus.py (90, 16 xfailed): parametric detection +
    decode + analyzer-no-crash sweep against the manifest.
  - test_analyze.py: encoding override + encoding_uncertain detectors.
  - test_corpus.py: pre-parse repair in the strict reader.

run_tests.py: new aliases --tool normalize, --tool encodings, --tool gate;
encodings corpus added to --fixtures category.

Docs: USER-GUIDE §3.3 covers the gate workflow, encoding override, and
output options; TECHNICAL §10.2.1-10.2.4 documents the analyzer schema,
gate API, Review page, and pre-parse repair pipeline; CLI-REFERENCE adds
the analyzer JSON schema with the new fields; README links to all of it.

Suite: 765 passed, 17 xfailed (was 458 passed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-29 20:35:27 +00:00

CLI Reference

Complete command-line reference for the DataTools bundle.

DataTools ships a separate CLI module per tool so each can be invoked independently:

| Module | Command | Purpose |
| --- | --- | --- |
| `src.cli` | `python -m src.cli INPUT_FILE [OPTIONS]` | Deduplicator (script 01) |
| `src.cli_text_clean` | `python -m src.cli_text_clean INPUT_FILE [OPTIONS]` | Text cleaner (script 02) |
| `src.cli_analyze` | `python -m src.cli_analyze INPUT_FILE [OPTIONS]` | Analyzer (upload-time scan) |

The deduplicator reference comes first, followed by the Text Cleaner CLI and Analyzer sections.

Deduplicator

python -m src.cli INPUT_FILE [OPTIONS]

Arguments

| Argument | Required | Description |
| --- | --- | --- |
| `INPUT_FILE` | Yes | Path to the CSV, delimited text, or Excel file to deduplicate |

Options

Core

| Flag | Short | Default | Description |
| --- | --- | --- | --- |
| `--apply` | | `false` | Write output files. Without this flag, only a preview is shown. |
| `--output` | `-o` | `{input}_deduplicated.csv` | Output file path. |

Column Selection

| Flag | Short | Default | Description |
| --- | --- | --- | --- |
| `--subset` | `-s` | auto-detect | Comma-separated columns to match on. When omitted, columns are auto-detected by name pattern (email, phone, name, address). |
| `--key` | `-k` | none | Comma-separated strong-key columns. Each becomes an independent exact-match strategy. Use for identifiers like `fb_id`, `ein`, `sku`. |

Fuzzy Matching

| Flag | Short | Default | Description |
| --- | --- | --- | --- |
| `--fuzzy` | | none | Comma-separated columns to fuzzy-match. Other columns in the strategy use exact matching. |
| `--algorithm` | `-a` | `jaro_winkler` | Fuzzy algorithm: `levenshtein`, `jaro_winkler`, or `token_set_ratio`. |
| `--threshold` | `-t` | `85` | Similarity threshold 0-100. Lower values find more matches but increase false positives. |

Normalization

| Flag | Short | Default | Description |
| --- | --- | --- | --- |
| `--normalize` | | auto-detect | Column normalizers as `col:type` pairs, comma-separated. Types: `email`, `phone`, `name`, `address`, `string`. |

Normalizer details:

| Type | What it does | Example |
| --- | --- | --- |
| `email` | Lowercase, strip Gmail dots, strip `+tag` suffixes | `John.Doe+tag@gmail.com` → `johndoe@gmail.com` |
| `phone` | Parse to E.164 format; fallback: digits only | `(555) 123-4567` → `+15551234567` |
| `name` | Strip titles (Dr., Mr.) and suffixes (Jr., PhD), case-fold | `Dr. John Smith Jr.` → `john smith` |
| `address` | USPS abbreviations (Street→St, Avenue→Ave), case-fold | `123 Main Street, Suite 4` → `123 main st ste 4` |
| `string` | Trim, collapse whitespace, case-fold | `HELLO WORLD` → `hello world` |
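
To make the normalizer table concrete, here is a minimal sketch of the `email` and `string` rules. It is not the project's implementation: the function names are hypothetical, and it assumes `+tag` stripping applies to every domain while dot stripping applies only to `gmail.com`.

```python
import re


def normalize_string(value: str) -> str:
    """Trim, collapse internal whitespace, and case-fold."""
    return re.sub(r"\s+", " ", value).strip().casefold()


def normalize_email(value: str) -> str:
    """Lowercase, drop a +tag suffix, and drop dots for Gmail addresses."""
    cleaned = value.strip().lower()
    local, sep, domain = cleaned.rpartition("@")
    if not sep:
        return cleaned  # no "@": nothing email-specific to do
    local = local.split("+", 1)[0]  # assumption: +tag is stripped for any domain
    if domain == "gmail.com":       # assumption: dots are only insignificant on Gmail
        local = local.replace(".", "")
    return f"{local}@{domain}"


print(normalize_email("John.Doe+tag@gmail.com"))  # johndoe@gmail.com
print(normalize_string("HELLO WORLD"))            # hello world
```

Two rows whose normalized values are equal would then compare as an exact match even though the raw cells differ.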

Survivor Selection

| Flag | Default | Description |
| --- | --- | --- |
| `--survivor` | `first` | Which row to keep per duplicate group. |
| `--date-column` | none | Date column for the `most-recent` rule. |
| `--merge` | `false` | Fill missing fields in the surviving row from removed duplicates. |

Survivor rules:

| Rule | Behavior |
| --- | --- |
| `first` | Keep the first row encountered (lowest row number) |
| `last` | Keep the last row encountered (highest row number) |
| `most-complete` | Keep the row with the fewest blank/empty cells |
| `most-recent` | Keep the row with the latest date (requires `--date-column`) |
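
The four rules can be sketched over a duplicate group held as a list of dicts in row order. This is an illustration only: `pick_survivor` and `is_blank` are hypothetical names, ties are assumed to go to the earliest row, and dates are assumed to be ISO 8601 strings.

```python
from datetime import date


def is_blank(value) -> bool:
    """Treat None and whitespace-only strings as empty cells."""
    return value is None or str(value).strip() == ""


def pick_survivor(rows, rule="first", date_column=None) -> int:
    """Return the index (within `rows`) of the row to keep."""
    if rule == "first":
        return 0
    if rule == "last":
        return len(rows) - 1
    if rule == "most-complete":
        # Fewest blank cells wins; min() returns the earliest row on ties.
        return min(range(len(rows)),
                   key=lambda i: sum(is_blank(v) for v in rows[i].values()))
    if rule == "most-recent":
        if date_column is None:
            raise ValueError("most-recent requires a date column")
        # max() returns the earliest row among equal dates.
        return max(range(len(rows)),
                   key=lambda i: date.fromisoformat(rows[i][date_column]))
    raise ValueError(f"unknown survivor rule: {rule}")


group = [
    {"name": "John Smith", "email": "", "updated_at": "2024-03-01"},
    {"name": "John Smith", "email": "john@example.com", "updated_at": "2024-01-15"},
]
print(pick_survivor(group, "most-complete"))              # 1
print(pick_survivor(group, "most-recent", "updated_at"))  # 0
```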

Interactive Review

| Flag | Short | Default | Description |
| --- | --- | --- | --- |
| `--review` | | `false` | Interactively review each match group. For each group, choose: merge (y), keep both (n), or skip remaining (s). |

Configuration

| Flag | Short | Default | Description |
| --- | --- | --- | --- |
| `--config` | | none | Load all settings from a saved JSON config file. |
| `--save-config` | | none | Save current settings to a JSON config file for reuse. |

File Handling

| Flag | Default | Description |
| --- | --- | --- |
| `--sheet` | first sheet | Excel sheet name or 0-based index. Ignored for CSV files. |
| `--encoding` | auto-detect | Override auto-detected file encoding (e.g., `utf-8`, `windows-1252`). |
| `--header-row` | auto-detect | 0-based row index for the header row. |

Recipes

1. Basic Dedup (Auto-Detect)

Let the engine detect email, phone, name, and address columns automatically.

# Preview
python -m src.cli customers.csv

# Apply
python -m src.cli customers.csv --apply

The engine scans column names for patterns like email, phone, name, address and builds strategies automatically. Strong keys (email, phone) become standalone strategies; weak keys (name, address) are paired with strong keys.

2. Fuzzy Name Matching

Match rows where names are similar but not identical — catches typos, nickname variations, and formatting differences.

# Fuzzy-match on the "name" column at 80% similarity
python -m src.cli customers.csv --fuzzy name --threshold 80 --apply

# Fuzzy-match on multiple columns
python -m src.cli customers.csv --fuzzy name,address --threshold 85 --apply

# Use Levenshtein distance instead of Jaro-Winkler
python -m src.cli customers.csv --fuzzy name --algorithm levenshtein --threshold 80 --apply

Algorithm comparison:

jaro_winkler (default) — best for short strings like names; weights early characters more heavily
levenshtein — edit-distance ratio; works well for typos and transpositions
token_set_ratio — best for addresses and long strings; ignores word order
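
To show how a 0-100 threshold interacts with an edit-distance score, here is a pure-Python sketch of a Levenshtein ratio. The scaling used (`100 * (1 - distance / longer length)`) is an assumption; the engine may scale its scores differently, and `similarity` and `is_match` are hypothetical names.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed scaling: 100 means identical, 0 means nothing in common."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


def is_match(a: str, b: str, threshold: float = 85) -> bool:
    return similarity(a, b) >= threshold


# One edit over ten characters scores about 90, so it passes the default of 85.
print(is_match("john smith", "jon smith"))
# One edit over four characters scores 75, so it fails even at --threshold 80.
print(is_match("john", "jon", threshold=80))
```

Short strings are penalized heavily by a single edit under this scaling, which is one reason a prefix-weighted measure such as Jaro-Winkler is often preferred for names.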

3. Custom Strong Keys

Use specific identifier columns to find exact duplicates.

# Deduplicate by Facebook ID
python -m src.cli donors.csv --key fb_id --apply

# Multiple strong keys (each is independent — matched with OR)
python -m src.cli donors.csv --key fb_id,ein --apply

Strong keys are OR'd: a match on fb_id alone OR ein alone marks rows as duplicates.
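
A small union-find sketch illustrates the OR semantics. It assumes that blank identifiers never match each other and that groups are closed transitively (rows linked through different keys land in one group); neither behavior is confirmed by this reference, and `group_by_any_key` is a hypothetical name.

```python
def group_by_any_key(rows, keys):
    """Group row indices that share a non-blank value in ANY of `keys`."""
    parent = list(range(len(rows)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)  # lowest row index becomes the root

    for key in keys:
        first_seen = {}
        for i, row in enumerate(rows):
            value = row.get(key)
            if value in (None, ""):
                continue  # assumption: blank identifiers are not a match
            if value in first_seen:
                union(first_seen[value], i)
            else:
                first_seen[value] = i

    groups = {}
    for i in range(len(rows)):
        groups.setdefault(find(i), []).append(i)
    return [members for members in groups.values() if len(members) > 1]


donors = [
    {"fb_id": "100", "ein": "12-345"},
    {"fb_id": "100", "ein": ""},        # matches row 0 on fb_id
    {"fb_id": "", "ein": "12-345"},     # matches row 0 on ein
    {"fb_id": "200", "ein": "99-999"},  # matches nothing
]
print(group_by_any_key(donors, ["fb_id", "ein"]))  # [[0, 1, 2]]
```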

4. Merge Mode

Keep the most complete row and fill any remaining blanks from the duplicates.

# Most complete row + merge missing fields
python -m src.cli contacts.csv --survivor most-complete --merge --apply

# Keep most recent row and merge
python -m src.cli contacts.csv --survivor most-recent --date-column updated_at --merge --apply

How merge works: The survivor row keeps all its non-empty fields. For any blank/null fields, the engine fills from the removed rows (in row order). The result is a single row with maximum data retention.
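
The merge step described above can be sketched as follows, with rows as plain dicts. `merge_group` and `is_blank` are hypothetical names, and treating whitespace-only strings as blank is an assumption.

```python
def is_blank(value) -> bool:
    """Treat None and whitespace-only strings as empty cells."""
    return value is None or str(value).strip() == ""


def merge_group(survivor: dict, removed: list) -> dict:
    """Fill the survivor's blank fields from removed rows, in row order."""
    merged = dict(survivor)  # never mutate the caller's row
    for column in list(merged):
        if not is_blank(merged[column]):
            continue  # the survivor's own non-empty values always win
        for row in removed:
            candidate = row.get(column)
            if not is_blank(candidate):
                merged[column] = candidate
                break  # first non-blank value in row order
    return merged


survivor = {"name": "Jon Smith", "email": "", "phone": "555-123-4567"}
removed = [{"name": "John Smith", "email": "john@example.com", "phone": "(555) 123-4567"}]
print(merge_group(survivor, removed))
# {'name': 'Jon Smith', 'email': 'john@example.com', 'phone': '555-123-4567'}
```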

5. Multi-Column Subset

Match on a specific combination of columns rather than auto-detecting.

# Exact match on email + phone only
python -m src.cli customers.csv --subset email,phone --apply

# Mix exact and fuzzy within a subset
python -m src.cli customers.csv --subset email,name --fuzzy name --threshold 85 --apply

When using --subset, all listed columns must match (AND logic) for a pair to be considered duplicates.
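
For the exact-match case, the AND logic amounts to grouping rows by the tuple of subset values. The sketch below assumes values have already been normalized and ignores fuzzy columns; `group_by_subset` is a hypothetical name.

```python
from collections import defaultdict


def group_by_subset(rows, subset):
    """Group row indices whose values agree on EVERY column in `subset`."""
    buckets = defaultdict(list)
    for index, row in enumerate(rows):
        buckets[tuple(row[column] for column in subset)].append(index)
    return [members for members in buckets.values() if len(members) > 1]


customers = [
    {"email": "a@example.com", "phone": "+15551234567"},
    {"email": "a@example.com", "phone": "+15551234567"},  # both columns match row 0
    {"email": "a@example.com", "phone": "+15559999999"},  # email matches, phone does not
]
print(group_by_subset(customers, ["email", "phone"]))  # [[0, 1]]
print(group_by_subset(customers, ["email"]))           # [[0, 1, 2]]
```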

6. Save and Load Config Profiles

Save your settings for repeatable runs on similar files.

# Save settings to a file
python -m src.cli customers.csv --fuzzy name --threshold 80 --merge \
    --survivor most-complete --save-config customer_dedup.json

# Load saved settings
python -m src.cli new_customers.csv --config customer_dedup.json --apply

Config files are JSON. Example:

{
  "strategies": [],
  "survivor_rule": "most_complete",
  "merge": true,
  "default_algorithm": "jaro_winkler",
  "default_threshold": 80.0,
  "fuzzy_columns": ["name"]
}
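
If you want to inspect or validate a profile outside the CLI, a sketch like the following loads it into a typed object. The `DedupConfig` class is hypothetical; its field names come from the example above and its defaults from the option tables, and rejecting unknown keys is a design choice of this sketch rather than documented behavior.

```python
import json
from dataclasses import dataclass, field, fields


@dataclass
class DedupConfig:
    """Hypothetical mirror of the JSON profile shown above."""
    strategies: list = field(default_factory=list)
    survivor_rule: str = "first"
    merge: bool = False
    default_algorithm: str = "jaro_winkler"
    default_threshold: float = 85.0
    fuzzy_columns: list = field(default_factory=list)


def load_config(text: str) -> DedupConfig:
    """Parse a JSON profile, failing loudly on keys this sketch does not know."""
    data = json.loads(text)
    known = {f.name for f in fields(DedupConfig)}
    unknown = sorted(set(data) - known)
    if unknown:
        raise ValueError(f"unknown config keys: {unknown}")
    return DedupConfig(**data)


profile = load_config("""
{
  "strategies": [],
  "survivor_rule": "most_complete",
  "merge": true,
  "default_algorithm": "jaro_winkler",
  "default_threshold": 80.0,
  "fuzzy_columns": ["name"]
}
""")
print(profile.survivor_rule, profile.default_threshold)  # most_complete 80.0
```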

7. Interactive Review

Step through each match group and decide whether to merge.

python -m src.cli customers.csv --review --apply

For each group, the CLI lists the matched rows one after another and prompts:

============================================================
Match Group 1 — Confidence: 92.3%
Matched on: name, phone
============================================================

  Row 1:
    name: John Smith
    email: john@example.com
    phone: (555) 123-4567

  Row 2:
    name: Jon Smith
    email:
    phone: 555-123-4567

  [y] Merge  [n] Keep both  [s] Skip remaining:

y — accept the match; merge/remove duplicate
n — reject the match; keep both rows
s — skip all remaining groups (keep both for all)

8. Excel Files and Multi-Sheet

Work with Excel files directly — no CSV conversion needed.

# Deduplicate first sheet (default)
python -m src.cli data.xlsx --apply

# Specify sheet by name
python -m src.cli data.xlsx --sheet "Sales Data" --apply

# Specify sheet by index (0-based)
python -m src.cli data.xlsx --sheet 1 --apply

Output is CSV by default. To write Excel output, pass an .xlsx path to -o:

python -m src.cli data.xlsx -o cleaned.xlsx --apply

Auto-Detection Details

When no --subset or --fuzzy flags are provided, the engine scans column names and builds strategies:

| Column pattern | Detection regex | Algorithm | Threshold | Normalizer | Key type |
| --- | --- | --- | --- | --- | --- |
| Email | `e[-_]?mail` | exact | 100% | email | strong |
| Phone | `phone\|telephone\|mobile\|cell` | exact | 100% | phone | strong |
| Name | `^(name\|full_name\|customer_name\|...)$` | jaro_winkler | 85% | name | weak |
| Address | `address\|street\|addr` | token_set_ratio | 80% | address | weak |

Strategy building rules:

Strong keys → standalone OR strategies (email match alone is enough)
Weak keys → paired with each strong key via AND (name match requires email or phone match too)
No strong keys found → weak keys promoted to standalone
No patterns matched → exact match on all columns (equivalent to drop_duplicates)
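
The strategy-building rules above can be sketched as a small function. Each strategy is a tuple of columns that must all match, and the strategies themselves are OR'd. This is an illustration under assumptions: matching is case-insensitive, the first pattern that matches a column wins, and the name pattern uses only the three names visible in the table because the source list is truncated.

```python
import re

# (regex, key type) pairs from the detection table. The name alternation is
# truncated in the source ("..."), so only the visible names are used here.
PATTERNS = [
    (r"e[-_]?mail", "strong"),
    (r"phone|telephone|mobile|cell", "strong"),
    (r"^(name|full_name|customer_name)$", "weak"),
    (r"address|street|addr", "weak"),
]


def classify(columns):
    """Split column names into strong and weak keys; unmatched names are dropped."""
    strong, weak = [], []
    for column in columns:
        for pattern, kind in PATTERNS:
            if re.search(pattern, column, flags=re.IGNORECASE):
                (strong if kind == "strong" else weak).append(column)
                break  # assumption: first matching pattern wins
    return strong, weak


def build_strategies(columns):
    strong, weak = classify(columns)
    if not strong and not weak:
        return [tuple(columns)]          # exact match on all columns
    if not strong:
        return [(w,) for w in weak]      # weak keys promoted to standalone
    strategies = [(s,) for s in strong]  # each strong key stands alone
    strategies += [(w, s) for w in weak for s in strong]  # weak AND strong
    return strategies


print(build_strategies(["email", "phone", "name"]))
# [('email',), ('phone',), ('name', 'email'), ('name', 'phone')]
```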

Output Files

When --apply is set, three files are written:

| File | Description |
| --- | --- |
| `{stem}_deduplicated.csv` | Cleaned DataFrame with duplicates removed |
| `{stem}_removed.csv` | Rows that were removed |
| `{stem}_match_groups.csv` | Audit trail with `_group_id`, `_is_survivor`, `_confidence`, `_matched_on`, `_original_row`, plus all original columns |

Logging

Every run writes a timestamped log to logs/dedup_YYYYMMDD_HHMMSS.log with full debug-level details: strategies used, pair comparisons, survivor decisions, and merge actions.

Text Cleaner CLI

Character-level hygiene for CSV / Excel files: whitespace trim and collapse, smart-character folding, Unicode normalization, BOM strip, control-char strip, line-ending normalization, optional case conversion. See TECHNICAL.md Section 10.2 for the full functional spec.

python -m src.cli_text_clean INPUT_FILE [OPTIONS]

Arguments

| Argument | Required | Description |
| --- | --- | --- |
| `INPUT_FILE` | Yes | Path to the CSV, TSV, or Excel file to clean |

Options

Core

| Flag | Short | Default | Description |
| --- | --- | --- | --- |
| `--apply` | | `false` | Write output files. Without this flag, only a preview is shown. |
| `--output` | `-o` | `{input}_cleaned.csv` | Output file path. |
| `--preset` | | `excel-hygiene` | Preset bundle of safe defaults. See Presets. |

Scope

| Flag | Default | Description |
| --- | --- | --- |
| `--columns` | all string columns | Comma-separated columns to clean. |
| `--skip` | none | Comma-separated columns to skip even if they look like text. Useful for free-text notes columns you don't want touched. |

Per-operation toggles

These override the active preset.

| Flag | Effect |
| --- | --- |
| `--no-trim` | Disable leading/trailing whitespace strip |
| `--no-collapse` | Disable internal whitespace collapse |
| `--no-nfc` | Disable Unicode NFC normalization |
| `--nfkc` | Enable NFKC compatibility fold (lossy: `①` → `1`, `ﬁ` → `fi`) |
| `--no-smart-chars` | Disable smart-character folding (curly quotes, em/en-dash, NBSP, ellipsis) |
| `--no-zero-width` | Disable zero-width / invisible character strip |
| `--no-bom` | Disable leading BOM strip |
| `--no-control` | Disable control-character strip |
| `--no-line-endings` | Disable line-ending normalization |

Case conversion

| Flag | Forms | Description |
| --- | --- | --- |
| `--case` | `upper`, `lower`, `title`, `sentence` | Apply this case to every selected column |
| `--case` | `mode:col[,mode:col]` | Per-column case (e.g., `--case title:name,upper:code`) |

Title case preserves all-caps tokens (USA stays USA) and lowercases mid-string particles (of, and, the, etc.).
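
A rough sketch of that title-case rule follows. The particle set here contains only the three words named above; the real list behind "etc." is not documented in this reference, and `title_case` is a hypothetical name.

```python
PARTICLES = {"of", "and", "the"}  # assumption: only the particles named in the text


def title_case(text: str) -> str:
    """Capitalize words, keep all-caps tokens, lowercase mid-string particles."""
    result = []
    for index, word in enumerate(text.split(" ")):
        if len(word) > 1 and word.isupper():
            result.append(word)          # preserve acronyms such as USA
        elif index > 0 and word.lower() in PARTICLES:
            result.append(word.lower())  # particles stay lowercase unless first
        else:
            result.append(word[:1].upper() + word[1:].lower())
    return " ".join(result)


print(title_case("bank of the USA"))      # Bank of the USA
print(title_case("the lord of the rings"))  # The Lord of the Rings
```

One consequence of preserving all-caps tokens is that a fully upper-case cell passes through this sketch unchanged.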

Audit and config

| Flag | Default | Description |
| --- | --- | --- |
| `--full-changelog` | `false` | Write every cell change to the audit CSV (default caps to first 1000). |
| `--config` | none | Load options from a saved JSON config file. |
| `--save-config` | none | Save the current options to a JSON config file. |

File format / encoding

| Flag | Default | Description |
| --- | --- | --- |
| `--sheet` | `0` | Excel sheet name or 0-based index. |
| `--encoding` | auto-detect | Override auto-detected file encoding. |
| `--header-row` | auto-detect | 0-based row index for the header. |

Presets

| Preset | What it does |
| --- | --- |
| `minimal` | Trim + collapse whitespace only. Nothing else. |
| `excel-hygiene` (default) | Trim, collapse, NFC, smart-character fold, zero-width strip, BOM strip, control strip, line-ending normalize. NFKC off. |
| `paranoid` | All of `excel-hygiene` plus NFKC compatibility fold (lossy). |
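
The difference between the presets is easiest to see on a single cell. The sketch below uses only the standard library and covers a subset of the operations (normalization form, smart-character fold, zero-width strip, trim and collapse). The character tables, the order of operations, and the choice to treat only spaces and tabs as collapsible whitespace are assumptions, and `clean_cell` is a hypothetical name.

```python
import re
import unicodedata

SMART_CHARS = str.maketrans({
    "\u2018": "'", "\u2019": "'",  # curly single quotes
    "\u201c": '"', "\u201d": '"',  # curly double quotes
    "\u2013": "-", "\u2014": "-",  # en dash, em dash
    "\u00a0": " ",                 # no-break space
    "\u2026": "...",               # horizontal ellipsis
})
ZERO_WIDTH = re.compile("[\u200b\u200c\u200d\u2060\ufeff]")


def clean_cell(value: str, preset: str = "excel-hygiene") -> str:
    if preset != "minimal":
        form = "NFKC" if preset == "paranoid" else "NFC"
        value = unicodedata.normalize(form, value)
        value = value.translate(SMART_CHARS)
        value = ZERO_WIDTH.sub("", value)
    # Trim and collapse runs of spaces/tabs (every preset does this).
    return re.sub(r"[ \t]+", " ", value).strip(" \t")


raw = "  \u201cHello\u201d\u00a0 world\u2026 "
print(clean_cell(raw))                              # "Hello" world...
print(clean_cell("\u2460 \ufb01le", "paranoid"))    # 1 file
print(clean_cell("\u2460 \ufb01le"))                # unchanged: NFC keeps ① and ﬁ
```

The last two calls show why NFKC is off by default: the compatibility fold rewrites characters that NFC leaves alone, and that rewrite cannot be undone.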

Output Files

When --apply is set:

| File | Description |
| --- | --- |
| `{stem}_cleaned.csv` | Cleaned DataFrame |
| `{stem}_changes.csv` | Per-cell audit: `row`, `column`, `old`, `new`, `ops_applied` (capped to 1000 rows by default; use `--full-changelog` for all) |

A timestamped log is always written to logs/text_clean_YYYYMMDD_HHMMSS.log.

Recipes

# Preview what would change with the safe defaults
python -m src.cli_text_clean messy.csv

# Apply the safe defaults
python -m src.cli_text_clean messy.csv --apply

# Just the basics — only trim and collapse, leave Unicode/quotes alone
python -m src.cli_text_clean messy.csv --preset minimal --apply

# Title-case the name column and upper-case the sku column; other columns keep their case
python -m src.cli_text_clean people.csv --case title:name,upper:sku --apply

# Clean only specific columns
python -m src.cli_text_clean orders.csv --columns vendor,product --apply

# Exclude a free-text notes column from cleaning
python -m src.cli_text_clean tickets.csv --skip notes --apply

# Save the current settings as a profile and reload it later
python -m src.cli_text_clean messy.csv --preset minimal --case upper --save-config my.json
python -m src.cli_text_clean other.csv --config my.json --apply

Analyzer (upload-time scan)

python -m src.cli_analyze INPUT_FILE [OPTIONS]

  --sample-rows N       Cap on rows scanned (default 1000)
  --json                Print findings as a JSON array on stdout
  --strict              Exit non-zero on any warn/error finding

JSON output schema (one object per finding):

{
  "id": "smart_punctuation_in_data",
  "severity": "warn",
  "confidence": "high",
  "fix_action": "fold_smart_punctuation",
  "pre_applied": false,
  "tool": "02_text_cleaner",
  "count": 17,
  "description": "17 cell(s) contain curly quotes…",
  "column": null,
  "samples": [{"row": 3, "column": "name", "value": "“Alice”"}]
}

severity — info / warn / error. Only error blocks the GUI normalization gate.
confidence — high (round-trip-safe, eligible for one-click auto-fix), medium (preview before applying), low (heuristic, opt-in only).
fix_action — stable id naming the algorithm in src/core/fixes.py that resolves the finding. Empty string for informational-only findings.
pre_applied — true for fixes already applied during the byte-level read pass (BOM strip, NUL strip, line-ending normalize, byte-level smart-quote fold, transcode-to-UTF-8 from UTF-16/32). The GUI gate treats these as already-resolved; the CLI emits them so callers can audit what changed during read.

The detector set covers smart punctuation, NBSP / Unicode whitespace, zero-width characters, dirty headers, whitespace padding, null-like sentinels, mojibake fingerprints (UTF-8-as-cp1252), mixed-case email columns, near-duplicate rows (case-and-padding stripped), leading-zero IDs (Excel hazard), mixed line endings, encoding decode failure (encoding_decode_failed), and U+FFFD presence in the loaded text (encoding_uncertain). New detectors plug in by appending one entry to analyze.py and one matching fix in fixes.py.
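
A caller that consumes the `--json` output might triage findings along the lines of the field descriptions above. The sketch below works on a JSON string so it runs without the tool installed; `triage` is a hypothetical helper, and the second finding's field values are invented for illustration (only its id appears in this reference).

```python
import json


def triage(findings_json: str) -> dict:
    """Bucket findings the way the field descriptions suggest a gate would."""
    findings = json.loads(findings_json)
    # pre_applied fixes already happened during the read pass; audit-only.
    pending = [f for f in findings if not f["pre_applied"]]
    fixable = [f for f in pending if f["fix_action"]]
    return {
        "blocking": [f["id"] for f in pending if f["severity"] == "error"],
        "auto_fixable": [f["fix_action"] for f in fixable if f["confidence"] == "high"],
        "needs_review": [f["id"] for f in fixable if f["confidence"] in ("medium", "low")],
        "already_applied": [f["id"] for f in findings if f["pre_applied"]],
    }


sample = json.dumps([
    {   # abridged from the schema example above
        "id": "smart_punctuation_in_data", "severity": "warn", "confidence": "high",
        "fix_action": "fold_smart_punctuation", "pre_applied": False, "count": 17,
    },
    {   # hypothetical values; only the id comes from this reference
        "id": "encoding_decode_failed", "severity": "error", "confidence": "low",
        "fix_action": "", "pre_applied": False, "count": 1,
    },
])
print(triage(sample))
```

In practice the JSON string would come from the standard output of `python -m src.cli_analyze INPUT_FILE --json`.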
