Adds a Review & Normalize page that sits between upload and every tool
page. The analyzer now tags each finding with confidence (high/medium/low)
and a fix_action; the gate auto-applies high-confidence fixes, surfaces
medium/low ones for user review, and blocks tool pages on error-level
findings until resolved or waived.
Core (src/core/):
- analyze.py: Finding gains confidence, fix_action, pre_applied; new
detectors for encoding_uncertain, encoding_decode_failed; new top-
level encoding_override parameter.
- fixes.py: registry of fix algorithms keyed by fix_action id.
- normalize.py: auto_fix(), apply_decisions(), is_normalized(), and
the NormalizationResult / Decision dataclasses the gate consumes.
- io.py: detect_encoding tries strict UTF-8 first; repair_bytes now
transcodes UTF-16/32 to UTF-8 before NUL-strip (fixes UTF-16 corruption)
and normalizes line endings (fixes bare-CR parser crash); empty file
handled gracefully instead of EmptyDataError traceback.
GUI (src/gui/):
- pages/0_Review.py: gate page with per-finding decision controls,
encoding override picker (16 codepages + custom), and Advanced output
options (encoding, delimiter, line terminator) on the download.
- components.py: require_normalization_gate() helper.
- pages/1-9: gate guard wired on every tool page.
Test corpora:
- test-cases/encodings-corpus/: 31 encoded CSV fixtures + 9 reference
UTF-8 files + manifest, synced from Business/DataTools.
- test-cases/text-cleaner-corpus/test_data/17: synced malformed input
(unquoted $1,500.00) for the unquoted-delimiter detector.
Tests (94 new):
- test_normalize.py (48): finding fields, fix registry, auto_fix scope,
decision paths, gate idempotency, output-options helper.
- test_encodings_corpus.py (90, 16 xfailed): parametric detection +
decode + analyzer-no-crash sweep against the manifest.
- test_analyze.py: encoding override + encoding_uncertain detectors.
- test_corpus.py: pre-parse repair in the strict reader.
run_tests.py: new aliases --tool normalize, --tool encodings, --tool gate;
encodings corpus added to --fixtures category.
Docs: USER-GUIDE §3.3 covers the gate workflow, encoding override, and
output options; TECHNICAL §10.2.1-10.2.4 documents the analyzer schema,
gate API, Review page, and pre-parse repair pipeline; CLI-REFERENCE adds
the analyzer JSON schema with the new fields; README links to all of it.
Suite: 765 passed, 17 xfailed (was 458 passed).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CLI Reference
Complete command-line reference for the DataTools bundle.
DataTools ships two CLI modules so each script can be invoked independently:
| Module | Command | Purpose |
|---|---|---|
| `src.cli` | `python -m src.cli INPUT_FILE [OPTIONS]` | Deduplicator (script 01) |
| `src.cli_text_clean` | `python -m src.cli_text_clean INPUT_FILE [OPTIONS]` | Text cleaner (script 02) |
The deduplicator reference follows; the text cleaner reference is in the Text Cleaner CLI section, and the upload-time analyzer (src.cli_analyze) is covered in the Analyzer section at the end.
Deduplicator
python -m src.cli INPUT_FILE [OPTIONS]
Arguments
| Argument | Required | Description |
|---|---|---|
| `INPUT_FILE` | Yes | Path to the CSV, delimited text, or Excel file to deduplicate |
Options
Core
| Flag | Short | Default | Description |
|---|---|---|---|
| `--apply` | | `false` | Write output files. Without this flag, only a preview is shown. |
| `--output` | `-o` | `{input}_deduplicated.csv` | Output file path. |
Column Selection
| Flag | Short | Default | Description |
|---|---|---|---|
| `--subset` | `-s` | auto-detect | Comma-separated columns to match on. When omitted, columns are auto-detected by name pattern (email, phone, name, address). |
| `--key` | `-k` | none | Comma-separated strong-key columns. Each becomes an independent exact-match strategy. Use for identifiers like fb_id, ein, sku. |
Fuzzy Matching
| Flag | Short | Default | Description |
|---|---|---|---|
| `--fuzzy` | | none | Comma-separated columns to fuzzy-match. Other columns in the strategy use exact matching. |
| `--algorithm` | `-a` | `jaro_winkler` | Fuzzy algorithm: levenshtein, jaro_winkler, or token_set_ratio. |
| `--threshold` | `-t` | `85` | Similarity threshold 0-100. Lower values find more matches but increase false positives. |
Normalization
| Flag | Short | Default | Description |
|---|---|---|---|
| `--normalize` | | auto-detect | Column normalizers as col:type pairs, comma-separated. Types: email, phone, name, address, string. |
Normalizer details:
| Type | What it does | Example |
|---|---|---|
| `email` | Lowercase, strip Gmail dots, strip +tag suffixes | John.Doe+tag@gmail.com → johndoe@gmail.com |
| `phone` | Parse to E.164 format; fallback: digits only | (555) 123-4567 → +15551234567 |
| `name` | Strip titles (Dr., Mr.) and suffixes (Jr., PhD), case-fold | Dr. John Smith Jr. → john smith |
| `address` | USPS abbreviations (Street→St, Avenue→Ave), case-fold | 123 Main Street, Suite 4 → 123 main st ste 4 |
| `string` | Trim, collapse whitespace, case-fold | HELLO WORLD → hello world |
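As an illustration of the `email` and `string` rows above, here is a minimal sketch in Python. It is not the bundle's implementation: the function names and the `GMAIL_DOMAINS` set are hypothetical, and it assumes that dot-stripping applies only to Gmail addresses while `+tag` stripping applies to every domain.

```python
# Hypothetical helper names; not taken from the DataTools source.
# Assumption: dots are only insignificant for these Gmail domains.
GMAIL_DOMAINS = {"gmail.com", "googlemail.com"}


def normalize_string(value: str) -> str:
    """Trim, collapse runs of whitespace to one space, and case-fold."""
    return " ".join(value.split()).casefold()


def normalize_email(value: str) -> str:
    """Lowercase, drop a +tag suffix, and strip dots from Gmail local parts."""
    cleaned = value.strip().lower()
    local, separator, domain = cleaned.partition("@")
    if not separator:
        # Not an address at all; return the lowercased text unchanged.
        return cleaned
    local = local.split("+", 1)[0]
    if domain in GMAIL_DOMAINS:
        local = local.replace(".", "")
    return f"{local}@{domain}"
```

With these definitions, `normalize_email("John.Doe+tag@gmail.com")` yields `johndoe@gmail.com`, matching the example in the table. The `phone`, `name`, and `address` normalizers are not sketched here because they depend on parsing rules this reference does not spell out.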
Survivor Selection
| Flag | Short | Default | Description |
|---|---|---|---|
| `--survivor` | | `first` | Which row to keep per duplicate group. |
| `--date-column` | | none | Date column for the `most-recent` rule. |
| `--merge` | | `false` | Fill missing fields in the surviving row from removed duplicates. |
Survivor rules:
| Rule | Behavior |
|---|---|
| `first` | Keep the first row encountered (lowest row number) |
| `last` | Keep the last row encountered (highest row number) |
| `most-complete` | Keep the row with the fewest blank/empty cells |
| `most-recent` | Keep the row with the latest date (requires `--date-column`) |
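The four rules can be sketched as a single selection function over a duplicate group. This is an illustrative sketch only: `pick_survivor` and `blank_count` are hypothetical names, rows are modeled as plain dicts in row order, dates are assumed to be ISO 8601 strings, and ties are assumed to go to the earlier row.

```python
from datetime import datetime


def blank_count(row: dict) -> int:
    """Count cells that are None or contain only whitespace."""
    return sum(1 for v in row.values() if v is None or str(v).strip() == "")


def pick_survivor(rows: list, rule: str = "first", date_column: str = None) -> dict:
    """Return the row to keep from one duplicate group (rows are in row order)."""
    if rule == "first":
        return rows[0]
    if rule == "last":
        return rows[-1]
    if rule == "most-complete":
        # min() returns the earliest row when several share the lowest count.
        return min(rows, key=blank_count)
    if rule == "most-recent":
        if date_column is None:
            raise ValueError("most-recent requires a date column")
        # Assumption: dates are ISO 8601 strings such as "2024-01-05".
        return max(rows, key=lambda r: datetime.fromisoformat(r[date_column]))
    raise ValueError(f"unknown survivor rule: {rule}")
```

For a group where the first row has a blank email but the newer date, `most-complete` picks the second row while `most-recent` picks the first, which is why the two rules are often combined with `--merge`.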
Interactive Review
| Flag | Short | Default | Description |
|---|---|---|---|
| `--review` | | `false` | Interactively review each match group. For each group, choose: merge (y), keep both (n), or skip remaining (s). |
Configuration
| Flag | Short | Default | Description |
|---|---|---|---|
| `--config` | | none | Load all settings from a saved JSON config file. |
| `--save-config` | | none | Save current settings to a JSON config file for reuse. |
File Handling
| Flag | Short | Default | Description |
|---|---|---|---|
| `--sheet` | | first sheet | Excel sheet name or 0-based index. Ignored for CSV files. |
| `--encoding` | | auto-detect | Override auto-detected file encoding (e.g., utf-8, windows-1252). |
| `--header-row` | | auto-detect | 0-based row index for the header row. |
Recipes
1. Basic Dedup (Auto-Detect)
Let the engine detect email, phone, name, and address columns automatically.
# Preview
python -m src.cli customers.csv
# Apply
python -m src.cli customers.csv --apply
The engine scans column names for patterns like email, phone, name, address and builds strategies automatically. Strong keys (email, phone) become standalone strategies; weak keys (name, address) are paired with strong keys.
2. Fuzzy Name Matching
Match rows where names are similar but not identical — catches typos, nickname variations, and formatting differences.
# Fuzzy-match on the "name" column at 80% similarity
python -m src.cli customers.csv --fuzzy name --threshold 80 --apply
# Fuzzy-match on multiple columns
python -m src.cli customers.csv --fuzzy name,address --threshold 85 --apply
# Use Levenshtein distance instead of Jaro-Winkler
python -m src.cli customers.csv --fuzzy name --algorithm levenshtein --threshold 80 --apply
Algorithm comparison:
- `jaro_winkler` (default): best for short strings like names; weights early characters more heavily
- `levenshtein`: edit-distance ratio; works well for typos and transpositions
- `token_set_ratio`: best for addresses and long strings; ignores word order
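To make the interaction between an algorithm and `--threshold` concrete, here is a small pure-Python sketch of an edit-distance ratio on the 0-100 scale. It is an assumption that the score is normalized by the longer string's length; the engine may use a library whose ratio is defined differently, and the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Similarity on a 0-100 scale, normalized by the longer string (assumption)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (longest - levenshtein(a, b)) / longest


def is_match(a: str, b: str, threshold: float = 85) -> bool:
    """A pair matches when its similarity reaches the threshold."""
    return similarity(a, b) >= threshold
```

Under this definition "John Smith" and "Jon Smith" differ by one edit over ten characters, so they score 90 and match at the default threshold of 85 but not at 95.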
3. Custom Strong Keys
Use specific identifier columns to find exact duplicates.
# Deduplicate by Facebook ID
python -m src.cli donors.csv --key fb_id --apply
# Multiple strong keys (each is independent — matched with OR)
python -m src.cli donors.csv --key fb_id,ein --apply
Strong keys are OR'd: a match on fb_id alone OR ein alone marks rows as duplicates.
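One way to picture OR-combined keys is a union-find pass over the rows, one key column at a time. The sketch below is illustrative only: `group_by_any_key` is a hypothetical name, rows are plain dicts, blank key values are assumed never to match, and matches are assumed to chain transitively (A matches B on `fb_id`, B matches C on `ein`, so all three share a group).

```python
def group_by_any_key(rows: list, key_columns: list) -> list:
    """Group row indexes that share a value in ANY of the key columns."""
    parent = list(range(len(rows)))

    def find(i: int) -> int:
        # Follow parent links to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i: int, j: int) -> None:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            # Keep the lower row index as the root so groups stay in row order.
            parent[max(root_i, root_j)] = min(root_i, root_j)

    for column in key_columns:
        first_seen = {}
        for index, row in enumerate(rows):
            value = row.get(column)
            if value in (None, ""):
                continue  # Assumption: blank keys never produce a match.
            if value in first_seen:
                union(first_seen[value], index)
            else:
                first_seen[value] = index

    groups = {}
    for index in range(len(rows)):
        groups.setdefault(find(index), []).append(index)
    return [members for members in groups.values() if len(members) > 1]
```

Rows with no key in common with any other row end up in single-member groups, which the final filter drops.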
4. Merge Mode
Keep the most complete row and fill any remaining blanks from the duplicates.
# Most complete row + merge missing fields
python -m src.cli contacts.csv --survivor most-complete --merge --apply
# Keep most recent row and merge
python -m src.cli contacts.csv --survivor most-recent --date-column updated_at --merge --apply
How merge works: The survivor row keeps all its non-empty fields. For any blank/null fields, the engine fills from the removed rows (in row order). The result is a single row with maximum data retention.
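The fill rule described above can be sketched in a few lines. This is a minimal illustration, assuming rows are dicts and that "blank" means `None` or whitespace-only text; `merge_group` and `is_blank` are hypothetical names.

```python
def is_blank(value) -> bool:
    """Treat None and whitespace-only text as blank (assumption)."""
    return value is None or str(value).strip() == ""


def merge_group(survivor: dict, removed_rows: list) -> dict:
    """Fill the survivor's blank fields from removed rows, in row order."""
    merged = dict(survivor)  # never mutate the caller's row
    for column in list(merged):
        if not is_blank(merged[column]):
            continue  # the survivor's own non-empty values always win
        for donor in removed_rows:
            candidate = donor.get(column)
            if not is_blank(candidate):
                merged[column] = candidate
                break  # the first non-blank donor value wins
    return merged
```

A field stays blank only when every row in the group is blank for that column.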
5. Multi-Column Subset
Match on a specific combination of columns rather than auto-detecting.
# Exact match on email + phone only
python -m src.cli customers.csv --subset email,phone --apply
# Mix exact and fuzzy within a subset
python -m src.cli customers.csv --subset email,name --fuzzy name --threshold 85 --apply
When using --subset, all listed columns must match (AND logic) for a pair to be considered duplicates.
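For the exact-match case, AND logic amounts to grouping rows by the tuple of all subset values. The sketch below is illustrative, with a hypothetical function name and rows modeled as dicts; it ignores normalizers and fuzzy columns.

```python
from collections import defaultdict


def group_by_all_columns(rows: list, subset: list) -> list:
    """Group row indexes whose values agree on EVERY column in the subset."""
    buckets = defaultdict(list)
    for index, row in enumerate(rows):
        # The composite key only collides when all subset columns are equal.
        key = tuple(row.get(column, "") for column in subset)
        buckets[key].append(index)
    return [members for members in buckets.values() if len(members) > 1]
```

Two rows that share an email but differ on phone land in different buckets, so they are not reported as duplicates.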
6. Save and Load Config Profiles
Save your settings for repeatable runs on similar files.
# Save settings to a file
python -m src.cli customers.csv --fuzzy name --threshold 80 --merge \
--survivor most-complete --save-config customer_dedup.json
# Load saved settings
python -m src.cli new_customers.csv --config customer_dedup.json --apply
Config files are JSON. Example:
{
  "strategies": [],
  "survivor_rule": "most_complete",
  "merge": true,
  "default_algorithm": "jaro_winkler",
  "default_threshold": 80.0,
  "fuzzy_columns": ["name"]
}
7. Interactive Review
Step through each match group and decide whether to merge.
python -m src.cli customers.csv --review --apply
For each group, the CLI displays both rows side-by-side and prompts:
============================================================
Match Group 1 — Confidence: 92.3%
Matched on: name, phone
============================================================
Row 1:
name: John Smith
email: john@example.com
phone: (555) 123-4567
Row 2:
name: Jon Smith
email:
phone: 555-123-4567
[y] Merge [n] Keep both [s] Skip remaining:
- y — accept the match; merge/remove duplicate
- n — reject the match; keep both rows
- s — skip all remaining groups (keep both for all)
8. Excel Files and Multi-Sheet
Work with Excel files directly — no CSV conversion needed.
# Deduplicate first sheet (default)
python -m src.cli data.xlsx --apply
# Specify sheet by name
python -m src.cli data.xlsx --sheet "Sales Data" --apply
# Specify sheet by index (0-based)
python -m src.cli data.xlsx --sheet 1 --apply
Output is CSV by default. To write Excel output, pass a path ending in .xlsx to -o:
python -m src.cli data.xlsx -o cleaned.xlsx --apply
Auto-Detection Details
When no --subset or --fuzzy flags are provided, the engine scans column names and builds strategies:
| Column pattern | Detection regex | Algorithm | Threshold | Normalizer | Key type |
|---|---|---|---|---|---|
| Email | `e[-_]?mail` | exact | 100% | email | strong |
| Phone | `phone\|telephone\|mobile\|cell` | exact | 100% | phone | strong |
| Name | `^(name\|full_name\|customer_name\|...)$` | jaro_winkler | 85% | name | weak |
| Address | `address\|street\|addr` | token_set_ratio | 80% | address | weak |
Strategy building rules:
- Strong keys → standalone OR strategies (email match alone is enough)
- Weak keys → paired with each strong key via AND (name match requires email or phone match too)
- No strong keys found → weak keys promoted to standalone
- No patterns matched → exact match on all columns (equivalent to `drop_duplicates`)
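The strategy-building rules above can be sketched as a classifier plus a small builder. This is an illustrative sketch, not the engine's code: the names are hypothetical, matching is assumed to be case-insensitive, and the Name regex uses only the three alternatives shown in the table (the rest of that list is elided in this reference and is not reproduced here).

```python
import re

# Patterns copied from the table above; case-insensitivity is an assumption.
PATTERNS = [
    ("email", re.compile(r"e[-_]?mail", re.IGNORECASE), "strong"),
    ("phone", re.compile(r"phone|telephone|mobile|cell", re.IGNORECASE), "strong"),
    # Only the alternatives listed in the reference; the full list is elided there.
    ("name", re.compile(r"^(name|full_name|customer_name)$", re.IGNORECASE), "weak"),
    ("address", re.compile(r"address|street|addr", re.IGNORECASE), "weak"),
]


def classify(column: str):
    """Return (kind, strength) for the first pattern that matches, else None."""
    for kind, pattern, strength in PATTERNS:
        if pattern.search(column):
            return kind, strength
    return None


def build_strategies(columns: list) -> list:
    """Each strategy is a list of columns that must ALL match (AND);
    the strategies themselves are alternatives (OR)."""
    strong, weak = [], []
    for column in columns:
        hit = classify(column)
        if hit is not None:
            (strong if hit[1] == "strong" else weak).append(column)
    if not strong and not weak:
        return [list(columns)]           # exact match on every column
    if not strong:
        return [[w] for w in weak]       # weak keys promoted to standalone
    strategies = [[s] for s in strong]   # strong keys stand alone
    for w in weak:
        for s in strong:
            strategies.append([s, w])    # weak key paired with each strong key
    return strategies
```

For columns `email, phone, name, notes` this produces four strategies: email alone, phone alone, email with name, and phone with name. The unrecognized `notes` column is ignored.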
Output Files
When --apply is set, three files are written:
| File | Description |
|---|---|
| `{stem}_deduplicated.csv` | Cleaned DataFrame with duplicates removed |
| `{stem}_removed.csv` | Rows that were removed |
| `{stem}_match_groups.csv` | Audit trail with `_group_id`, `_is_survivor`, `_confidence`, `_matched_on`, `_original_row`, plus all original columns |
Logging
Every run writes a timestamped log to logs/dedup_YYYYMMDD_HHMMSS.log with full debug-level details: strategies used, pair comparisons, survivor decisions, and merge actions.
Text Cleaner CLI
Character-level hygiene for CSV / Excel files: whitespace trim and collapse, smart-character folding, Unicode normalization, BOM strip, control-char strip, line-ending normalization, optional case conversion. See TECHNICAL.md Section 10.2 for the full functional spec.
python -m src.cli_text_clean INPUT_FILE [OPTIONS]
Arguments
| Argument | Required | Description |
|---|---|---|
| `INPUT_FILE` | Yes | Path to the CSV, TSV, or Excel file to clean |
Options
Core
| Flag | Short | Default | Description |
|---|---|---|---|
| `--apply` | | `false` | Write output files. Without this flag, only a preview is shown. |
| `--output` | `-o` | `{input}_cleaned.csv` | Output file path. |
| `--preset` | | `excel-hygiene` | Preset bundle of safe defaults. See Presets. |
Scope
| Flag | Default | Description |
|---|---|---|
| `--columns` | all string columns | Comma-separated columns to clean. |
| `--skip` | none | Comma-separated columns to skip even if they look like text. Useful for free-text notes columns you don't want touched. |
Per-operation toggles
These override the active preset.
| Flag | Effect |
|---|---|
| `--no-trim` | Disable leading/trailing whitespace strip |
| `--no-collapse` | Disable internal whitespace collapse |
| `--no-nfc` | Disable Unicode NFC normalization |
| `--nfkc` | Enable NFKC compatibility fold (lossy: ① → 1, ﬁ → fi) |
| `--no-smart-chars` | Disable smart-character folding (curly quotes, em/en-dash, NBSP, ellipsis) |
| `--no-zero-width` | Disable zero-width / invisible character strip |
| `--no-bom` | Disable leading BOM strip |
| `--no-control` | Disable control-character strip |
| `--no-line-endings` | Disable line-ending normalization |
Case conversion
| Flag | Forms | Description |
|---|---|---|
| `--case` | `upper`, `lower`, `title`, `sentence` | Apply this case to every selected column |
| `--case` | `mode:col[,mode:col]` | Per-column case (e.g., `--case title:name,upper:code`) |
Title case preserves all-caps tokens (USA stays USA) and lowercases mid-string particles (of, and, the, etc.).
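The title-case behavior can be approximated as follows. This is a sketch under stated assumptions: the particle set contains only the three words named above (the full list is not given in this reference), tokens are split on single spaces, and a one-letter token is not treated as an all-caps acronym. `title_case` is a hypothetical name.

```python
# Only the particles named in the reference; the real list is longer ("etc.").
PARTICLES = {"of", "and", "the"}


def title_case(text: str) -> str:
    """Capitalize each word, keep all-caps tokens, lowercase mid-string particles."""
    words = text.split(" ")
    result = []
    for position, word in enumerate(words):
        if len(word) > 1 and word.isupper():
            result.append(word)                   # acronym such as USA: keep as-is
        elif position > 0 and word.lower() in PARTICLES:
            result.append(word.lower())           # particle that is not the first word
        else:
            result.append(word[:1].upper() + word[1:].lower())
    return " ".join(result)
```

One consequence of the all-caps rule is that a value typed entirely in capitals, such as "JOHN SMITH", passes through unchanged in this sketch; use the `lower` mode first if that is not wanted.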
Audit and config
| Flag | Default | Description |
|---|---|---|
| `--full-changelog` | `false` | Write every cell change to the audit CSV (default caps to first 1000). |
| `--config` | none | Load options from a saved JSON config file. |
| `--save-config` | none | Save the current options to a JSON config file. |
File format / encoding
| Flag | Default | Description |
|---|---|---|
| `--sheet` | `0` | Excel sheet name or 0-based index. |
| `--encoding` | auto-detect | Override auto-detected file encoding. |
| `--header-row` | auto-detect | 0-based row index for the header. |
Presets
| Preset | What it does |
|---|---|
| `minimal` | Trim + collapse whitespace only. Nothing else. |
| `excel-hygiene` (default) | Trim, collapse, NFC, smart-character fold, zero-width strip, BOM strip, control strip, line-ending normalize. NFKC off. |
| `paranoid` | All of excel-hygiene plus NFKC compatibility fold (lossy). |
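To show how the three presets differ on a single cell, here is a rough Python sketch built on the standard `unicodedata` module. It is not the text cleaner's code: the operation names, the character maps, and the order in which operations run are all assumptions made for illustration.

```python
import re
import unicodedata

# Hypothetical operation names; the preset contents follow the table above.
PRESETS = {
    "minimal": {"trim", "collapse"},
    "excel-hygiene": {"trim", "collapse", "nfc", "smart_chars", "zero_width",
                      "bom", "control", "line_endings"},
}
PRESETS["paranoid"] = PRESETS["excel-hygiene"] | {"nfkc"}

# Assumed folding targets for curly quotes, dashes, NBSP, and ellipsis.
SMART_CHARS = str.maketrans({
    "\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'",
    "\u2013": "-", "\u2014": "-", "\u00a0": " ", "\u2026": "...",
})
# Assumed set of zero-width characters to delete.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060"), None)


def clean_cell(value: str, preset: str = "excel-hygiene") -> str:
    """Apply the operations enabled by a preset, in an assumed order."""
    ops = PRESETS[preset]
    if "bom" in ops:
        value = value.lstrip("\ufeff")
    if "line_endings" in ops:
        value = value.replace("\r\n", "\n").replace("\r", "\n")
    if "control" in ops:
        # Drop control characters (category Cc) but keep tab and newline.
        value = "".join(ch for ch in value
                        if ch in "\t\n" or unicodedata.category(ch) != "Cc")
    if "zero_width" in ops:
        value = value.translate(ZERO_WIDTH)
    if "smart_chars" in ops:
        value = value.translate(SMART_CHARS)
    if "nfkc" in ops:
        value = unicodedata.normalize("NFKC", value)  # lossy compatibility fold
    elif "nfc" in ops:
        value = unicodedata.normalize("NFC", value)
    if "collapse" in ops:
        value = re.sub(r"[ \t]+", " ", value)
    if "trim" in ops:
        value = value.strip()
    return value
```

Under `minimal`, curly quotes survive untouched; under `excel-hygiene` they fold to straight quotes; only `paranoid` rewrites compatibility characters such as the circled digit ① or the ﬁ ligature.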
Output Files
When --apply is set:
| File | Description |
|---|---|
| `{stem}_cleaned.csv` | Cleaned DataFrame |
| `{stem}_changes.csv` | Per-cell audit: row, column, old, new, ops_applied (capped to 1000 rows by default; use `--full-changelog` for all) |
A timestamped log is always written to logs/text_clean_YYYYMMDD_HHMMSS.log.
Recipes
# Preview what would change with the safe defaults
python -m src.cli_text_clean messy.csv
# Apply the safe defaults
python -m src.cli_text_clean messy.csv --apply
# Just the basics — only trim and collapse, leave Unicode/quotes alone
python -m src.cli_text_clean messy.csv --preset minimal --apply
# Title-case the name column, upper-case the SKU column, leave others alone for case
python -m src.cli_text_clean people.csv --case title:name,upper:sku --apply
# Clean only specific columns
python -m src.cli_text_clean orders.csv --columns vendor,product --apply
# Skip a free-text notes column from cleaning
python -m src.cli_text_clean tickets.csv --skip notes --apply
# Save the current settings as a profile and reload it later
python -m src.cli_text_clean messy.csv --preset minimal --case upper --save-config my.json
python -m src.cli_text_clean other.csv --config my.json --apply
Analyzer (upload-time scan)
python -m src.cli_analyze INPUT_FILE [OPTIONS]
--sample-rows N Cap on rows scanned (default 1000)
--json Print findings as a JSON array on stdout
--strict Exit non-zero on any warn/error finding
JSON output schema (one object per finding):
{
  "id": "smart_punctuation_in_data",
  "severity": "warn",
  "confidence": "high",
  "fix_action": "fold_smart_punctuation",
  "pre_applied": false,
  "tool": "02_text_cleaner",
  "count": 17,
  "description": "17 cell(s) contain curly quotes…",
  "column": null,
  "samples": [{"row": 3, "column": "name", "value": "“Alice”"}]
}
- `severity`: `info` / `warn` / `error`. Only `error` blocks the GUI normalization gate.
- `confidence`: `high` (round-trip-safe, eligible for one-click auto-fix), `medium` (preview before applying), `low` (heuristic, opt-in only).
- `fix_action`: stable id naming the algorithm in `src/core/fixes.py` that resolves the finding. Empty string for informational-only findings.
- `pre_applied`: `true` for fixes already applied during the byte-level read pass (BOM strip, NUL strip, line-ending normalize, byte-level smart-quote fold, transcode-to-UTF-8 from UTF-16/32). The GUI gate treats these as already-resolved; the CLI emits them so callers can audit what changed during read.
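A caller consuming `--json` output might sort findings into buckets along the lines of the field descriptions above. The sketch below is one possible reading, not the gate's actual logic: the bucket names and both function names are hypothetical, the sample records other than the first are illustrative values, and whether `pre_applied` findings count toward `--strict` is an assumption.

```python
import json


def triage(findings: list) -> dict:
    """Sort finding ids into buckets based on the documented fields."""
    buckets = {"already_applied": [], "auto_fix": [], "blocking": [],
               "needs_review": [], "informational": []}
    for finding in findings:
        if finding.get("pre_applied"):
            bucket = "already_applied"       # fixed during the read pass
        elif finding.get("confidence") == "high" and finding.get("fix_action"):
            bucket = "auto_fix"              # eligible for one-click auto-fix
        elif finding.get("severity") == "error":
            bucket = "blocking"              # must be resolved or waived
        elif finding.get("fix_action"):
            bucket = "needs_review"          # medium/low confidence fixes
        else:
            bucket = "informational"
        buckets[bucket].append(finding["id"])
    return buckets


def strict_exit_code(findings: list) -> int:
    """Mimic --strict: non-zero when any finding is warn or error.
    Assumption: pre_applied findings are counted like any other."""
    return 1 if any(f.get("severity") in ("warn", "error") for f in findings) else 0


# Illustrative sample; only the first record mirrors the schema example above.
SAMPLE = json.loads("""
[
  {"id": "smart_punctuation_in_data", "severity": "warn", "confidence": "high",
   "fix_action": "fold_smart_punctuation", "pre_applied": false},
  {"id": "encoding_decode_failed", "severity": "error", "confidence": "low",
   "fix_action": "", "pre_applied": false},
  {"id": "encoding_uncertain", "severity": "warn", "confidence": "medium",
   "fix_action": "hypothetical_fix", "pre_applied": false}
]
""")
```

In practice the list would come from `json.loads` applied to the stdout of `python -m src.cli_analyze INPUT_FILE --json`.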
The detector set covers smart punctuation, NBSP / Unicode whitespace, zero-width characters, dirty headers, whitespace padding, null-like sentinels, mojibake fingerprints (UTF-8-as-cp1252), mixed-case email columns, near-duplicate rows (case-and-padding stripped), leading-zero IDs (Excel hazard), mixed line endings, encoding decode failure (encoding_decode_failed), and U+FFFD presence in the loaded text (encoding_uncertain). New detectors plug in by appending one entry to analyze.py and one matching fix in fixes.py.
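The "one detector entry plus one matching fix" pattern could look roughly like the sketch below. The real layout of analyze.py and fixes.py is not shown in this reference, so the `DETECTORS` list, the `FIXES` dict, and every function name here are hypothetical; only the `smart_punctuation_in_data` and `fold_smart_punctuation` ids come from the schema example.

```python
# Hypothetical stand-in for fixes.py: a registry keyed by fix_action id.
SMART_QUOTES = {"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"}


def fold_smart_punctuation(value: str) -> str:
    """Replace curly quotes with their straight equivalents."""
    return value.translate(str.maketrans(SMART_QUOTES))


FIXES = {"fold_smart_punctuation": fold_smart_punctuation}


# Hypothetical stand-in for analyze.py: each detector returns a finding or None.
def detect_smart_punctuation(cells: list):
    hits = [cell for cell in cells if any(ch in cell for ch in SMART_QUOTES)]
    if not hits:
        return None
    return {
        "id": "smart_punctuation_in_data",
        "severity": "warn",
        "confidence": "high",
        "fix_action": "fold_smart_punctuation",
        "count": len(hits),
    }


DETECTORS = [detect_smart_punctuation]  # a new detector is one more entry here


def analyze(cells: list) -> list:
    """Run every detector and keep the findings that fired."""
    return [f for f in (detect(cells) for detect in DETECTORS) if f is not None]


def apply_fix(finding: dict, cells: list) -> list:
    """Look up the fix by its fix_action id and apply it to every cell."""
    fix = FIXES[finding["fix_action"]]
    return [fix(cell) for cell in cells]
```

Keying the registry by a stable string id is what lets the JSON output, the GUI gate, and the fix code refer to the same algorithm without importing each other.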