Michael 54f92ae47e feat: implement text cleaner (script 02) with CLI, GUI, and tests

Builds 02_text_cleaner.py from stub to working: character-level hygiene
for CSV/Excel inputs covering trim, whitespace collapse, smart-character
folding, Unicode NFC/NFKC, BOM strip, zero-width strip, control-char
strip, line-ending normalization, and per-column case conversion. Three
presets (minimal/excel-hygiene/paranoid) keep the buyer surface small.

- src/core/text_clean.py: pure helpers + CleanOptions/CleanResult +
  clean_dataframe with dtype-safe column selection
- src/cli_text_clean.py: Typer CLI mirroring the dedup CLI shape
  (dry-run by default, --apply writes cleaned + changes audit, JSON
  config save/load)
- src/gui/pages/2_Text_Cleaner.py: real Streamlit page with preset
  picker, advanced toggles, preview, before/after metrics, and three
  download buttons
- tests/test_text_clean.py + test_cli_text_clean.py: 92 new tests
  covering edge cases E1-E50 from the spec
- samples/messy_text.csv: demo dataset surfacing UC1, UC3, UC6, UC10
  in 10 rows
- test-cases/uc16-uc26 + ec05-ec09: per-use-case and per-edge-case
  fixtures

Docs: TECHNICAL.md §10.2 (full Tier 1/2/3 spec), DECISIONS.md v1.7
entry locking the spec, CLI-REFERENCE.md gains the text cleaner
section, README.md gains a top-level Text Cleaner block, USER-GUIDE.md
status row 02 promoted Skeleton -> Working.

200/200 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-29 15:14:15 +00:00

15 KiB

CLI Reference

Complete command-line reference for the DataTools bundle.

DataTools ships two CLI modules so each script can be invoked independently:

Module	Command	Purpose
`src.cli`	`python -m src.cli INPUT_FILE [OPTIONS]`	Deduplicator (script 01)
`src.cli_text_clean`	`python -m src.cli_text_clean INPUT_FILE [OPTIONS]`	Text cleaner (script 02)

The deduplicator reference comes first; the text cleaner reference follows in the Text Cleaner CLI section.

Deduplicator

python -m src.cli INPUT_FILE [OPTIONS]

Arguments

Argument	Required	Description
`INPUT_FILE`	Yes	Path to the CSV, delimited text, or Excel file to deduplicate

Options

Core

Flag	Short	Default	Description
`--apply`		`false`	Write output files. Without this flag, only a preview is shown.
`--output`	`-o`	`{stem}_deduplicated.csv`	Output file path. `{stem}` is the input file name without its extension.

Column Selection

Flag	Short	Default	Description
`--subset`	`-s`	auto-detect	Comma-separated columns to match on. When omitted, columns are auto-detected by name pattern (email, phone, name, address).
`--key`	`-k`	none	Comma-separated strong-key columns. Each becomes an independent exact-match strategy. Use for identifiers like `fb_id`, `ein`, `sku`.

Fuzzy Matching

Flag	Short	Default	Description
`--fuzzy`		none	Comma-separated columns to fuzzy-match. Other columns in the strategy use exact matching.
`--algorithm`	`-a`	`jaro_winkler`	Fuzzy algorithm: `levenshtein`, `jaro_winkler`, or `token_set_ratio`.
`--threshold`	`-t`	`85`	Similarity threshold 0-100. Lower values find more matches but increase false positives.

Normalization

Flag	Short	Default	Description
`--normalize`		auto-detect	Column normalizers as `col:type` pairs, comma-separated. Types: `email`, `phone`, `name`, `address`, `string`.

Normalizer details:

Type	What it does	Example
`email`	Lowercase, strip Gmail dots, strip `+tag` suffixes	`John.Doe+tag@gmail.com` → `johndoe@gmail.com`
`phone`	Parse to E.164 format; fallback: digits only	`(555) 123-4567` → `+15551234567`
`name`	Strip titles (Dr., Mr.) and suffixes (Jr., PhD), case-fold	`Dr. John Smith Jr.` → `john smith`
`address`	USPS abbreviations (Street→St, Avenue→Ave), case-fold	`123 Main Street, Suite 4` → `123 main st ste 4`
`string`	Trim, collapse whitespace, case-fold	`HELLO WORLD` → `hello world`
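
The table above describes each normalizer in words. As a rough illustration only, the `email` and `string` rows could be sketched like this. `normalize_email` and `normalize_string` are hypothetical names chosen for this example, not the functions DataTools ships, and the real normalizers may cover more cases.

```python
import re


def normalize_string(value: str) -> str:
    """Trim, collapse internal whitespace runs to one space, and case-fold."""
    return re.sub(r"\s+", " ", value.strip()).casefold()


def normalize_email(value: str) -> str:
    """Lowercase, drop a +tag suffix, and drop dots in Gmail local parts."""
    cleaned = value.strip().lower()
    local, sep, domain = cleaned.rpartition("@")
    if not sep:
        return cleaned  # no "@" present; nothing more to normalize
    local = local.split("+", 1)[0]       # strip "+tag"
    if domain == "gmail.com":            # dots are only dropped for Gmail
        local = local.replace(".", "")
    return f"{local}@{domain}"


print(normalize_email("John.Doe+tag@gmail.com"))  # johndoe@gmail.com
print(normalize_string("  HELLO   WORLD "))       # hello world
```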

Survivor Selection

Flag	Default	Description
`--survivor`	`first`	Which row to keep per duplicate group.
`--date-column`	none	Date column for the `most-recent` rule.
`--merge`	`false`	Fill missing fields in the surviving row from removed duplicates.

Survivor rules:

Rule	Behavior
`first`	Keep the first row encountered (lowest row number)
`last`	Keep the last row encountered (highest row number)
`most-complete`	Keep the row with the fewest blank/empty cells
`most-recent`	Keep the row with the latest date (requires `--date-column`)
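
The four rules can be read as a choice of index within a duplicate group. The sketch below is an illustration under stated assumptions, not the engine's code: `pick_survivor` and `is_blank` are hypothetical helpers, rows are plain dicts, dates are ISO strings that sort correctly as text, and ties go to the earliest row.

```python
def is_blank(value) -> bool:
    """Treat None and whitespace-only strings as blank."""
    return value is None or str(value).strip() == ""


def pick_survivor(group, rule="first", date_column=None) -> int:
    """Return the position within `group` of the row to keep."""
    if rule == "first":
        return 0
    if rule == "last":
        return len(group) - 1
    if rule == "most-complete":
        blanks = [sum(is_blank(v) for v in row.values()) for row in group]
        return blanks.index(min(blanks))  # earliest row wins a tie
    if rule == "most-recent":
        if date_column is None:
            raise ValueError("most-recent requires a date column")
        # max() returns the first maximal element, so ties keep the earlier row
        return max(range(len(group)), key=lambda i: group[i][date_column])
    raise ValueError(f"unknown survivor rule: {rule}")


group = [
    {"name": "John Smith", "phone": "", "updated_at": "2024-01-05"},
    {"name": "Jon Smith", "phone": "555-123-4567", "updated_at": "2023-11-20"},
]
print(pick_survivor(group, "most-complete"))              # 1
print(pick_survivor(group, "most-recent", "updated_at"))  # 0
```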

Interactive Review

Flag	Short	Default	Description
`--review`		`false`	Interactively review each match group. For each group, choose: merge (y), keep both (n), or skip remaining (s).

Configuration

Flag	Short	Default	Description
`--config`		none	Load all settings from a saved JSON config file.
`--save-config`		none	Save current settings to a JSON config file for reuse.

File Handling

Flag	Default	Description
`--sheet`	first sheet	Excel sheet name or 0-based index. Ignored for CSV files.
`--encoding`	auto-detect	Override auto-detected file encoding (e.g., `utf-8`, `windows-1252`).
`--header-row`	auto-detect	0-based row index for the header row.

Recipes

1. Basic Dedup (Auto-Detect)

Let the engine detect email, phone, name, and address columns automatically.

# Preview
python -m src.cli customers.csv

# Apply
python -m src.cli customers.csv --apply

The engine scans column names for patterns like email, phone, name, address and builds strategies automatically. Strong keys (email, phone) become standalone strategies; weak keys (name, address) are paired with strong keys.

2. Fuzzy Name Matching

Match rows where names are similar but not identical — catches typos, nickname variations, and formatting differences.

# Fuzzy-match on the "name" column at 80% similarity
python -m src.cli customers.csv --fuzzy name --threshold 80 --apply

# Fuzzy-match on multiple columns
python -m src.cli customers.csv --fuzzy name,address --threshold 85 --apply

# Use Levenshtein distance instead of Jaro-Winkler
python -m src.cli customers.csv --fuzzy name --algorithm levenshtein --threshold 80 --apply

Algorithm comparison:

jaro_winkler (default): best for short strings like names; gives extra weight to a shared prefix
levenshtein: edit-distance ratio; works well for typos (two swapped adjacent letters count as two edits)
token_set_ratio: best for addresses and long strings; ignores word order
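
To make the 0-100 threshold concrete, here is a minimal pure-Python sketch of an edit-distance similarity. The scaling used here (distance divided by the longer length) is an assumption for illustration; the library behind `--algorithm levenshtein` may normalize differently, so the exact scores need not match the CLI.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: insertions, deletions, substitutions cost 1."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Scale the distance to 0-100 (assumed normalization, see lead-in)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (longest - levenshtein(a, b)) / longest


print(similarity("John Smith", "Jon Smith"))  # 90.0
print(levenshtein("form", "from"))            # 2
```

With this scaling, "John Smith" and "Jon Smith" pass a threshold of 85 but would fail at 95.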

3. Custom Strong Keys

Use specific identifier columns to find exact duplicates.

# Deduplicate by Facebook ID
python -m src.cli donors.csv --key fb_id --apply

# Multiple strong keys (each is independent — matched with OR)
python -m src.cli donors.csv --key fb_id,ein --apply

Strong keys are OR'd: a match on fb_id alone OR ein alone marks rows as duplicates.
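
One way to picture OR'd keys is a union-find pass per key column. This is a sketch under assumptions, not the engine's implementation: `group_by_strong_keys` is a hypothetical helper, blank values never match each other, and matches are treated as transitive (if row A matches B on `fb_id` and B matches C on `ein`, all three land in one group).

```python
def group_by_strong_keys(rows, keys):
    """Group row indices that share a non-blank value in ANY key column."""
    parent = list(range(len(rows)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)  # lowest index is the root

    for key in keys:                 # each key is an independent strategy
        first_seen = {}
        for i, row in enumerate(rows):
            value = row.get(key)
            if value in (None, ""):
                continue             # assumption: blanks do not match
            if value in first_seen:
                union(first_seen[value], i)
            else:
                first_seen[value] = i

    groups = {}
    for i in range(len(rows)):
        groups.setdefault(find(i), []).append(i)
    return [members for members in groups.values() if len(members) > 1]


rows = [
    {"fb_id": "100", "ein": "12-345"},
    {"fb_id": "100", "ein": ""},
    {"fb_id": "200", "ein": "12-345"},
    {"fb_id": "300", "ein": "99-999"},
]
print(group_by_strong_keys(rows, ["fb_id", "ein"]))  # [[0, 1, 2]]
```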

4. Merge Mode

Keep the most complete row and fill any remaining blanks from the duplicates.

# Most complete row + merge missing fields
python -m src.cli contacts.csv --survivor most-complete --merge --apply

# Keep most recent row and merge
python -m src.cli contacts.csv --survivor most-recent --date-column updated_at --merge --apply

How merge works: The survivor row keeps all its non-empty fields. For any blank/null fields, the engine fills from the removed rows (in row order). The result is a single row with maximum data retention.
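
The fill rule described above can be sketched in a few lines. `merge_group` and `is_blank` are hypothetical names, and rows are modelled as plain dicts rather than DataFrame rows.

```python
def is_blank(value) -> bool:
    """Treat None and whitespace-only strings as blank."""
    return value is None or str(value).strip() == ""


def merge_group(survivor, removed):
    """Fill the survivor's blank fields from removed rows, in row order."""
    merged = dict(survivor)
    for field in list(merged):
        if not is_blank(merged[field]):
            continue                     # survivor's own values always win
        for row in removed:
            candidate = row.get(field)
            if not is_blank(candidate):
                merged[field] = candidate
                break                    # first non-blank donor wins
    return merged


survivor = {"name": "John Smith", "email": "john@example.com", "phone": ""}
removed = [
    {"name": "Jon Smith", "email": "", "phone": "555-123-4567"},
    {"name": "J. Smith", "email": "js@example.com", "phone": "555-000-0000"},
]
print(merge_group(survivor, removed)["phone"])  # 555-123-4567
```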

5. Multi-Column Subset

Match on a specific combination of columns rather than auto-detecting.

# Exact match on email + phone only
python -m src.cli customers.csv --subset email,phone --apply

# Mix exact and fuzzy within a subset
python -m src.cli customers.csv --subset email,name --fuzzy name --threshold 85 --apply

When using --subset, all listed columns must match (AND logic) for a pair to be considered duplicates.
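
The AND logic, including a mix of exact and fuzzy columns, might look roughly like this. `is_duplicate_pair` is a hypothetical helper, and `difflib.SequenceMatcher` is used only as a stand-in scorer available in the standard library; it is not one of the CLI's three algorithms. The sketch also compares raw values, whereas the engine may normalize them first.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """0-100 similarity from difflib (stand-in for the real algorithms)."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def is_duplicate_pair(row_a, row_b, subset, fuzzy=(), threshold=85.0):
    """True only if EVERY subset column matches (exact, or fuzzy if listed)."""
    for column in subset:
        a, b = row_a[column], row_b[column]
        if column in fuzzy:
            if similarity(a, b) < threshold:
                return False
        elif a != b:
            return False
    return True


a = {"email": "john@example.com", "name": "John Smith"}
b = {"email": "john@example.com", "name": "Jon Smith"}
c = {"email": "other@example.com", "name": "John Smith"}

print(is_duplicate_pair(a, b, ["email", "name"]))                  # False
print(is_duplicate_pair(a, b, ["email", "name"], fuzzy=["name"]))  # True
print(is_duplicate_pair(a, c, ["email", "name"], fuzzy=["name"]))  # False
```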

6. Save and Load Config Profiles

Save your settings for repeatable runs on similar files.

# Save settings to a file
python -m src.cli customers.csv --fuzzy name --threshold 80 --merge \
    --survivor most-complete --save-config customer_dedup.json

# Load saved settings
python -m src.cli new_customers.csv --config customer_dedup.json --apply

Config files are JSON. Example:

{
  "strategies": [],
  "survivor_rule": "most_complete",
  "merge": true,
  "default_algorithm": "jaro_winkler",
  "default_threshold": 80.0,
  "fuzzy_columns": ["name"]
}

7. Interactive Review

Step through each match group and decide whether to merge.

python -m src.cli customers.csv --review --apply

For each group, the CLI lists the matched rows and prompts:

============================================================
Match Group 1 — Confidence: 92.3%
Matched on: name, phone
============================================================

  Row 1:
    name: John Smith
    email: john@example.com
    phone: (555) 123-4567

  Row 2:
    name: Jon Smith
    email:
    phone: 555-123-4567

  [y] Merge  [n] Keep both  [s] Skip remaining:

y — accept the match; merge/remove duplicate
n — reject the match; keep both rows
s — skip all remaining groups (keep both for all)

8. Excel Files and Multi-Sheet

Work with Excel files directly — no CSV conversion needed.

# Deduplicate first sheet (default)
python -m src.cli data.xlsx --apply

# Specify sheet by name
python -m src.cli data.xlsx --sheet "Sales Data" --apply

# Specify sheet by index (0-based)
python -m src.cli data.xlsx --sheet 1 --apply

Output is CSV by default. To write an Excel file instead, pass an .xlsx path to -o:

python -m src.cli data.xlsx -o cleaned.xlsx --apply

Auto-Detection Details

When no --subset or --fuzzy flags are provided, the engine scans column names and builds strategies:

Column pattern	Detection regex	Algorithm	Threshold	Normalizer	Key type
Email	`e[-_]?mail`	exact	100%	email	strong
Phone	`phone|telephone|mobile|cell`	exact	100%	phone	strong
Name	`^(name|full_name|customer_name|...)$`	jaro_winkler	85%	name	weak
Address	`address\|street\|addr`	token_set_ratio	80%	address	weak

Strategy building rules:

Strong keys → standalone OR strategies (email match alone is enough)
Weak keys → paired with each strong key via AND (name match requires email or phone match too)
No strong keys found → weak keys promoted to standalone
No patterns matched → exact match on all columns (equivalent to drop_duplicates)
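
The four rules above can be sketched as a small function that turns column names into strategies. This is an illustration only: `build_strategies` is a hypothetical name, case-insensitive matching is an assumption, and the name pattern includes only the three alternatives the table spells out (the elided ones are left out). Each inner list is a set of columns that must all match (AND); the outer list is OR'd.

```python
import re

# (pattern, key type) per detected kind, taken from the table above.
PATTERNS = [
    (re.compile(r"e[-_]?mail", re.I), "strong"),
    (re.compile(r"phone|telephone|mobile|cell", re.I), "strong"),
    (re.compile(r"^(name|full_name|customer_name)$", re.I), "weak"),
    (re.compile(r"address|street|addr", re.I), "weak"),
]


def build_strategies(columns):
    strong, weak = [], []
    for column in columns:
        for regex, key_type in PATTERNS:
            if regex.search(column):
                (strong if key_type == "strong" else weak).append(column)
                break                      # first matching pattern wins
    if not strong and not weak:
        return [list(columns)]             # exact match on all columns
    if not strong:
        return [[w] for w in weak]         # weak keys promoted to standalone
    strategies = [[s] for s in strong]     # each strong key stands alone
    strategies += [[w, s] for w in weak for s in strong]  # weak AND strong
    return strategies


print(build_strategies(["Email", "Phone", "name", "notes"]))
# [['Email'], ['Phone'], ['name', 'Email'], ['name', 'Phone']]
```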

Output Files

When --apply is set, three files are written:

File	Description
`{stem}_deduplicated.csv`	Cleaned DataFrame with duplicates removed
`{stem}_removed.csv`	Rows that were removed
`{stem}_match_groups.csv`	Audit trail with `_group_id`, `_is_survivor`, `_confidence`, `_matched_on`, `_original_row`, plus all original columns

Logging

Every run writes a timestamped log to logs/dedup_YYYYMMDD_HHMMSS.log with full debug-level details: strategies used, pair comparisons, survivor decisions, and merge actions.

Text Cleaner CLI

Character-level hygiene for CSV / Excel files: whitespace trim and collapse, smart-character folding, Unicode normalization, BOM strip, control-char strip, line-ending normalization, optional case conversion. See TECHNICAL.md Section 10.2 for the full functional spec.

python -m src.cli_text_clean INPUT_FILE [OPTIONS]

Arguments

Argument	Required	Description
`INPUT_FILE`	Yes	Path to the CSV, TSV, or Excel file to clean

Options

Core

Flag	Short	Default	Description
`--apply`		`false`	Write output files. Without this flag, only a preview is shown.
`--output`	`-o`	`{stem}_cleaned.csv`	Output file path. `{stem}` is the input file name without its extension.
`--preset`		`excel-hygiene`	Preset bundle of safe defaults. See Presets.

Scope

Flag	Default	Description
`--columns`	all string columns	Comma-separated columns to clean.
`--skip`	none	Comma-separated columns to skip even if they look like text. Useful for free-text notes columns you don't want touched.

Per-operation toggles

These override the active preset.

Flag	Effect
`--no-trim`	Disable leading/trailing whitespace strip
`--no-collapse`	Disable internal whitespace collapse
`--no-nfc`	Disable Unicode NFC normalization
`--nfkc`	Enable NFKC compatibility fold (lossy: `①` → `1`, `ﬁ` → `fi`)
`--no-smart-chars`	Disable smart-character folding (curly quotes, em/en-dash, NBSP, ellipsis)
`--no-zero-width`	Disable zero-width / invisible character strip
`--no-bom`	Disable leading BOM strip
`--no-control`	Disable control-character strip
`--no-line-endings`	Disable line-ending normalization

Case conversion

Flag	Forms	Description
`--case`	`upper`, `lower`, `title`, `sentence`	Apply this case to every selected column
`--case`	`mode:col[,mode:col]`	Per-column case (e.g., `--case title:name,upper:code`)

Title case preserves all-caps tokens (USA stays USA) and lowercases mid-string particles (of, and, the, etc.).
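
A sketch of how the two `--case` forms and the title-case rule could be handled. `parse_case_spec` and `title_case` are hypothetical names. The particle set below contains only the three words named above, so the real list is likely longer, and checking particles before all-caps tokens is an assumption about precedence.

```python
MODES = {"upper", "lower", "title", "sentence"}
PARTICLES = {"of", "and", "the"}  # only the particles named in the text


def parse_case_spec(spec: str):
    """Return (global_mode, per_column) for a --case value."""
    if ":" not in spec:
        if spec not in MODES:
            raise ValueError(f"unknown case mode: {spec}")
        return spec, {}
    per_column = {}
    for pair in spec.split(","):
        mode, _, column = pair.partition(":")
        mode, column = mode.strip(), column.strip()
        if mode not in MODES or not column:
            raise ValueError(f"bad case pair: {pair!r}")
        per_column[column] = mode
    return None, per_column


def title_case(text: str) -> str:
    out = []
    for index, word in enumerate(text.split(" ")):
        if index > 0 and word.lower() in PARTICLES:
            out.append(word.lower())            # mid-string particle
        elif len(word) > 1 and word.isupper():
            out.append(word)                    # all-caps token stays as is
        else:
            out.append(word[:1].upper() + word[1:].lower())
    return " ".join(out)


print(parse_case_spec("title:name,upper:code"))
# (None, {'name': 'title', 'code': 'upper'})
print(title_case("the lord OF the rings"))  # The Lord of the Rings
print(title_case("made in the USA"))        # Made In the USA
```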

Audit and config

Flag	Default	Description
`--full-changelog`	`false`	Write every cell change to the audit CSV. By default only the first 1000 changes are written.
`--config`	none	Load options from a saved JSON config file.
`--save-config`	none	Save the current options to a JSON config file.

File format / encoding

Flag	Default	Description
`--sheet`	`0`	Excel sheet name or 0-based index.
`--encoding`	auto-detect	Override auto-detected file encoding.
`--header-row`	auto-detect	0-based row index for the header.

Presets

Preset	What it does
`minimal`	Trim + collapse whitespace only. Nothing else.
`excel-hygiene` (default)	Trim, collapse, NFC, smart-character fold, zero-width strip, BOM strip, control strip, line-ending normalize. NFKC off.
`paranoid`	All of `excel-hygiene` plus NFKC compatibility fold (lossy).
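
The presets are bundles of the per-operation toggles. The following sketch shows one plausible pipeline using only the standard library. It is not the shipped `clean_dataframe`: `clean_text`, the order of operations, the exact smart-character and zero-width sets, normalizing line endings to `\n`, and collapsing only spaces and tabs are all assumptions made for illustration.

```python
import re
import unicodedata

EXCEL_HYGIENE = {"trim", "collapse", "nfc", "smart_chars", "zero_width",
                 "bom", "control", "line_endings"}
PRESETS = {
    "minimal": {"trim", "collapse"},
    "excel-hygiene": EXCEL_HYGIENE,
    "paranoid": EXCEL_HYGIENE | {"nfkc"},
}
SMART_CHARS = {
    "\u2018": "'", "\u2019": "'",    # curly single quotes
    "\u201c": '"', "\u201d": '"',    # curly double quotes
    "\u2013": "-", "\u2014": "-",    # en dash, em dash
    "\u00a0": " ",                   # NBSP
    "\u2026": "...",                 # ellipsis
}
# Mapping code points to None makes str.translate delete them.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060"))


def clean_text(text: str, preset: str = "excel-hygiene") -> str:
    ops = PRESETS[preset]
    if "bom" in ops and text.startswith("\ufeff"):
        text = text[1:]
    if "line_endings" in ops:
        text = text.replace("\r\n", "\n").replace("\r", "\n")
    if "control" in ops:  # drop Cc characters but keep newlines and tabs
        text = "".join(ch for ch in text
                       if unicodedata.category(ch) != "Cc" or ch in "\n\r\t")
    if "zero_width" in ops:
        text = text.translate(ZERO_WIDTH)
    if "smart_chars" in ops:
        for smart, plain in SMART_CHARS.items():
            text = text.replace(smart, plain)
    if "nfkc" in ops:
        text = unicodedata.normalize("NFKC", text)
    elif "nfc" in ops:
        text = unicodedata.normalize("NFC", text)
    if "collapse" in ops:
        text = re.sub(r"[ \t]+", " ", text)
    if "trim" in ops:
        text = text.strip()
    return text


print(clean_text("\ufeff  \u201cHello\u201d\u00a0 world\u200b\u2026 "))  # "Hello" world...
print(clean_text("\u2460\ufb01", "paranoid"))                            # 1fi
```

The second call shows why NFKC is described as lossy: the circled digit and the ligature cannot be recovered from `1fi`.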

Output Files

When --apply is set:

File	Description
`{stem}_cleaned.csv`	Cleaned DataFrame
`{stem}_changes.csv`	Per-cell audit: `row`, `column`, `old`, `new`, `ops_applied` (capped to 1000 rows by default; use `--full-changelog` for all)

A timestamped log is always written to logs/text_clean_YYYYMMDD_HHMMSS.log.
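
The audit file's columns and the 1000-row cap can be pictured with a small sketch. `audit_changes` and `trim_only` are hypothetical, rows are plain dicts, row numbers are 0-based, and `ops_applied` is joined with commas; the real file may use different conventions on all three points.

```python
def audit_changes(rows, clean, full_changelog=False, cap=1000):
    """Return (recorded changes, total changes) for string cells."""
    changes, total = [], 0
    for row_index, row in enumerate(rows):
        for column, old in row.items():
            if not isinstance(old, str):
                continue                      # only text cells are cleaned
            new, ops = clean(old)
            if new == old:
                continue
            total += 1
            if full_changelog or len(changes) < cap:
                changes.append({"row": row_index, "column": column,
                                "old": old, "new": new,
                                "ops_applied": ",".join(ops)})
    return changes, total


def trim_only(value):
    """Toy cleaner used for the demo: strip and report the op if it fired."""
    stripped = value.strip()
    return stripped, (["trim"] if stripped != value else [])


rows = [{"a": " x", "b": "y "}, {"a": "z", "b": " w "}]
recorded, total = audit_changes(rows, trim_only, cap=2)
print(len(recorded), total)  # 2 3
```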

Recipes

# Preview what would change with the safe defaults
python -m src.cli_text_clean messy.csv

# Apply the safe defaults
python -m src.cli_text_clean messy.csv --apply

# Just the basics — only trim and collapse, leave Unicode/quotes alone
python -m src.cli_text_clean messy.csv --preset minimal --apply

# Title-case the name column, upper-case the SKU column, leave case unchanged elsewhere
python -m src.cli_text_clean people.csv --case title:name,upper:sku --apply

# Clean only specific columns
python -m src.cli_text_clean orders.csv --columns vendor,product --apply

# Exclude a free-text notes column from cleaning
python -m src.cli_text_clean tickets.csv --skip notes --apply

# Save the current settings as a profile and reload it later
python -m src.cli_text_clean messy.csv --preset minimal --case upper --save-config my.json
python -m src.cli_text_clean other.csv --config my.json --apply
