# CLI Reference

Complete command-line reference for the DataTools bundle.

DataTools ships two CLI modules so each script can be invoked independently:

| Module | Command | Purpose |
|---|---|---|
| `src.cli` | `python -m src.cli INPUT_FILE [OPTIONS]` | Deduplicator (script 01) |
| `src.cli_text_clean` | `python -m src.cli_text_clean INPUT_FILE [OPTIONS]` | Text cleaner (script 02) |

The deduplicator section is below; the text cleaner reference is in [Section: Text Cleaner CLI](#text-cleaner-cli).

## Deduplicator

```
python -m src.cli INPUT_FILE [OPTIONS]
```

## Arguments

| Argument | Required | Description |
|----------|----------|-------------|
| `INPUT_FILE` | Yes | Path to the CSV, delimited text, or Excel file to deduplicate |

## Options

### Core

| Flag | Short | Default | Description |
|------|-------|---------|-------------|
| `--apply` | | `false` | Write output files. Without this flag, only a preview is shown. |
| `--output` | `-o` | `{input}_deduplicated.csv` | Output file path. |

### Column Selection

| Flag | Short | Default | Description |
|------|-------|---------|-------------|
| `--subset` | `-s` | auto-detect | Comma-separated columns to match on. When omitted, columns are auto-detected by name pattern (email, phone, name, address). |
| `--key` | `-k` | none | Comma-separated strong-key columns. Each becomes an independent exact-match strategy. Use for identifiers like `fb_id`, `ein`, `sku`. |

### Fuzzy Matching

| Flag | Short | Default | Description |
|------|-------|---------|-------------|
| `--fuzzy` | | none | Comma-separated columns to fuzzy-match. Other columns in the strategy use exact matching. |
| `--algorithm` | `-a` | `jaro_winkler` | Fuzzy algorithm: `levenshtein`, `jaro_winkler`, or `token_set_ratio`. |
| `--threshold` | `-t` | `85` | Similarity threshold 0-100. Lower values find more matches but increase false positives. |

### Normalization

| Flag | Short | Default | Description |
|------|-------|---------|-------------|
| `--normalize` | | auto-detect | Column normalizers as `col:type` pairs, comma-separated. Types: `email`, `phone`, `name`, `address`, `string`. |

**Normalizer details:**

| Type | What it does | Example |
|------|-------------|---------|
| `email` | Lowercase, strip Gmail dots, strip `+tag` suffixes | `John.Doe+tag@gmail.com` → `johndoe@gmail.com` |
| `phone` | Parse to E.164 format; fallback: digits only | `(555) 123-4567` → `+15551234567` |
| `name` | Strip titles (Dr., Mr.) and suffixes (Jr., PhD), case-fold | `Dr. John Smith Jr.` → `john smith` |
| `address` | USPS abbreviations (Street→St, Avenue→Ave), case-fold | `123 Main Street, Suite 4` → `123 main st ste 4` |
| `string` | Trim, collapse whitespace, case-fold | ` HELLO WORLD ` → `hello world` |

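
To make the `email` and `string` rows of the normalizer table concrete, here is a minimal Python sketch. It is not the bundle's implementation: the function names are hypothetical, and it assumes that `+tag` stripping applies to every domain while dot stripping applies only to the Gmail domains listed in the code.

```python
import re


def normalize_string(value: str) -> str:
    """Trim, collapse internal whitespace runs, and case-fold."""
    return re.sub(r"\s+", " ", value.strip()).casefold()


def normalize_email(value: str) -> str:
    """Lowercase, drop a +tag suffix, and drop dots for Gmail addresses."""
    cleaned = value.strip().lower()
    local, sep, domain = cleaned.rpartition("@")
    if not sep:
        return cleaned  # no "@" present: return the lowercased text unchanged
    local = local.split("+", 1)[0]
    if domain in {"gmail.com", "googlemail.com"}:  # assumed domain list
        local = local.replace(".", "")
    return f"{local}@{domain}"


print(normalize_email("John.Doe+tag@gmail.com"))
print(normalize_string("  HELLO   WORLD "))
```

With these definitions, the two calls reproduce the `email` and `string` examples from the table.
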
### Survivor Selection

| Flag | Short | Default | Description |
|------|-------|---------|-------------|
| `--survivor` | | `first` | Which row to keep per duplicate group: `first`, `last`, `most-complete`, or `most-recent`. |
| `--date-column` | | none | Date column for the `most-recent` rule. |
| `--merge` | | `false` | Fill missing fields in the surviving row from removed duplicates. |

**Survivor rules:**

| Rule | Behavior |
|------|----------|
| `first` | Keep the first row encountered (lowest row number) |
| `last` | Keep the last row encountered (highest row number) |
| `most-complete` | Keep the row with the fewest blank/empty cells |
| `most-recent` | Keep the row with the latest date (requires `--date-column`) |

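
The four survivor rules can be sketched in a few lines of plain Python. This is an illustration only: `pick_survivor` and `is_blank` are hypothetical names, rows are modelled as dictionaries, and breaking ties in favour of the earliest row is an assumption the reference does not confirm.

```python
from datetime import date


def is_blank(value) -> bool:
    """Treat None and whitespace-only strings as blank (assumed definition)."""
    return value is None or str(value).strip() == ""


def pick_survivor(rows, rule="first", date_column=None):
    """Return the index within `rows` of the row to keep."""
    indexes = range(len(rows))
    if rule == "first":
        return 0
    if rule == "last":
        return len(rows) - 1
    if rule == "most-complete":
        # Fewest blank cells wins; min() keeps the earliest row on ties.
        return min(indexes, key=lambda i: sum(is_blank(v) for v in rows[i].values()))
    if rule == "most-recent":
        if date_column is None:
            raise ValueError("most-recent requires a date column")
        return max(indexes, key=lambda i: rows[i][date_column])
    raise ValueError(f"unknown survivor rule: {rule}")


group = [
    {"name": "John Smith", "email": "", "updated_at": date(2024, 5, 1)},
    {"name": "Jon Smith", "email": "john@example.com", "updated_at": date(2023, 1, 9)},
]
print(pick_survivor(group, "most-complete"))
print(pick_survivor(group, "most-recent", date_column="updated_at"))
```
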
### Interactive Review

| Flag | Short | Default | Description |
|------|-------|---------|-------------|
| `--review` | | `false` | Interactively review each match group. For each group, choose: merge (y), keep both (n), or skip remaining (s). |

### Configuration

| Flag | Short | Default | Description |
|------|-------|---------|-------------|
| `--config` | | none | Load all settings from a saved JSON config file. |
| `--save-config` | | none | Save current settings to a JSON config file for reuse. |

### File Handling

| Flag | Short | Default | Description |
|------|-------|---------|-------------|
| `--sheet` | | first sheet | Excel sheet name or 0-based index. Ignored for CSV files. |
| `--encoding` | | auto-detect | Override auto-detected file encoding (e.g., `utf-8`, `windows-1252`). |
| `--header-row` | | auto-detect | 0-based row index for the header row. |

---

## Recipes

### 1. Basic Dedup (Auto-Detect)

Let the engine detect email, phone, name, and address columns automatically.

```bash
# Preview
python -m src.cli customers.csv

# Apply
python -m src.cli customers.csv --apply
```

The engine scans column names for patterns like `email`, `phone`, `name`, `address` and builds strategies automatically. Strong keys (email, phone) become standalone strategies; weak keys (name, address) are paired with strong keys.

### 2. Fuzzy Name Matching

Match rows where names are similar but not identical. This catches typos, nickname variations, and formatting differences.

```bash
# Fuzzy-match on the "name" column at 80% similarity
python -m src.cli customers.csv --fuzzy name --threshold 80 --apply

# Fuzzy-match on multiple columns
python -m src.cli customers.csv --fuzzy name,address --threshold 85 --apply

# Use Levenshtein distance instead of Jaro-Winkler
python -m src.cli customers.csv --fuzzy name --algorithm levenshtein --threshold 80 --apply
```

**Algorithm comparison:**
- `jaro_winkler` (default): best for short strings like names; weights early characters more heavily
- `levenshtein`: edit-distance ratio; works well for typos and transpositions
- `token_set_ratio`: best for addresses and long strings; ignores word order

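
To show how a 0-100 threshold interacts with an edit-distance score, here is a self-contained sketch of a Levenshtein similarity. The scaling formula (one minus distance over the longer length) is an assumption chosen for illustration; the library the engine actually uses may define its ratio differently, so the scores are not guaranteed to match the tool's.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (free when equal)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Scale edit distance to a 0-100 score (assumed formula)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


def is_match(a: str, b: str, threshold: float = 85.0) -> bool:
    """A pair matches when its score reaches the threshold."""
    return similarity(a, b) >= threshold


print(round(similarity("john smith", "jon smith"), 1))
print(is_match("john smith", "jon smith", threshold=95))
```

Under this formula, "john smith" and "jon smith" differ by one edit over ten characters, so the pair passes the default threshold of 85 but fails at 95. That is the trade-off the `--threshold` description refers to.
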
### 3. Custom Strong Keys

Use specific identifier columns to find exact duplicates.

```bash
# Deduplicate by Facebook ID
python -m src.cli donors.csv --key fb_id --apply

# Multiple strong keys (each is independent; matched with OR)
python -m src.cli donors.csv --key fb_id,ein --apply
```

Strong keys are OR'd: a match on `fb_id` alone OR `ein` alone marks rows as duplicates.

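
One way to picture OR'd strong keys is as a union-find over row indexes: any shared non-blank value in any key column joins two rows into the same group. The sketch below is hypothetical (`group_by_any_key` is not a DataTools function), and it assumes that blank values never match and that matches chain transitively, neither of which this reference states.

```python
def group_by_any_key(rows, keys):
    """Group row indexes that share a non-blank value in at least one key."""
    parent = list(range(len(rows)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps trees shallow
            i = parent[i]
        return i

    for key in keys:
        first_seen = {}
        for i, row in enumerate(rows):
            value = row.get(key)
            if value in (None, ""):
                continue  # assumption: blanks never match each other
            if value in first_seen:
                parent[find(i)] = find(first_seen[value])
            else:
                first_seen[value] = i

    groups = {}
    for i in range(len(rows)):
        groups.setdefault(find(i), []).append(i)
    return [members for members in groups.values() if len(members) > 1]


donors = [
    {"fb_id": "100", "ein": "12-3456789"},
    {"fb_id": "100", "ein": ""},
    {"fb_id": "", "ein": "12-3456789"},
    {"fb_id": "200", "ein": "98-7654321"},
]
print(group_by_any_key(donors, ["fb_id", "ein"]))
```

Here rows 0 and 1 share `fb_id`, rows 0 and 2 share `ein`, and row 3 shares nothing, so the first three rows form one group.
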
### 4. Merge Mode

Keep the most complete row and fill any remaining blanks from the duplicates.

```bash
# Most complete row + merge missing fields
python -m src.cli contacts.csv --survivor most-complete --merge --apply

# Keep most recent row and merge
python -m src.cli contacts.csv --survivor most-recent --date-column updated_at --merge --apply
```

**How merge works:** The survivor row keeps all its non-empty fields. For any blank/null fields, the engine fills from the removed rows (in row order). The result is a single row with maximum data retention.

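
The merge rule described above can be sketched as follows. The names `merge_group` and `is_blank` are hypothetical and rows are modelled as dictionaries; the real engine works on DataFrames and may define "blank" differently.

```python
def is_blank(value) -> bool:
    """Treat None and whitespace-only strings as blank (assumed definition)."""
    return value is None or str(value).strip() == ""


def merge_group(survivor, removed):
    """Fill the survivor's blank fields from removed rows, in row order."""
    merged = dict(survivor)
    for column in list(merged):
        if not is_blank(merged[column]):
            continue  # the survivor's own data always wins
        for row in removed:
            candidate = row.get(column)
            if not is_blank(candidate):
                merged[column] = candidate  # first non-blank donor wins
                break
    return merged


survivor = {"name": "John Smith", "email": "", "phone": "(555) 123-4567"}
removed = [
    {"name": "Jon Smith", "email": "", "phone": "555-123-4567"},
    {"name": "J. Smith", "email": "john@example.com", "phone": ""},
]
print(merge_group(survivor, removed))
```
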
### 5. Multi-Column Subset

Match on a specific combination of columns rather than auto-detecting.

```bash
# Exact match on email + phone only
python -m src.cli customers.csv --subset email,phone --apply

# Mix exact and fuzzy within a subset
python -m src.cli customers.csv --subset email,name --fuzzy name --threshold 85 --apply
```

When using `--subset`, all listed columns must match (AND logic) for a pair to be considered duplicates.

### 6. Save and Load Config Profiles

Save your settings for repeatable runs on similar files.

```bash
# Save settings to a file
python -m src.cli customers.csv --fuzzy name --threshold 80 --merge \
  --survivor most-complete --save-config customer_dedup.json

# Load saved settings
python -m src.cli new_customers.csv --config customer_dedup.json --apply
```

Config files are JSON. Example:

```json
{
  "strategies": [],
  "survivor_rule": "most_complete",
  "merge": true,
  "default_algorithm": "jaro_winkler",
  "default_threshold": 80.0,
  "fuzzy_columns": ["name"]
}
```

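
If you generate or edit profiles by hand, a quick sanity check before passing them to `--config` can catch typos. The sketch below is an assumption-laden illustration: `load_config` is a hypothetical helper, the key names come from the example above, the fallback values come from the documented CLI defaults, and whether the real loader fills in missing keys this way is not confirmed.

```python
import json

# Fallbacks taken from the documented CLI defaults (assumed to apply to configs).
DEFAULTS = {
    "default_algorithm": "jaro_winkler",
    "default_threshold": 85.0,
    "merge": False,
}
ALGORITHMS = {"levenshtein", "jaro_winkler", "token_set_ratio"}


def load_config(text: str) -> dict:
    """Parse a saved profile, overlay it on the defaults, and validate it."""
    config = {**DEFAULTS, **json.loads(text)}
    if config["default_algorithm"] not in ALGORITHMS:
        raise ValueError(f"unknown algorithm: {config['default_algorithm']}")
    if not 0 <= float(config["default_threshold"]) <= 100:
        raise ValueError("default_threshold must be between 0 and 100")
    return config


profile = '{"survivor_rule": "most_complete", "merge": true, "default_threshold": 80.0}'
settings = load_config(profile)
print(settings["default_algorithm"], settings["default_threshold"])
```
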
### 7. Interactive Review

Step through each match group and decide whether to merge.

```bash
python -m src.cli customers.csv --review --apply
```

For each group, the CLI displays both rows side-by-side and prompts:

```
============================================================
Match Group 1 — Confidence: 92.3%
Matched on: name, phone
============================================================

Row 1:
name: John Smith
email: john@example.com
phone: (555) 123-4567

Row 2:
name: Jon Smith
email:
phone: 555-123-4567

[y] Merge [n] Keep both [s] Skip remaining:
```

- **y**: accept the match; merge or remove the duplicate
- **n**: reject the match; keep both rows
- **s**: skip all remaining groups (both rows are kept in each)

### 8. Excel Files and Multi-Sheet

Work with Excel files directly; no CSV conversion is needed.

```bash
# Deduplicate first sheet (default)
python -m src.cli data.xlsx --apply

# Specify sheet by name
python -m src.cli data.xlsx --sheet "Sales Data" --apply

# Specify sheet by index (0-based)
python -m src.cli data.xlsx --sheet 1 --apply
```

Output is CSV by default, even for Excel input. To write Excel output, pass an `.xlsx` path to `-o`:

```bash
python -m src.cli data.xlsx -o cleaned.xlsx --apply
```

---

## Auto-Detection Details

When no `--subset` or `--fuzzy` flags are provided, the engine scans column names and builds strategies:

| Column pattern | Detection regex | Algorithm | Threshold | Normalizer | Key type |
|---------------|----------------|-----------|-----------|------------|----------|
| Email | `e[-_]?mail` | exact | 100% | email | strong |
| Phone | `phone\|telephone\|mobile\|cell` | exact | 100% | phone | strong |
| Name | `^(name\|full_name\|customer_name\|...)$` | jaro_winkler | 85% | name | weak |
| Address | `address\|street\|addr` | token_set_ratio | 80% | address | weak |

**Strategy building rules:**
- Strong keys → standalone OR strategies (email match alone is enough)
- Weak keys → paired with each strong key via AND (name match requires email or phone match too)
- No strong keys found → weak keys promoted to standalone
- No patterns matched → exact match on all columns (equivalent to `drop_duplicates`)

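
The detection table and the strategy building rules can be sketched together in Python. This is a hypothetical illustration, not the engine's code: case-insensitive matching and first-match-wins ordering are assumptions, and because the name pattern is truncated in the table (`...`), only the three alternatives it shows are included.

```python
import re

# Patterns copied from the table; the name list is truncated in the source.
PATTERNS = [
    ("email", re.compile(r"e[-_]?mail", re.IGNORECASE), "strong"),
    ("phone", re.compile(r"phone|telephone|mobile|cell", re.IGNORECASE), "strong"),
    ("name", re.compile(r"^(name|full_name|customer_name)$", re.IGNORECASE), "weak"),
    ("address", re.compile(r"address|street|addr", re.IGNORECASE), "weak"),
]


def classify(column):
    """Return (kind, key_type) for the first matching pattern, else None."""
    for kind, pattern, key_type in PATTERNS:
        if pattern.search(column):
            return kind, key_type
    return None


def build_strategies(columns):
    """Return strategies; each is a tuple of columns that must all match."""
    strong, weak = [], []
    for column in columns:
        hit = classify(column)
        if hit is not None:
            (strong if hit[1] == "strong" else weak).append(column)
    if strong:
        # Strong keys stand alone; each weak key is paired with each strong key.
        return [(s,) for s in strong] + [(s, w) for w in weak for s in strong]
    if weak:
        return [(w,) for w in weak]  # no strong keys: promote weak keys
    return [tuple(columns)]  # nothing detected: exact match on all columns


print(build_strategies(["Email", "phone", "name", "notes"]))
```

A row pair counts as a duplicate when any one strategy matches (OR across strategies) and every column inside that strategy matches (AND within a strategy).
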
## Output Files

When `--apply` is set, three files are written:

| File | Description |
|------|-------------|
| `{stem}_deduplicated.csv` | Cleaned DataFrame with duplicates removed |
| `{stem}_removed.csv` | Rows that were removed |
| `{stem}_match_groups.csv` | Audit trail with `_group_id`, `_is_survivor`, `_confidence`, `_matched_on`, `_original_row`, plus all original columns |

## Logging

Every run writes a timestamped log to `logs/dedup_YYYYMMDD_HHMMSS.log` with full debug-level details: strategies used, pair comparisons, survivor decisions, and merge actions.

---

# Text Cleaner CLI

Character-level hygiene for CSV / Excel files: whitespace trim and collapse, smart-character folding, Unicode normalization, BOM strip, control-char strip, line-ending normalization, optional case conversion. See TECHNICAL.md Section 10.2 for the full functional spec.

```
python -m src.cli_text_clean INPUT_FILE [OPTIONS]
```

## Arguments

| Argument | Required | Description |
|----------|----------|-------------|
| `INPUT_FILE` | Yes | Path to the CSV, TSV, or Excel file to clean |

## Options

### Core

| Flag | Short | Default | Description |
|------|-------|---------|-------------|
| `--apply` | | `false` | Write output files. Without this flag, only a preview is shown. |
| `--output` | `-o` | `{input}_cleaned.csv` | Output file path. |
| `--preset` | | `excel-hygiene` | Preset bundle of safe defaults. See [Presets](#presets). |

### Scope

| Flag | Default | Description |
|------|---------|-------------|
| `--columns` | all string columns | Comma-separated columns to clean. |
| `--skip` | none | Comma-separated columns to skip even if they look like text. Useful for free-text notes columns you don't want touched. |

### Per-operation toggles

These override the active preset.

| Flag | Effect |
|------|--------|
| `--no-trim` | Disable leading/trailing whitespace strip |
| `--no-collapse` | Disable internal whitespace collapse |
| `--no-nfc` | Disable Unicode NFC normalization |
| `--nfkc` | Enable NFKC compatibility fold (lossy: `①` → `1`, `ﬁ` → `fi`) |
| `--no-smart-chars` | Disable smart-character folding (curly quotes, em/en-dash, NBSP, ellipsis) |
| `--no-zero-width` | Disable zero-width / invisible character strip |
| `--no-bom` | Disable leading BOM strip |
| `--no-control` | Disable control-character strip |
| `--no-line-endings` | Disable line-ending normalization |

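
Several of these operations map directly onto Python's standard `unicodedata` module. The sketch below shows the difference between NFC and NFKC and gives one plausible shape for the strip operations. The helper names, the exact set of zero-width code points, and the choice to keep tab, LF, and CR during the control strip are assumptions; TECHNICAL.md Section 10.2 is the authority on what the tool actually removes.

```python
import unicodedata

# Assumed set of invisible characters; the real list may differ.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


def strip_bom(text: str) -> str:
    """Remove a single leading byte-order mark, if present."""
    return text[1:] if text.startswith("\ufeff") else text


def strip_zero_width(text: str) -> str:
    return "".join(ch for ch in text if ch not in ZERO_WIDTH)


def strip_control(text: str) -> str:
    """Drop control (Cc) characters, keeping tab, LF, and CR."""
    return "".join(
        ch for ch in text
        if ch in "\t\n\r" or unicodedata.category(ch) != "Cc"
    )


def normalize_line_endings(text: str) -> str:
    return text.replace("\r\n", "\n").replace("\r", "\n")


# NFC composes "e" + combining acute into one code point and is lossless.
composed = unicodedata.normalize("NFC", "e\u0301")
# NFKC also folds compatibility characters, which loses information.
circled = unicodedata.normalize("NFKC", "\u2460")   # circled digit one
ligature = unicodedata.normalize("NFKC", "\ufb01")  # "fi" ligature
print(len(composed), circled, ligature)
```

NFC leaves the ligature and the circled digit untouched, which is why the NFKC fold is opt-in and described as lossy.
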
### Case conversion

| Flag | Forms | Description |
|------|-------|-------------|
| `--case` | `upper`, `lower`, `title`, `sentence` | Apply this case to every selected column |
| `--case` | `mode:col[,mode:col]` | Per-column case (e.g., `--case title:name,upper:code`) |

Title case preserves all-caps tokens (`USA` stays `USA`) and lowercases mid-string particles (`of`, `and`, `the`, etc.).

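
The title-case rule and the two `--case` forms can be sketched as below. Both functions are hypothetical: the particle set contains only the three words this reference names (the full list is elided with "etc."), and the assumption that a particle in first position is still capitalized follows from the phrase "mid-string".

```python
PARTICLES = {"of", "and", "the"}  # only the particles named in this reference


def smart_title(text: str) -> str:
    """Title-case text, keeping acronyms and lowercasing mid-string particles."""
    result = []
    for index, word in enumerate(text.split(" ")):
        if len(word) > 1 and word.isupper():
            result.append(word)  # all-caps token such as USA stays as-is
        elif index > 0 and word.lower() in PARTICLES:
            result.append(word.lower())  # particle that is not the first word
        else:
            result.append(word[:1].upper() + word[1:].lower())
    return " ".join(result)


def parse_case(spec: str):
    """Split a --case value into (global_mode, {column: mode})."""
    if ":" not in spec:
        return spec, {}  # single mode applied to every selected column
    per_column = {}
    for pair in spec.split(","):
        mode, column = pair.split(":", 1)
        per_column[column.strip()] = mode.strip()
    return None, per_column


print(smart_title("bank of america USA"))
print(parse_case("title:name,upper:code"))
```
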
### Audit and config

| Flag | Default | Description |
|------|---------|-------------|
| `--full-changelog` | `false` | Write every cell change to the audit CSV. By default, only the first 1000 changes are written. |
| `--config` | none | Load options from a saved JSON config file. |
| `--save-config` | none | Save the current options to a JSON config file. |

### File format / encoding

| Flag | Default | Description |
|------|---------|-------------|
| `--sheet` | `0` | Excel sheet name or 0-based index. |
| `--encoding` | auto-detect | Override auto-detected file encoding. |
| `--header-row` | auto-detect | 0-based row index for the header. |

## Presets

| Preset | What it does |
|---|---|
| `minimal` | Trim + collapse whitespace only. Nothing else. |
| `excel-hygiene` (default) | Trim, collapse, NFC, smart-character fold, zero-width strip, BOM strip, control strip, line-ending normalize. NFKC off. |
| `paranoid` | All of `excel-hygiene` plus NFKC compatibility fold (lossy). |

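
One way to think about presets and per-operation toggles is as a set of operation names that the toggles add to or remove from. The sketch below is illustrative only: the operation names and `resolve_ops` are hypothetical and do not describe the fields of the tool's actual options object.

```python
BASE = {"trim", "collapse"}
HYGIENE = BASE | {"nfc", "smart_chars", "zero_width", "bom", "control", "line_endings"}

PRESETS = {
    "minimal": frozenset(BASE),
    "excel-hygiene": frozenset(HYGIENE),
    "paranoid": frozenset(HYGIENE | {"nfkc"}),
}


def resolve_ops(preset="excel-hygiene", enable=(), disable=()):
    """Start from a preset, then apply explicit toggles on top of it."""
    ops = set(PRESETS[preset])
    ops |= set(enable)    # e.g. --nfkc
    ops -= set(disable)   # e.g. --no-trim, --no-bom
    return ops


# Roughly what `--preset minimal --nfkc` would select under this model.
print(sorted(resolve_ops("minimal", enable=["nfkc"])))
```
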
## Output Files

When `--apply` is set:

| File | Description |
|------|-------------|
| `{stem}_cleaned.csv` | Cleaned DataFrame |
| `{stem}_changes.csv` | Per-cell audit: `row`, `column`, `old`, `new`, `ops_applied` (capped to 1000 rows by default; use `--full-changelog` for all) |

A timestamped log is always written to `logs/text_clean_YYYYMMDD_HHMMSS.log`.

## Recipes

```bash
# Preview what would change with the safe defaults
python -m src.cli_text_clean messy.csv

# Apply the safe defaults
python -m src.cli_text_clean messy.csv --apply

# Just the basics: only trim and collapse, leave Unicode/quotes alone
python -m src.cli_text_clean messy.csv --preset minimal --apply

# Title-case the name column, upper-case the SKU column, leave the case of other columns unchanged
python -m src.cli_text_clean people.csv --case title:name,upper:sku --apply

# Clean only specific columns
python -m src.cli_text_clean orders.csv --columns vendor,product --apply

# Exclude a free-text notes column from cleaning
python -m src.cli_text_clean tickets.csv --skip notes --apply

# Save the current settings as a profile and reload it later
python -m src.cli_text_clean messy.csv --preset minimal --case upper --save-config my.json
python -m src.cli_text_clean other.csv --config my.json --apply
```