feat: implement text cleaner (script 02) with CLI, GUI, and tests

Builds 02_text_cleaner.py from stub to working: character-level hygiene
for CSV/Excel inputs covering trim, whitespace collapse, smart-character
folding, Unicode NFC/NFKC, BOM strip, zero-width strip, control-char
strip, line-ending normalization, and per-column case conversion. Three
presets (minimal/excel-hygiene/paranoid) keep the buyer surface small.

- src/core/text_clean.py: pure helpers + CleanOptions/CleanResult +
  clean_dataframe with dtype-safe column selection
- src/cli_text_clean.py: Typer CLI mirroring the dedup CLI shape
  (dry-run by default, --apply writes cleaned + changes audit, JSON
  config save/load)
- src/gui/pages/2_Text_Cleaner.py: real Streamlit page with preset
  picker, advanced toggles, preview, before/after metrics, and three
  download buttons
- tests/test_text_clean.py + test_cli_text_clean.py: 92 new tests
  covering edge cases E1-E50 from the spec
- samples/messy_text.csv: demo dataset surfacing UC1, UC3, UC6, UC10
  in 10 rows
- test-cases/uc16-uc26 + ec05-ec09: per-use-case and per-edge-case
  fixtures

Docs: TECHNICAL.md §10.2 (full Tier 1/2/3 spec), DECISIONS.md v1.7
entry locking the spec, CLI-REFERENCE.md gains the text cleaner
section, README.md gains a top-level Text Cleaner block, USER-GUIDE.md
status row 02 promoted Skeleton -> Working.

200/200 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 15:14:15 +00:00
parent b2ca04e6f4
commit 54f92ae47e
28 changed files with 2093 additions and 58 deletions


@@ -1,6 +1,13 @@
# DataTools Deduplicator
# DataTools
Find and remove duplicate rows in CSV, delimited text, and Excel files — with fuzzy matching, smart normalization, and interactive review.
A bundle of Python data-cleaning tools for CSV and Excel files. Two scripts ship today; more are in development.
| # | Tool | What it does |
|---|---|---|
| 01 | **Deduplicator** | Find and remove duplicate rows with exact + fuzzy matching, smart normalization, and interactive review. |
| 02 | **Text Cleaner** | Trim whitespace, fold smart quotes, strip invisible / control characters, normalize Unicode, normalize line endings, optional case conversion. |
## Deduplicator
## Features
@@ -107,6 +114,41 @@ When `--apply` is used, three files are produced:
| `{input}_removed.csv` | Rows that were removed |
| `{input}_match_groups.csv` | Audit trail: group ID, confidence, matched columns, survivor flag |
## Text Cleaner
Character-level hygiene for messy CSV / Excel input. Solves the dirty-data failure modes that silently break VLOOKUPs, dedup runs, and downstream imports:
- Trailing / leading whitespace and tabs in cells
- Non-breaking spaces (`U+00A0`) hiding inside text where regular spaces should be
- Smart quotes pasted from Word (`“` `”` `‘` `’` → `"` `"` `'` `'`)
- Em / en dashes, ellipsis, other typographic Unicode
- Zero-width and bidi-mark characters (`U+200B`, `U+200C`, `U+200D`, etc.)
- BOMs from Excel "Save As CSV UTF-8"
- Mixed line endings (`\r\n`, bare `\r`) inside multi-line cells
- Control characters (`U+0000`-`U+001F` minus `\t \n \r`)
- Optional Unicode NFC / NFKC normalization
- Optional per-column case conversion (UPPER / lower / smart Title / Sentence)
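To see why these characters break lookups, here is a small illustration using only the Python standard library. It is not DataTools code; the helper name `describe_non_ascii` is made up for this example. Two vendor names that render identically compare as unequal because one contains a non-breaking space and a zero-width space:

```python
import unicodedata

clean = "ACME Corp"
# Hypothetical dirty value: NBSP instead of a space, plus a trailing zero-width space.
dirty = "ACME\u00a0Corp\u200b"


def describe_non_ascii(text: str) -> list[str]:
    """List the Unicode name of every non-ASCII character in text."""
    return [unicodedata.name(ch, f"U+{ord(ch):04X}") for ch in text if ord(ch) > 127]


print(clean == dirty)             # False
print(describe_non_ascii(dirty))  # ['NO-BREAK SPACE', 'ZERO WIDTH SPACE']
```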
```bash
# Preview what would change (dry-run)
python -m src.cli_text_clean samples/messy_text.csv
# Apply the safe defaults
python -m src.cli_text_clean samples/messy_text.csv --apply
# Title-case the name column, upper-case the SKU column
python -m src.cli_text_clean products.csv --case title:name,upper:sku --apply
# Just trim and collapse — nothing fancy
python -m src.cli_text_clean messy.csv --preset minimal --apply
```
Three presets: `minimal` (trim + collapse only), `excel-hygiene` (default; everything safe ON), `paranoid` (adds lossy NFKC fold).
Outputs `{input}_cleaned.csv` plus a per-cell `{input}_changes.csv` audit (row, column, old, new, ops applied).
See [docs/CLI-REFERENCE.md](docs/CLI-REFERENCE.md#text-cleaner-cli) for every flag.
## Documentation
- [CLI Reference](docs/CLI-REFERENCE.md) — every flag with examples and recipe sections


@@ -1,6 +1,17 @@
# CLI Reference
Complete command-line reference for the DataTools Deduplicator.
Complete command-line reference for the DataTools bundle.
DataTools ships two CLI modules so each script can be invoked independently:
| Module | Command | Purpose |
|---|---|---|
| `src.cli` | `python -m src.cli INPUT_FILE [OPTIONS]` | Deduplicator (script 01) |
| `src.cli_text_clean` | `python -m src.cli_text_clean INPUT_FILE [OPTIONS]` | Text cleaner (script 02) |
The deduplicator section is below; the text cleaner reference is in [Section: Text Cleaner CLI](#text-cleaner-cli).
## Deduplicator
```
python -m src.cli INPUT_FILE [OPTIONS]
@@ -282,3 +293,122 @@ When `--apply` is set, three files are written:
## Logging
Every run writes a timestamped log to `logs/dedup_YYYYMMDD_HHMMSS.log` with full debug-level details: strategies used, pair comparisons, survivor decisions, and merge actions.
---
# Text Cleaner CLI
Character-level hygiene for CSV / Excel files: whitespace trim and collapse, smart-character folding, Unicode normalization, BOM strip, control-char strip, line-ending normalization, optional case conversion. See TECHNICAL.md Section 10.2 for the full functional spec.
```
python -m src.cli_text_clean INPUT_FILE [OPTIONS]
```
## Arguments
| Argument | Required | Description |
|----------|----------|-------------|
| `INPUT_FILE` | Yes | Path to the CSV, TSV, or Excel file to clean |
## Options
### Core
| Flag | Short | Default | Description |
|------|-------|---------|-------------|
| `--apply` | | `false` | Write output files. Without this flag, only a preview is shown. |
| `--output` | `-o` | `{input}_cleaned.csv` | Output file path. |
| `--preset` | | `excel-hygiene` | Preset bundle of safe defaults. See [Presets](#presets). |
### Scope
| Flag | Default | Description |
|------|---------|-------------|
| `--columns` | all string columns | Comma-separated columns to clean. |
| `--skip` | none | Comma-separated columns to skip even if they look like text. Useful for free-text notes columns you don't want touched. |
### Per-operation toggles
These override the active preset.
| Flag | Effect |
|------|--------|
| `--no-trim` | Disable leading/trailing whitespace strip |
| `--no-collapse` | Disable internal whitespace collapse |
| `--no-nfc` | Disable Unicode NFC normalization |
| `--nfkc` | Enable NFKC compatibility fold (lossy: `①` → `1`, `ﬁ` → `fi`) |
| `--no-smart-chars` | Disable smart-character folding (curly quotes, em/en-dash, NBSP, ellipsis) |
| `--no-zero-width` | Disable zero-width / invisible character strip |
| `--no-bom` | Disable leading BOM strip |
| `--no-control` | Disable control-character strip |
| `--no-line-endings` | Disable line-ending normalization |
### Case conversion
| Flag | Forms | Description |
|------|-------|-------------|
| `--case` | `upper`, `lower`, `title`, `sentence` | Apply this case to every selected column |
| `--case` | `mode:col[,mode:col]` | Per-column case (e.g., `--case title:name,upper:code`) |
Title case preserves all-caps tokens (`USA` stays `USA`) and lowercases mid-string particles (`of`, `and`, `the`, etc.).
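The title-case rule can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `src/core/text_clean.py`: the function name `smart_title` and the particle list are hypothetical, and the real tool may tokenize differently.

```python
# Hypothetical particle list; the tool's actual list is not reproduced here.
PARTICLES = {"of", "and", "the", "in", "on", "for", "a", "an"}


def smart_title(text: str) -> str:
    """Title-case text, keeping ALL-CAPS tokens and lowercasing mid-string particles."""
    out = []
    for i, word in enumerate(text.split(" ")):
        if len(word) > 1 and word.isupper():
            out.append(word)                               # acronym: keep as-is
        elif i > 0 and word.lower() in PARTICLES:
            out.append(word.lower())                       # particle, not the first word
        else:
            out.append(word[:1].upper() + word[1:].lower())
    return " ".join(out)


print(smart_title("bank of the USA"))  # Bank of the USA
```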
### Audit and config
| Flag | Default | Description |
|------|---------|-------------|
| `--full-changelog` | `false` | Write every cell change to the audit CSV (default caps to first 1000). |
| `--config` | none | Load options from a saved JSON config file. |
| `--save-config` | none | Save the current options to a JSON config file. |
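As a rough picture of what a save/load round trip involves, the sketch below serializes a small options object to JSON and reads it back. The `Options` dataclass is a hypothetical stand-in, not the tool's `CleanOptions`; only a few field names are borrowed from the CLI source, and the on-disk layout of real config files may differ.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class Options:
    """Hypothetical stand-in for the tool's options object."""
    trim: bool = True
    collapse_whitespace: bool = True
    nfkc: bool = False
    case_columns: dict[str, str] = field(default_factory=dict)


def save(options: Options, path: Path) -> None:
    """Write the options as pretty-printed JSON."""
    path.write_text(json.dumps(asdict(options), indent=2), encoding="utf-8")


def load(path: Path) -> Options:
    """Rebuild the options object from a JSON file written by save()."""
    return Options(**json.loads(path.read_text(encoding="utf-8")))


with tempfile.TemporaryDirectory() as tmp:
    cfg = Path(tmp) / "my.json"
    original = Options(nfkc=True, case_columns={"name": "title"})
    save(original, cfg)
    restored = load(cfg)

print(restored == original)  # True
```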
### File format / encoding
| Flag | Default | Description |
|------|---------|-------------|
| `--sheet` | `0` | Excel sheet name or 0-based index. |
| `--encoding` | auto-detect | Override auto-detected file encoding. |
| `--header-row` | auto-detect | 0-based row index for the header. |
## Presets
| Preset | What it does |
|---|---|
| `minimal` | Trim + collapse whitespace only. Nothing else. |
| `excel-hygiene` (default) | Trim, collapse, NFC, smart-character fold, zero-width strip, BOM strip, control strip, line-ending normalize. NFKC off. |
| `paranoid` | All of `excel-hygiene` plus NFKC compatibility fold (lossy). |
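One way to picture how presets and per-operation flags interact: start from the preset's on/off table, then let explicit flags win. The sketch below is an assumption about the layering, not the tool's implementation. The helper names `preset_flags` and `resolve` are hypothetical; the operation names follow the option fields used in the CLI source.

```python
PRESET_NAMES = ("minimal", "excel-hygiene", "paranoid")
SAFE_OPS = (
    "nfc", "fold_smart_chars", "strip_zero_width",
    "strip_bom", "strip_control", "normalize_line_endings",
)


def preset_flags(name: str) -> dict[str, bool]:
    """Return the on/off state of each operation for a preset, per the table above."""
    if name not in PRESET_NAMES:
        raise ValueError(f"Unknown preset {name!r}; choose from {', '.join(PRESET_NAMES)}")
    flags = {"trim": True, "collapse_whitespace": True}
    flags.update({op: name != "minimal" for op in SAFE_OPS})
    flags["nfkc"] = name == "paranoid"
    return flags


def resolve(name: str, overrides: dict[str, bool]) -> dict[str, bool]:
    """Start from the preset, then apply explicit overrides on top."""
    return {**preset_flags(name), **overrides}


# Comparable in spirit to: --preset excel-hygiene --no-smart-chars
flags = resolve("excel-hygiene", {"fold_smart_chars": False})
print(flags["fold_smart_chars"], flags["nfc"], flags["nfkc"])  # False True False
```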
## Output Files
When `--apply` is set:
| File | Description |
|------|-------------|
| `{stem}_cleaned.csv` | Cleaned DataFrame |
| `{stem}_changes.csv` | Per-cell audit: `row`, `column`, `old`, `new`, `ops_applied` (capped to 1000 rows by default; use `--full-changelog` for all) |
A timestamped log is always written to `logs/text_clean_YYYYMMDD_HHMMSS.log`.
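The capped audit file can be sketched as below. This is only an illustration: `write_changes` is a hypothetical helper, and the format of the `ops_applied` value (a plain string here) is an assumption rather than the tool's documented format.

```python
import csv
import io
from typing import Iterable, Optional, TextIO


def write_changes(changes: Iterable[tuple], out: TextIO, cap: Optional[int] = 1000) -> int:
    """Write change records as CSV; cap=None writes them all. Returns rows written."""
    writer = csv.writer(out)
    writer.writerow(["row", "column", "old", "new", "ops_applied"])
    written = 0
    for change in changes:
        if cap is not None and written >= cap:
            break
        writer.writerow(change)
        written += 1
    return written


# 1500 made-up change records for a "vendor" column.
changes = [(i, "vendor", "ACME Corp ", "ACME Corp", "trim") for i in range(1500)]
print(write_changes(changes, io.StringIO()))            # 1000
print(write_changes(changes, io.StringIO(), cap=None))  # 1500
```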
## Recipes
```bash
# Preview what would change with the safe defaults
python -m src.cli_text_clean messy.csv
# Apply the safe defaults
python -m src.cli_text_clean messy.csv --apply
# Just the basics — only trim and collapse, leave Unicode/quotes alone
python -m src.cli_text_clean messy.csv --preset minimal --apply
# Title-case the name column, upper-case the SKU column, leave others alone for case
python -m src.cli_text_clean people.csv --case title:name,upper:sku --apply
# Clean only specific columns
python -m src.cli_text_clean orders.csv --columns vendor,product --apply
# Skip a free-text notes column from cleaning
python -m src.cli_text_clean tickets.csv --skip notes --apply
# Save the current settings as a profile and reload it later
python -m src.cli_text_clean messy.csv --preset minimal --case upper --save-config my.json
python -m src.cli_text_clean other.csv --config my.json --apply
```


@@ -250,6 +250,7 @@ Own-domain SEO is treated as a long-term compounding asset (6-18 months to tract
| April 28, 2026 (v1.3) | **Add hosted browser demo as secondary distribution surface and conversion lever** | Direct consequence of Streamlit choice. See Section 5 and BUSINESS.md Section 7. |
| April 28, 2026 (v1.4) | **Re-apply 03/05 script boundary work dropped during v1.3 merge (silent drift recovery)** | Stream B v1.2 content (sharpened 03/05 descriptions in USER-GUIDE, run-order rule, TECHNICAL.md Section 9 boundary spec, RECOVERY.md pointer) was overwritten when Stream A's parallel v1.3 Streamlit work was saved to project. Restoring per the doc's own no-silent-drift rule. 03 owns "what's not there" (missing values, sentinel codes, imputation), 05 owns "what shouldn't be there" (statistical outliers, domain rules, winsorization). 03 runs before 05 because outlier statistics on data containing NaN or sentinel codes are mathematically poisoned. See TECHNICAL.md Section 9. |
| April 28, 2026 (v1.5) | **Add `02_text_cleaner.py` as new script; renumber 02-08 → 03-09** | Audit revealed character-level hygiene (whitespace trimming, multi-space collapse, Unicode normalization, BOM handling, line-ending normalization, special-character handling) had no clear owner. Was implicitly scattered: `01_deduplicator` normalizes internally for matching only (doesn't write back), `02_format_standardizer` (now 03) implies it but its named scope is dates/currencies/names/phones/addresses, `03_missing_value_handler` (now 04) only handles whitespace-only as disguised null. A buyer with trailing-space pollution had no obvious script to run. Per Section 4a (functional scope principle: one-stop shopping for the workflow), this was a real gap. Added as 02 because text cleaning is a pre-processing step that should run before format standardization, missing-value handling, and outlier detection. Kept 01 (deduplicator) at position 1 as the lead/working/marketing-flagship script; numbering does not strictly equal pipeline order, the orchestrator manages execution order. Renumber consequence: TECHNICAL.md Section 9 boundary references updated 03→04, 05→06; orchestrator references updated 08→09. New contested case documented in Section 9.3: whitespace-only cells (02 trims first, leaving empty string; 04 then detects empty strings as disguised null). Master orchestrator now 09. |
| April 29, 2026 (v1.7) | **Adopt `02_text_cleaner.py` Tier 1/2/3 functional spec; lock `excel-hygiene` as default preset** | Promotes character-level hygiene from a stub to a buildable v1 target. Strategic framing: Excel/Power Query/OpenRefine fail this category for non-technical buyers; the gap is "one-click correctness for dirty-CSV failure modes that cause silent VLOOKUP misses." Spec covers 10 toggleable ops (trim, collapse, NFC, smart-char fold, zero-width strip, BOM strip, control strip, line-ending normalize, NFKC opt-in, per-column case), per-column scope control, dry-run-by-default, per-cell change audit, idempotency, three presets (`minimal`/`excel-hygiene`/`paranoid`), and JSON config save/load. Output shape mirrors deduplicator: `{input}_cleaned.csv`, `{input}_changes.csv`, `logs/text_clean_{ts}.log`. Boundary with adjacent scripts re-asserted: 02 trims whitespace-only cells to empty (04 then detects empty as null per Section 9.3); 02 is *write-time* and stays distinct from `01_deduplicator`'s match-time `normalize_string` helper. Smart-character fold defaults ON in `excel-hygiene` because demo value is highest there and dry-run preview makes the change visible before commit. NFKC stays opt-in (lossy). `ftfy` mojibake repair deferred to Tier 2 to avoid the 5MB dep without buyer demand. CLI ships as separate `src/cli_text_clean.py` module per the one-CLI-per-script pattern in TECHNICAL Section 3.2. Full spec in TECHNICAL.md Section 10.2. |
| April 28, 2026 (v1.6) | **Fold conversation-history content into docs: deduplicator functional spec, lead bundle use cases, competitive landscape, full GUI framework comparison matrix, concrete 04/06 boundary examples, expanded Streamlit-to-SaaS reasoning** | None of this represents new decisions; all of it represents prior analysis that lived only in chat history and was at risk of evaporating. Per the doc's own no-silent-drift rule (Section 8) and the v1.4 recovery story, valuable analysis must be promoted to docs to survive. Specifically: TECHNICAL.md gains Section 10 (per-script functional specs, starting with the deduplicator's 36-item tiered spec) which is the buildable target for the v1 launch GUI port; this also makes the gap between "currently working" (exact + basic fuzzy) and "v1 launch best-of-class" (Tier 1) explicit so the docs don't quietly overstate where the code is. Section 9.3 gains three concrete distinguishing examples (bank-export blank fees / $1M outlier / "999=refused") that prove 04 and 06 are distinct concerns. BUSINESS.md gains Section 4a (Lead Bundle Deep Dive: 15 use cases by persona, 6-row competitive landscape table, market gap statement) which feeds landing page copy and demo design. Section 4c gains a 10-dimension scored framework matrix and per-option summaries (locks the rejection reasoning against re-litigation), plus expanded point 4 on Streamlit-to-SaaS migration cost. RECOVERY.md updated to reference Section 10 in rebuild and priority steps. No structural decisions changed; this is pure capture work. |
---


@@ -430,6 +430,81 @@ This section captures the full functional spec for each script, beyond the one-l
35. Schedule / cron integration.
36. Direct Shopify / Klaviyo / Mailchimp API integration to dedupe in place. This would be a real differentiator for the Shopify niche specifically and is probably the right v2 direction if early sales validate the niche.
### 10.2 - 10.9 (Future)
### 10.2 `02_text_cleaner.py` - Character-level hygiene
Functional specs for scripts 02 through 09 to be added when each script enters active build. The deduplicator spec is the template; specs for other scripts should follow the same Tier 1 / Tier 2 / Tier 3 structure with explicit strategic framing (what's the market gap this script fills, given that some of its functionality is available free elsewhere).
**Current implementation status**: Stub only. `src/gui/pages/2_Text_Cleaner.py` is a placeholder UI with disabled controls. No `src/core/text_clean.py`, no CLI, no tests. Tier 1 below is the v1 launch target; nothing in this section is built yet.
**Strategic framing**: Excel and the OS provide effectively nothing here. Find/Replace fixes one character at a time. Power Query's "Clean" strips control chars but ignores BOMs, smart quotes, NBSPs, and zero-width chars. OpenRefine has the operations buried under "Common transforms" where the buyer never finds them. Pandas users run `df.applymap(str.strip)` and miss everything else.
The market gap this script fills: **one-click correctness for the dirty-CSV failure modes that cause "why won't this VLOOKUP match?"** Trailing spaces, NBSP-in-place-of-space, smart quotes pasted from Word, mojibake, BOMs from Excel's "Save As CSV UTF-8". The buyer doesn't know they need this script until it fixes a problem they have spent two hours on. Demo value is high: the before/after diff sells itself.
**Boundary clarification** (cross-references Section 9):
- 02 owns whitespace, Unicode normalization, smart-character folding, BOM strip, line-ending normalization, zero-width strip, control-char strip, case ops. Writes cleaned values back to disk.
- 03 (format standardizer) owns dates, currencies, names, phones, addresses.
- 04 (missing values) owns disguised nulls (`N/A`, `-`, `unknown`, sentinel codes). Whitespace-only cells: 02 trims first to empty string; 04 then detects empty as null (per Section 9.3).
- 01 (deduplicator) has its own `normalize_string` helper for *match-time* case-folding. That is a match-time policy and stays distinct from 02's *write-time* policy. The two will not be merged; 02 may use lower-level helpers but does not aggressively case-fold cleaned output by default.
#### Tier 1: Must-ship for v1 to be best-of-class
**Operations** (each independently toggleable; defaults given for the `excel-hygiene` preset)
1. Whitespace trim - leading/trailing on every cell. Default ON.
2. Internal whitespace collapse - multi-space and tabs-in-cells to single space. Default ON.
3. Unicode NFC normalization - combining-character forms folded to canonical (e.g., `e + U+0301` to single `é`). Default ON.
4. Unicode NFKC normalization - compat fold (`①` to `1`, `ﬁ` to `fi`). Default OFF, lossy, opt-in only. Not part of any preset other than `paranoid`.
5. Smart-character folding - curly quotes to ASCII, em/en-dash to hyphen, ellipsis `…` to `...`, NBSP `U+00A0` to space. Default ON.
6. Zero-width / invisible character strip - `U+200B`, `U+200C`, `U+200D`, `U+2060`, mid-string `U+FEFF`. Default ON.
7. BOM strip - `U+FEFF` at the start of the first cell of the first column (covers the case where the I/O layer didn't catch it). Default ON.
8. Control character strip - `U+0000`-`U+001F` and `U+007F`, *except* preserve `\t`, `\n`, `\r`. Default ON.
9. Line-ending normalization - within multi-line cells, `\r\n` and bare `\r` to `\n`. Default ON.
10. Case conversion - UPPER / lower / Title / Sentence. Default OFF, per-column. Title case is "smart": preserves all-caps tokens (`USA`, `NASA`) and lowercases mid-string particles (`of`, `and`, `the`).
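A minimal sketch of how operations 1-3 and 5-9 might compose for a single cell, using only the standard library. The function name `clean_cell` and the order of the steps are assumptions; the spec above does not state an order. Invisible characters are stripped before NFC here so that a second pass has nothing left to change.

```python
import re
import unicodedata

SMART_CHARS = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en dash, em dash
    "\u2026": "...",                # ellipsis
    "\u00a0": " ",                  # NBSP
})
ZERO_WIDTH = re.compile("[\u200b\u200c\u200d\u2060\ufeff]")
CONTROL = re.compile("[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")  # keeps \t, \n, \r


def clean_cell(text: str) -> str:
    """Apply the default-ON operations to one cell (order is an assumption)."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # 9: line endings
    text = ZERO_WIDTH.sub("", text)                        # 6-7: zero-width chars, BOM
    text = CONTROL.sub("", text)                           # 8: control characters
    text = text.translate(SMART_CHARS)                     # 5: smart-character fold
    text = unicodedata.normalize("NFC", text)              # 3: canonical composition
    text = re.sub(r"[ \t]+", " ", text)                    # 2: collapse spaces and tabs
    return text.strip()                                    # 1: trim


dirty = "\ufeff  \u201cBest\u00a0Pet\u200b  Supplies\u201d\u2026\x07 "
print(clean_cell(dirty))  # "Best Pet Supplies"...
```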
**Scope control**
11. Per-column selection - by default operate on string-typed columns only; numeric / datetime columns pass through untouched. User can pick columns explicitly via `--columns`.
12. Skip-list - exclude specific columns via `--skip` even if they match the string-dtype filter (e.g., free-text notes columns).
**Trust and audit**
13. Dry-run preview by default. Output shows N cells that would change in column X. `--apply` writes. Non-negotiable for trust. Same standard as the deduplicator.
14. Per-cell change log: `{input}_changes.csv` with (row, column, old, new, ops_applied). Capped to first N rows by default to avoid 50MB audit files; `--full-changelog` removes the cap.
15. Three output files on `--apply`: `{input}_cleaned.csv`, `{input}_changes.csv`, `logs/text_clean_{ts}.log`. Mirrors the deduplicator output shape.
16. Original input file is never modified.
17. Idempotency: `clean(clean(x)) == clean(x)` for every individual op and every preset. Asserted as a property test.
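The idempotency requirement in item 17 could be checked along these lines. This sketch uses seeded random strings from the standard library rather than a property-testing framework, and the `clean` function is a toy stand-in (collapse plus trim), not the tool's cleaner; `check_idempotent` is a hypothetical helper.

```python
import random
import re
from typing import Callable


def clean(text: str) -> str:
    """Toy stand-in for a preset: collapse runs of spaces/tabs, then trim."""
    return re.sub(r"[ \t]+", " ", text).strip()


def check_idempotent(fn: Callable[[str], str], samples: int = 500, seed: int = 0) -> None:
    """Assert fn(fn(x)) == fn(x) over randomly generated messy strings."""
    rng = random.Random(seed)
    alphabet = "ab \t\n\u00a0\u200b"
    for _ in range(samples):
        text = "".join(rng.choice(alphabet) for _ in range(rng.randint(0, 12)))
        once = fn(text)
        assert fn(once) == once, f"not idempotent for {text!r}"


check_idempotent(clean)
print("clean() is idempotent on the sampled inputs")
```

A function that keeps changing its own output, such as one that appends a character on every call, fails this check on the first sample.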
**Configuration**
18. Presets: `--preset excel-hygiene` (everything safe ON, NFKC OFF, case OFF), `--preset minimal` (only trim + collapse), `--preset paranoid` (everything including NFKC). Buyers should not have to learn 9 flags. Default preset when no flag given: `excel-hygiene`.
19. Save / load JSON config. Same shape and reuse pattern as `DeduplicationConfig`.
**UX**
20. `--help` written for non-technical users with concrete examples, not a flag dump. Per DECISIONS.md Section 4b.
21. Progress bar for files over ~10K rows.
22. Error messages name the row, column, and value that caused the problem. No raw stack traces.
23. Sample data (`samples/messy_text.csv`) demonstrates: smart quotes from Excel, NBSP-vs-space, BOM, mixed line endings, zero-width chars. The before/after diff is the demo.
#### Tier 2: Worth-considering for v1.1
24. Custom regex find/replace - power-user escape hatch, per-column.
25. Diacritic strip (`José` to `Jose`). Lossy; opt-in only.
26. Mojibake auto-repair - detect `Ã©` to `é` patterns (UTF-8 read as Latin-1 then re-encoded) and fix. Standard tool: `ftfy`. Promote to Tier 1 if early buyers report this.
27. Punctuation normalization - all Unicode dash/quote/space variants folded; runs of punctuation collapsed.
28. Profile detector - scan a file and recommend which ops to enable based on what's actually present. Lowers config friction further.
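For item 26, the simplest form of the repair can be sketched with the standard library alone. This handles only the single case named above (UTF-8 bytes decoded as Latin-1); it is far narrower than what `ftfy` covers and is not a proposal for the Tier 2 implementation. The function name `repair_mojibake` is hypothetical.

```python
def repair_mojibake(text: str) -> str:
    """Undo one round of 'UTF-8 bytes decoded as Latin-1' if the text fits that pattern."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # not this kind of mojibake; leave the value untouched


print(repair_mojibake("Jos\u00c3\u00a9"))  # José
print(repair_mojibake("José"))             # José (already correct, unchanged)
```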
#### Tier 3: Optional / later
29. Locale-aware case conversion (Turkish dotted/dotless `i`, German `ß`).
30. Custom character-class strip rules (regex-class).
31. Streaming / chunked write for very large files (defer until a buyer reports it).
#### Open decisions captured at spec time
- Smart-character folding default ON in `excel-hygiene` accepted as the right tradeoff: highest-impact use case, dry-run preview makes the change visible before commit.
- NFKC stays Tier 1 but OFF by default and excluded from `excel-hygiene`. Lossy by design.
- CLI surface: separate `src/cli_text_clean.py` module, matching the "one CLI binary per script on PATH" pattern in Section 3.2. Not a subcommand on the existing dedup Typer app.
- `ftfy` dependency deferred to Tier 2 (~5MB). Revisit if mojibake reports come in.
### 10.3 - 10.9 (Future)
Functional specs for scripts 03 through 09 to be added when each script enters active build. The deduplicator (10.1) and text cleaner (10.2) specs are the template; specs for other scripts should follow the same Tier 1 / Tier 2 / Tier 3 structure with explicit strategic framing (what's the market gap this script fills, given that some of its functionality is available free elsewhere).


@@ -63,7 +63,7 @@ If you prefer the command line, every script also ships as a CLI tool. See Secti
| # | Script | Purpose | Status |
|---|---|---|---|
| 01 | `01_deduplicator.py` | Smart duplicate removal: exact match + basic fuzzy, configurable subset columns, full logs | Working |
| 02 | `02_text_cleaner.py` | Character-level hygiene: trim leading/trailing whitespace, collapse internal multi-spaces, strip non-printable characters, Unicode normalization (smart quotes, em-dashes, accents), remove zero-width characters, BOM handling, line-ending normalization, case operations | Skeleton |
| 02 | `02_text_cleaner.py` | Character-level hygiene: trim leading/trailing whitespace, collapse internal multi-spaces, strip non-printable characters, Unicode normalization (smart quotes, em-dashes, accents), remove zero-width characters, BOM handling, line-ending normalization, case operations | Working |
| 03 | `03_format_standardizer.py` | Standardize dates, currencies, names, phone numbers, addresses | Skeleton |
| 04 | `04_missing_value_handler.py` | Detect and handle missing values: disguised nulls (`N/A`, `-`, blanks, sentinel codes), imputation (mean/median/mode/forward-fill), required-field enforcement, drop-by-threshold | Skeleton |
| 05 | `05_column_mapper_enforcer.py` | Rename columns, enforce a target schema | Skeleton |

samples/messy_text.csv (new file, 13 lines)

@@ -0,0 +1,13 @@
customer_name,email,vendor,memo
Alice Johnson,alice@example.com,ACME Corp ,Welcome aboard
Bob Smith,bob@example.com,ACME Corp,Returning customer
Charlie Brown,charlie@example.com,Globex,Net 30
Diana Prince,diana@example.com,Globex,VIP
Edward Norton,ed@example.com,“Best Pet Supplies”,Order#42 - rush
Frank Castle,frank@example.com,Stark—Industries,"Line 1
Line 2
Line 3"
grace HOPPER ,grace@example.com,Globex,Loves long memos…
Henry Ford,henry@example.com,Ford Motor,Industrial
Iris West,iris@example.com,S.T.A.R. Labs,Notewith-bell
Jane Doe,jane@example.com,Acme,Standard

src/cli_text_clean.py (new file, 373 lines)

@@ -0,0 +1,373 @@
"""CLI for the DataTools text cleaner (script 02).

Usage:
    python -m src.cli_text_clean input.csv                          # dry-run preview
    python -m src.cli_text_clean input.csv --apply                  # write cleaned file
    python -m src.cli_text_clean input.csv --preset minimal --apply
    python -m src.cli_text_clean input.csv --case upper:name --apply
    python -m src.cli_text_clean --help                             # full help
"""

from __future__ import annotations

import sys
from datetime import datetime
from pathlib import Path
from typing import Optional

import typer
from loguru import logger

app = typer.Typer(
    name="text-clean",
    help=(
        "Clean and normalize text content in CSV and Excel files.\n\n"
        "By default, runs in preview mode — shows what would change without "
        "modifying anything. Add --apply to write the output.\n\n"
        "Examples:\n\n"
        "  # Preview what would change\n"
        "  python -m src.cli_text_clean messy.csv\n\n"
        "  # Apply the safe defaults (excel-hygiene preset)\n"
        "  python -m src.cli_text_clean messy.csv --apply\n\n"
        "  # Minimal: only trim and collapse whitespace\n"
        "  python -m src.cli_text_clean messy.csv --preset minimal --apply\n\n"
        "  # Title-case the 'name' column, leave others alone for case\n"
        "  python -m src.cli_text_clean people.csv --case title:name --apply\n\n"
        "  # Clean only specific columns\n"
        "  python -m src.cli_text_clean orders.csv --columns vendor,product --apply\n\n"
        "  # Skip a free-text column from cleaning\n"
        "  python -m src.cli_text_clean tickets.csv --skip notes --apply\n"
    ),
    add_completion=False,
    no_args_is_help=True,
)

# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------

def _setup_logging(log_dir: Path) -> Path:
    """Configure loguru to write a timestamped log file. Returns the log path."""
    log_dir.mkdir(parents=True, exist_ok=True)
    ts = datetime.now().strftime("%Y%m%d_%H%M%S")
    log_path = log_dir / f"text_clean_{ts}.log"
    logger.remove()
    logger.add(sys.stderr, level="WARNING", format="{message}")
    logger.add(
        str(log_path),
        level="DEBUG",
        format="{time:YYYY-MM-DD HH:mm:ss} | {level:<8} | {message}",
    )
    return log_path

def _parse_case(raw: Optional[str]) -> tuple[Optional[str], dict[str, str]]:
    """Parse --case argument.

    Forms:
        --case upper                  -> ("upper", {})  (apply to all selected)
        --case title:name             -> (None, {"name": "title"})
        --case upper:code,title:name  -> (None, {...})
    """
    if not raw:
        return None, {}
    if ":" not in raw:
        # Bare mode applies to all selected columns
        return raw.strip(), {}
    per_col: dict[str, str] = {}
    for piece in raw.split(","):
        piece = piece.strip()
        if not piece:
            continue
        if ":" not in piece:
            raise typer.BadParameter(
                f"Invalid --case piece: '{piece}'. "
                f"Expected 'mode' or 'mode:col[,mode:col...]' "
                f"(e.g., 'upper' or 'title:name,upper:code')."
            )
        mode, col = piece.split(":", 1)
        per_col[col.strip()] = mode.strip()
    return None, per_col


def _split_csv_arg(raw: Optional[str]) -> Optional[list[str]]:
    if raw is None:
        return None
    return [c.strip() for c in raw.split(",") if c.strip()]

# ---------------------------------------------------------------------------
# Main command
# ---------------------------------------------------------------------------

@app.command()
def clean(
    input_file: str = typer.Argument(
        ...,
        help="Path to the CSV or Excel file to clean.",
    ),
    output: Optional[str] = typer.Option(
        None, "--output", "-o",
        help="Output file path. Default: {input}_cleaned.csv",
    ),
    apply: bool = typer.Option(
        False, "--apply",
        help="Write the output files. Without this flag, only a preview is shown.",
    ),
    preset: str = typer.Option(
        "excel-hygiene", "--preset",
        help="Preset: minimal, excel-hygiene, or paranoid.",
    ),
    columns: Optional[str] = typer.Option(
        None, "--columns",
        help="Comma-separated columns to clean (default: all string columns).",
    ),
    skip: Optional[str] = typer.Option(
        None, "--skip",
        help="Comma-separated columns to skip even if they look like text.",
    ),
    case: Optional[str] = typer.Option(
        None, "--case",
        help=(
            "Case conversion. Bare mode 'upper'|'lower'|'title'|'sentence' applies to "
            "all selected columns. Per-column form: 'mode:col[,mode:col]' "
            "(e.g., 'title:name,upper:code')."
        ),
    ),
    no_trim: bool = typer.Option(False, "--no-trim", help="Disable whitespace trim."),
    no_collapse: bool = typer.Option(
        False, "--no-collapse", help="Disable internal whitespace collapse.",
    ),
    no_nfc: bool = typer.Option(False, "--no-nfc", help="Disable Unicode NFC normalization."),
    nfkc: bool = typer.Option(
        False, "--nfkc",
        help="Enable NFKC compat fold (lossy: ① → 1, ﬁ → fi). Default off.",
    ),
    no_smart_chars: bool = typer.Option(
        False, "--no-smart-chars",
        help="Disable smart-character folding (curly quotes, em/en-dash, NBSP).",
    ),
    no_zero_width: bool = typer.Option(
        False, "--no-zero-width", help="Disable zero-width / invisible char strip.",
    ),
    no_bom: bool = typer.Option(False, "--no-bom", help="Disable BOM strip."),
    no_control: bool = typer.Option(
        False, "--no-control", help="Disable control-character strip.",
    ),
    no_line_endings: bool = typer.Option(
        False, "--no-line-endings", help="Disable line-ending normalization.",
    ),
    full_changelog: bool = typer.Option(
        False, "--full-changelog",
        help="Write every cell change to the audit CSV (default caps to first 1000).",
    ),
    config: Optional[str] = typer.Option(
        None, "--config",
        help="Load options from a saved JSON config file.",
    ),
    save_config: Optional[str] = typer.Option(
        None, "--save-config",
        help="Save current options to a JSON config file.",
    ),
    sheet: Optional[str] = typer.Option(
        None, "--sheet",
        help="Excel sheet name or index (default: first sheet).",
    ),
    encoding_override: Optional[str] = typer.Option(
        None, "--encoding",
        help="Override auto-detected file encoding.",
    ),
    header_row: Optional[int] = typer.Option(
        None, "--header-row",
        help="0-based row index for the header (default: auto-detect).",
    ),
):
    """Clean and normalize text in a CSV or Excel file."""
    from src.core.io import read_file, write_file
    from src.core.text_clean import (
        CleanOptions,
        PRESETS,
        clean_dataframe,
    )
    import pandas as pd

    # ------------------------------------------------------------------
    # Validate inputs
    # ------------------------------------------------------------------
    input_path = Path(input_file)
    if not input_path.exists():
        typer.echo(f"Error: File not found: {input_path}", err=True)
        raise typer.Exit(1)
    if preset not in PRESETS:
        typer.echo(
            f"Error: Unknown preset '{preset}'. "
            f"Choose from: {', '.join(sorted(PRESETS))}.",
            err=True,
        )
        raise typer.Exit(1)
    log_path = _setup_logging(Path("logs"))

    # ------------------------------------------------------------------
    # Build CleanOptions
    # ------------------------------------------------------------------
    if config:
        cfg_path = Path(config)
        if not cfg_path.exists():
            typer.echo(f"Error: Config file not found: {cfg_path}", err=True)
            raise typer.Exit(1)
        options = CleanOptions.from_file(cfg_path)
        logger.info("Loaded config from {}", cfg_path)
    else:
        options = CleanOptions.from_preset(preset)

    # CLI overrides on top of preset/config
    if no_trim:
        options.trim = False
    if no_collapse:
        options.collapse_whitespace = False
    if no_nfc:
        options.nfc = False
    if nfkc:
        options.nfkc = True
    if no_smart_chars:
        options.fold_smart_chars = False
    if no_zero_width:
        options.strip_zero_width = False
    if no_bom:
        options.strip_bom = False
    if no_control:
        options.strip_control = False
    if no_line_endings:
        options.normalize_line_endings = False
    cols_list = _split_csv_arg(columns)
    if cols_list is not None:
        options.columns = cols_list
    skip_list = _split_csv_arg(skip)
    if skip_list:
        options.skip_columns = skip_list
    bare_case, per_col_case = _parse_case(case)
    if bare_case:
        options.case = bare_case  # type: ignore[assignment]
    if per_col_case:
        options.case_columns = {**options.case_columns, **per_col_case}  # type: ignore[dict-item]

    # ------------------------------------------------------------------
    # Save config if requested (after CLI merge so the file reflects intent)
    # ------------------------------------------------------------------
    if save_config:
        saved = options.to_file(save_config)
        typer.echo(f"Config saved to {saved}")

    # ------------------------------------------------------------------
    # Read input
    # ------------------------------------------------------------------
    typer.echo(f"Reading {input_path.name}...")
    try:
        sheet_arg: str | int | None = None
        if sheet is not None:
            try:
                sheet_arg = int(sheet)
            except ValueError:
                sheet_arg = sheet
        df = read_file(
            input_path,
            encoding=encoding_override,
            header_row=header_row,
            sheet_name=sheet_arg if sheet_arg is not None else 0,
        )
        if not isinstance(df, pd.DataFrame):
            df = pd.concat(list(df), ignore_index=True)
    except Exception as e:
        typer.echo(f"Error reading file: {e}", err=True)
        raise typer.Exit(1)
typer.echo(f" {len(df)} rows, {len(df.columns)} columns")
# ------------------------------------------------------------------
# Run pipeline
# ------------------------------------------------------------------
typer.echo("Cleaning text...")
try:
result = clean_dataframe(df, options)
except ValueError as e:
typer.echo(f"Error: {e}", err=True)
raise typer.Exit(1)
_print_results(result, input_path, options)
# ------------------------------------------------------------------
# Write output
# ------------------------------------------------------------------
if apply:
stem = input_path.stem
out_path = Path(output) if output else input_path.parent / f"{stem}_cleaned.csv"
write_file(result.cleaned_df, out_path)
typer.echo(f"\nCleaned file: {out_path}")
if not result.changes.empty:
changes_path = input_path.parent / f"{stem}_changes.csv"
audit_df = result.changes
cap = 1000
if not full_changelog and len(audit_df) > cap:
typer.echo(
f"Note: changelog capped at {cap} rows. "
f"Use --full-changelog to write all {len(audit_df)} changes."
)
audit_df = audit_df.head(cap)
write_file(audit_df, changes_path)
typer.echo(f"Changes audit: {changes_path}")
else:
typer.echo("\nThis was a preview. Add --apply to write the output files.")
typer.echo(f"Log: {log_path}")
# ---------------------------------------------------------------------------
# Output formatting
# ---------------------------------------------------------------------------
def _print_results(result, input_path: Path, options) -> None:
    """Print the run summary, per-column change counts, and a few examples."""
    pct = (result.cells_changed / result.cells_total * 100.0) if result.cells_total else 0.0
    typer.echo(f"\n{'-' * 50}")
    typer.echo(f" File: {input_path.name}")
    typer.echo(f" Columns processed: {len(result.columns_processed)}")
    typer.echo(f" Cells scanned: {result.cells_total}")
    typer.echo(f" Cells changed: {result.cells_changed} ({pct:.1f}%)")
    typer.echo(f"{'-' * 50}")
    if result.cells_changed and not result.changes.empty:
        # Per-column change counts
        counts = result.changes["column"].value_counts()
        typer.echo("\nChanges by column:")
        for col, n in counts.head(10).items():
            typer.echo(f" {col}: {n} cell(s)")
        if len(counts) > 10:
            typer.echo(f" ... and {len(counts) - 10} more columns")
        # Show first few examples (rows are reported 1-based)
        typer.echo("\nFirst examples:")
        for _, row in result.changes.head(5).iterrows():
            old = repr(row["old"])[:40]
            new = repr(row["new"])[:40]
            typer.echo(
                f" Row {row['row'] + 1}, {row['column']}: {old} -> {new} "
                f"[{row['ops_applied']}]"
            )
# ---------------------------------------------------------------------------
# __main__
# ---------------------------------------------------------------------------
def main():
app()
if __name__ == "__main__":
main()
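
The option-building step in the command above layers CLI flags over a preset (or a loaded config). A minimal sketch of that precedence follows. `Opts` and `build_options` are hypothetical stand-ins for `CleanOptions` and the inline merge logic, and the two presets are trimmed to three toggles for illustration.

```python
from dataclasses import dataclass


# Hypothetical stand-in for CleanOptions, reduced to three toggles.
@dataclass
class Opts:
    trim: bool = True
    fold_smart_chars: bool = True
    nfkc: bool = False


# Illustrative presets only; the real PRESETS table carries nine toggles each.
PRESETS = {
    "minimal": {"trim": True, "fold_smart_chars": False, "nfkc": False},
    "paranoid": {"trim": True, "fold_smart_chars": True, "nfkc": True},
}


def build_options(preset: str, *, no_trim: bool = False, nfkc: bool = False) -> Opts:
    """Start from a preset, then let explicit flags win."""
    if preset not in PRESETS:
        raise ValueError(f"Unknown preset '{preset}'")
    opts = Opts(**PRESETS[preset])
    if no_trim:  # a --no-* style flag only ever switches an op off
        opts.trim = False
    if nfkc:  # an opt-in flag only ever switches an op on
        opts.nfkc = True
    return opts


opts = build_options("minimal", nfkc=True)
print(opts)
```

Because each flag in this sketch moves a toggle in one direction only, leaving a flag unset never undoes what the preset chose.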


@@ -59,6 +59,25 @@ from .config import (
DeduplicationConfig,
StrategyConfig,
)
from .text_clean import (
CleanOptions,
CleanResult,
PRESETS,
apply_case,
clean_dataframe,
clean_value,
collapse_whitespace,
fold_smart_chars,
normalize_line_endings,
sentence_case,
smart_title_case,
strip_bom,
strip_control,
strip_zero_width,
to_nfc,
to_nfkc,
trim,
)
__all__ = [
# Core
@@ -90,4 +109,22 @@ __all__ = [
"DeduplicationConfig",
"StrategyConfig",
"ColumnStrategyConfig",
# Text cleaning
"CleanOptions",
"CleanResult",
"PRESETS",
"clean_dataframe",
"clean_value",
"trim",
"collapse_whitespace",
"to_nfc",
"to_nfkc",
"fold_smart_chars",
"strip_zero_width",
"strip_bom",
"strip_control",
"normalize_line_endings",
"smart_title_case",
"sentence_case",
"apply_case",
]
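
The module docstring for `text_clean.py` describes the exported helpers as idempotent and safe to compose. One way to spot-check that property is sketched below with standalone re-implementations of `trim` and `collapse_whitespace` (written from the definitions in this commit, not imported from the package). `is_idempotent` is a hypothetical helper name.

```python
import re

# Horizontal whitespace only: \n and \r are deliberately excluded.
_HSPACE_RUN = re.compile(r"[^\S\n\r]+")


def collapse_whitespace(s: str) -> str:
    return _HSPACE_RUN.sub(" ", s)


def trim(s: str) -> str:
    return s.strip()


def composed(s: str) -> str:
    return trim(collapse_whitespace(s))


def is_idempotent(fn, samples) -> bool:
    """True when applying fn twice equals applying it once for every sample."""
    return all(fn(fn(x)) == fn(x) for x in samples)


samples = ["  a \t b  ", "line1\n  line2", "", "already clean"]
results = {
    "collapse_whitespace": is_idempotent(collapse_whitespace, samples),
    "trim": is_idempotent(trim, samples),
    "composed": is_idempotent(composed, samples),
}
print(results)
```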

src/core/text_clean.py

@@ -0,0 +1,489 @@
"""Character-level text hygiene for DataFrames.
Operations are independently toggleable, idempotent, and safe to compose.
Each per-string helper is ``str -> str``. Numeric, datetime, and boolean
columns pass through ``clean_dataframe`` untouched; only string cells are
modified.
See TECHNICAL.md Section 10.2 for the full functional spec.
"""
from __future__ import annotations
import json
import re
import unicodedata
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Any, Callable, Iterable, Literal, Optional
import pandas as pd
from pandas.api import types as pdtypes
# ---------------------------------------------------------------------------
# Per-string helpers
# ---------------------------------------------------------------------------
# Smart-character map (curly quotes, dashes, ellipsis, NBSP, narrow NBSP).
_SMART_CHARS: dict[str, str] = {
    "\u2018": "'",    # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",    # RIGHT SINGLE QUOTATION MARK
    "\u201A": "'",    # SINGLE LOW-9 QUOTATION MARK
    "\u201B": "'",    # SINGLE HIGH-REVERSED-9 QUOTATION MARK
    "\u201C": '"',    # LEFT DOUBLE QUOTATION MARK
    "\u201D": '"',    # RIGHT DOUBLE QUOTATION MARK
    "\u201E": '"',    # DOUBLE LOW-9 QUOTATION MARK
    "\u201F": '"',    # DOUBLE HIGH-REVERSED-9 QUOTATION MARK
    "\u2013": "-",    # EN DASH
    "\u2014": "-",    # EM DASH
    "\u2015": "-",    # HORIZONTAL BAR
    "\u2212": "-",    # MINUS SIGN
    "\u2026": "...",  # HORIZONTAL ELLIPSIS
    "\u00A0": " ",    # NO-BREAK SPACE
    "\u202F": " ",    # NARROW NO-BREAK SPACE
    "\u2009": " ",    # THIN SPACE
    "\u200A": " ",    # HAIR SPACE
    "\u2002": " ",    # EN SPACE
    "\u2003": " ",    # EM SPACE
    "\u3000": " ",    # IDEOGRAPHIC SPACE
}
_SMART_TRANS = str.maketrans(_SMART_CHARS)
# Zero-width / invisible characters. ``U+FEFF`` (BOM/ZWNBSP) is included; if
# it appears at the *very start* of the first cell of the first column, the
# BOM-strip op handles it; elsewhere it is treated as a zero-width char.
_ZERO_WIDTH = (
    "\u200B"  # ZERO WIDTH SPACE
    "\u200C"  # ZERO WIDTH NON-JOINER
    "\u200D"  # ZERO WIDTH JOINER
    "\u2060"  # WORD JOINER
    "\u200E"  # LEFT-TO-RIGHT MARK
    "\u200F"  # RIGHT-TO-LEFT MARK
    "\uFEFF"  # ZERO WIDTH NO-BREAK SPACE / BOM
)
_ZERO_WIDTH_RE = re.compile(f"[{_ZERO_WIDTH}]")
# Control characters: U+0000-U+001F and U+007F, but preserve \t \n \r.
_CONTROL_RE = re.compile(r"[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]")
# Any run of *horizontal* whitespace (spaces, tabs, form/vertical feeds).
# Newlines and carriage returns are excluded so multi-line cells keep their
# line structure; the line-ending op normalizes the actual line terminators.
_WHITESPACE_RUN_RE = re.compile(r"[^\S\n\r]+")
def trim(s: str) -> str:
"""Strip leading/trailing whitespace."""
if not isinstance(s, str):
return s
return s.strip()
def collapse_whitespace(s: str) -> str:
    """Collapse runs of horizontal whitespace to a single space.

    Leading and trailing runs shrink to one space rather than disappearing
    (use ``trim`` to remove them). Tabs and other horizontal whitespace
    become a single regular space; ``\\n`` and ``\\r`` are left alone so
    multi-line cells keep their line structure.
    """
    if not isinstance(s, str):
        return s
    return _WHITESPACE_RUN_RE.sub(" ", s)
def to_nfc(s: str) -> str:
"""Apply Unicode NFC (canonical composition)."""
if not isinstance(s, str):
return s
return unicodedata.normalize("NFC", s)
def to_nfkc(s: str) -> str:
"""Apply Unicode NFKC (compatibility composition). Lossy."""
if not isinstance(s, str):
return s
return unicodedata.normalize("NFKC", s)
def fold_smart_chars(s: str) -> str:
"""Fold curly quotes, em/en-dashes, ellipsis, NBSP variants to ASCII."""
if not isinstance(s, str):
return s
return s.translate(_SMART_TRANS)
def strip_zero_width(s: str) -> str:
"""Remove zero-width and bidi-mark characters."""
if not isinstance(s, str):
return s
return _ZERO_WIDTH_RE.sub("", s)
def strip_bom(s: str) -> str:
    """Remove a leading ``U+FEFF`` (BOM) from the start of the string."""
    if not isinstance(s, str):
        return s
    return s.lstrip("\ufeff")
def strip_control(s: str) -> str:
"""Remove control characters except ``\\t \\n \\r``."""
if not isinstance(s, str):
return s
return _CONTROL_RE.sub("", s)
def normalize_line_endings(s: str) -> str:
"""Normalize ``\\r\\n`` and bare ``\\r`` to ``\\n``."""
if not isinstance(s, str):
return s
return s.replace("\r\n", "\n").replace("\r", "\n")
# Smart title-case helpers
_TITLE_LOWERCASE_PARTICLES = {
"a", "an", "and", "as", "at", "but", "by", "en", "for", "if", "in", "nor",
"of", "on", "or", "per", "the", "to", "v", "v.", "vs", "vs.", "via",
}
def _is_all_caps_token(token: str) -> bool:
"""A token is all-caps when it has at least one cased char and no lowercase."""
has_letter = any(c.isalpha() for c in token)
has_lower = any(c.islower() for c in token)
return has_letter and not has_lower and len(token) >= 2
def smart_title_case(s: str) -> str:
"""Title-case that preserves all-caps tokens and lowercases mid-string particles.
- ``USA`` stays ``USA``.
- ``of``, ``and``, ``the``, etc. stay lowercase except as the first/last word.
- Apostrophes inside words don't restart capitalization (``O'Neil``).
"""
if not isinstance(s, str) or not s:
return s
tokens = s.split(" ")
out: list[str] = []
last_idx = len(tokens) - 1
for i, tok in enumerate(tokens):
if not tok:
out.append(tok)
continue
if _is_all_caps_token(tok):
out.append(tok)
continue
lowered = tok.lower()
if 0 < i < last_idx and lowered in _TITLE_LOWERCASE_PARTICLES:
out.append(lowered)
continue
# Capitalize first cased character; preserve apostrophes/hyphens
chars = list(tok)
capitalized = False
for j, c in enumerate(chars):
if c.isalpha():
if not capitalized:
chars[j] = c.upper()
capitalized = True
else:
chars[j] = c.lower()
out.append("".join(chars))
return " ".join(out)
def sentence_case(s: str) -> str:
    """Lowercase, then capitalize the first cased letter after each ``. ! ?``."""
    if not isinstance(s, str) or not s:
        return s
    chars = list(s.lower())
    capitalize_next = True
    for i, c in enumerate(chars):
        if c in ".!?":
            capitalize_next = True
        elif capitalize_next and c.isalpha():
            # Digits, quotes, and parentheses do not consume the trigger;
            # only the first letter after a terminator is capitalized.
            chars[i] = c.upper()
            capitalize_next = False
    return "".join(chars)
CaseMode = Literal["upper", "lower", "title", "sentence"]
def apply_case(s: str, mode: CaseMode) -> str:
if not isinstance(s, str):
return s
if mode == "upper":
return s.upper()
if mode == "lower":
return s.lower()
if mode == "title":
return smart_title_case(s)
if mode == "sentence":
return sentence_case(s)
raise ValueError(f"Unknown case mode: {mode}")
# ---------------------------------------------------------------------------
# Options / result dataclasses
# ---------------------------------------------------------------------------
PRESETS: dict[str, dict[str, Any]] = {
"minimal": {
"trim": True,
"collapse_whitespace": True,
"nfc": False,
"nfkc": False,
"fold_smart_chars": False,
"strip_zero_width": False,
"strip_bom": False,
"strip_control": False,
"normalize_line_endings": False,
},
"excel-hygiene": {
"trim": True,
"collapse_whitespace": True,
"nfc": True,
"nfkc": False,
"fold_smart_chars": True,
"strip_zero_width": True,
"strip_bom": True,
"strip_control": True,
"normalize_line_endings": True,
},
"paranoid": {
"trim": True,
"collapse_whitespace": True,
"nfc": True,
"nfkc": True,
"fold_smart_chars": True,
"strip_zero_width": True,
"strip_bom": True,
"strip_control": True,
"normalize_line_endings": True,
},
}
@dataclass
class CleanOptions:
"""Toggles for character-level cleaning operations.
Defaults match the ``excel-hygiene`` preset.
"""
# Operations
trim: bool = True
collapse_whitespace: bool = True
nfc: bool = True
nfkc: bool = False
fold_smart_chars: bool = True
strip_zero_width: bool = True
strip_bom: bool = True
strip_control: bool = True
normalize_line_endings: bool = True
# Case conversion: either a single mode applied to all selected columns,
# or a dict mapping column name -> mode for per-column control.
case: Optional[CaseMode] = None
case_columns: dict[str, CaseMode] = field(default_factory=dict)
# Scope control
columns: Optional[list[str]] = None # None = all string-typed columns
skip_columns: list[str] = field(default_factory=list)
@classmethod
def from_preset(cls, name: str) -> CleanOptions:
if name not in PRESETS:
raise ValueError(
f"Unknown preset '{name}'. "
f"Available: {', '.join(sorted(PRESETS))}."
)
return cls(**PRESETS[name])
@classmethod
def from_dict(cls, data: dict) -> CleanOptions:
known = {f for f in cls.__dataclass_fields__}
kwargs = {k: v for k, v in data.items() if k in known}
return cls(**kwargs)
def to_dict(self) -> dict:
return asdict(self)
def to_file(self, path: str | Path) -> Path:
out = Path(path)
out.write_text(json.dumps(self.to_dict(), indent=2))
return out
@classmethod
def from_file(cls, path: str | Path) -> CleanOptions:
return cls.from_dict(json.loads(Path(path).read_text()))
@dataclass
class CleanResult:
"""Output of ``clean_dataframe``."""
cleaned_df: pd.DataFrame
changes: pd.DataFrame # cols: row, column, old, new, ops_applied
cells_changed: int
cells_total: int
columns_processed: list[str]
# ---------------------------------------------------------------------------
# Cell-level pipeline
# ---------------------------------------------------------------------------
def _build_pipeline(options: CleanOptions) -> list[tuple[str, Callable[[str], str]]]:
"""Return ordered (op_name, fn) pairs for the cell-level pipeline.
Order is meaningful:
1. BOM strip first so a leading FEFF doesn't survive into other ops.
2. Line-ending normalize before whitespace ops so \\r\\n collapses cleanly.
3. Control-char strip before whitespace ops.
4. Smart-char fold before NFC/NFKC (folded ASCII is already NFC-stable).
5. NFC then NFKC (NFKC subsumes NFC if both set; we still run NFC first
so the result is identical to NFKC alone — kept explicit for logging).
6. Zero-width strip after Unicode normalization (NFKC can introduce
decomposed forms whose combining marks must not be stripped).
7. Whitespace collapse, then trim, last.
"""
ops: list[tuple[str, Callable[[str], str]]] = []
if options.strip_bom:
ops.append(("strip_bom", strip_bom))
if options.normalize_line_endings:
ops.append(("normalize_line_endings", normalize_line_endings))
if options.strip_control:
ops.append(("strip_control", strip_control))
if options.fold_smart_chars:
ops.append(("fold_smart_chars", fold_smart_chars))
if options.nfc:
ops.append(("nfc", to_nfc))
if options.nfkc:
ops.append(("nfkc", to_nfkc))
if options.strip_zero_width:
ops.append(("strip_zero_width", strip_zero_width))
if options.collapse_whitespace:
ops.append(("collapse_whitespace", collapse_whitespace))
if options.trim:
ops.append(("trim", trim))
return ops
def clean_value(value: Any, options: CleanOptions) -> tuple[Any, list[str]]:
"""Apply the configured pipeline to a single cell.
Returns ``(cleaned_value, ops_applied)``. Non-strings and missing values
pass through unchanged with an empty ``ops_applied`` list.
"""
if value is None or (isinstance(value, float) and pd.isna(value)):
return value, []
if not isinstance(value, str):
return value, []
pipeline = _build_pipeline(options)
cur = value
applied: list[str] = []
for name, fn in pipeline:
new = fn(cur)
if new != cur:
applied.append(name)
cur = new
return cur, applied
# ---------------------------------------------------------------------------
# DataFrame-level entry point
# ---------------------------------------------------------------------------
def _select_columns(df: pd.DataFrame, options: CleanOptions) -> list[str]:
"""Pick the columns the pipeline should operate on.
- If ``options.columns`` is explicit, use it (after validating).
- Otherwise default to columns whose pandas dtype is object/string.
- Always exclude ``options.skip_columns``.
"""
if options.columns is not None:
missing = [c for c in options.columns if c not in df.columns]
if missing:
raise ValueError(
f"Columns not found in input: {missing}. "
f"Available: {list(df.columns)}"
)
chosen: Iterable[str] = options.columns
else:
chosen = [
c for c in df.columns
if pdtypes.is_object_dtype(df[c]) or pdtypes.is_string_dtype(df[c])
]
skip = set(options.skip_columns)
return [c for c in chosen if c not in skip]
def clean_dataframe(df: pd.DataFrame, options: Optional[CleanOptions] = None) -> CleanResult:
"""Apply text-cleaning ops to selected columns of *df*.
Numeric, datetime, and boolean columns are skipped by default. The input
DataFrame is not mutated; a copy is returned in ``CleanResult.cleaned_df``.
"""
options = options or CleanOptions()
out = df.copy()
columns = _select_columns(out, options)
case_per_col: dict[str, CaseMode] = dict(options.case_columns)
if options.case is not None:
for c in columns:
case_per_col.setdefault(c, options.case)
change_records: list[dict[str, Any]] = []
cells_changed = 0
cells_total = 0
for col in columns:
series = out[col]
new_values: list[Any] = []
col_case = case_per_col.get(col)
for row_idx, original in enumerate(series.tolist()):
cells_total += 1
cleaned, ops_applied = clean_value(original, options)
if col_case is not None and isinstance(cleaned, str):
cased = apply_case(cleaned, col_case)
if cased != cleaned:
ops_applied.append(f"case:{col_case}")
cleaned = cased
if ops_applied and cleaned != original:
cells_changed += 1
change_records.append({
"row": row_idx,
"column": col,
"old": original,
"new": cleaned,
"ops_applied": ",".join(ops_applied),
})
new_values.append(cleaned)
out[col] = new_values
changes_df = pd.DataFrame(
change_records,
columns=["row", "column", "old", "new", "ops_applied"],
)
return CleanResult(
cleaned_df=out,
changes=changes_df,
cells_changed=cells_changed,
cells_total=cells_total,
columns_processed=columns,
)
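
The ordering described in `_build_pipeline` and the `ops_applied` bookkeeping in `clean_value` can be sketched with the standard library alone. This is a reduced illustration under stated assumptions: the smart-character map is abbreviated, control-character stripping and NFKC are left out, and `PIPELINE` and `clean` are hypothetical names rather than the module's API.

```python
import re
import unicodedata

# Abbreviated fold table; the module's map covers more characters.
SMART = str.maketrans({
    "\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"',
    "\u2013": "-", "\u2014": "-", "\u2026": "...", "\u00a0": " ",
})
ZERO_WIDTH_RE = re.compile("[\u200b\u200c\u200d\u2060\u200e\u200f\ufeff]")
HSPACE_RUN_RE = re.compile(r"[^\S\n\r]+")

# Same relative order as the docstring: BOM, line endings, smart chars,
# NFC, zero-width, whitespace collapse, trim.
PIPELINE = [
    ("strip_bom", lambda s: s.lstrip("\ufeff")),
    ("normalize_line_endings", lambda s: s.replace("\r\n", "\n").replace("\r", "\n")),
    ("fold_smart_chars", lambda s: s.translate(SMART)),
    ("nfc", lambda s: unicodedata.normalize("NFC", s)),
    ("strip_zero_width", lambda s: ZERO_WIDTH_RE.sub("", s)),
    ("collapse_whitespace", lambda s: HSPACE_RUN_RE.sub(" ", s)),
    ("trim", str.strip),
]


def clean(value: str) -> tuple[str, list[str]]:
    """Run every op in order; record only the ops that changed the value."""
    applied: list[str] = []
    for name, fn in PIPELINE:
        new = fn(value)
        if new != value:
            applied.append(name)
        value = new
    return value, applied


cleaned, ops = clean("\ufeff  \u201cBob\u201d\u200b   Smith\r\n")
print(repr(cleaned), ops)
```

An op that runs but leaves the string unchanged (NFC here) is not recorded, which keeps the audit column limited to operations that actually altered the cell.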


@@ -1,10 +1,13 @@
"""DataTools Text Cleaner — stub page."""
"""DataTools Text Cleaner — Streamlit page."""
from __future__ import annotations
import io
import json
import sys
from pathlib import Path
import pandas as pd
import streamlit as st
_project_root = Path(__file__).resolve().parent.parent.parent.parent
@@ -12,82 +15,236 @@ if str(_project_root) not in sys.path:
sys.path.insert(0, str(_project_root))
from src.gui.components import hide_streamlit_chrome
from src.core.text_clean import (
PRESETS,
CleanOptions,
clean_dataframe,
)
hide_streamlit_chrome()
# ---------------------------------------------------------------------------
# Header
# ---------------------------------------------------------------------------
st.title("✂️ Text Cleaner")
st.caption("Clean and normalize text content across your data.")
st.info("This tool is under development.")
st.caption(
"Trim whitespace, fold smart quotes, strip invisible characters, and "
"normalize line endings. Runs locally — your data never leaves this computer."
)
# ---------------------------------------------------------------------------
# What this tool will do
# ---------------------------------------------------------------------------
st.markdown("""
**Features:**
- Trim leading/trailing whitespace
- Collapse multiple spaces into one
- Unicode normalization (NFC/NFKC)
- Strip non-printable / control characters
- Remove BOM (byte order mark)
- Normalize line endings (CRLF → LF)
- Case conversion (upper, lower, title, sentence)
""")
st.divider()
# ---------------------------------------------------------------------------
# File upload (functional)
# File upload
# ---------------------------------------------------------------------------
uploaded = st.file_uploader(
"Upload CSV or Excel file",
type=["csv", "tsv", "xlsx", "xls"],
help="Upload a file to preview. Processing is not yet available.",
key="textclean_file_upload",
)
if uploaded is not None:
import pandas as pd
try:
if uploaded.name.endswith((".xlsx", ".xls")):
df = pd.read_excel(uploaded)
else:
df = pd.read_csv(uploaded)
st.subheader(f"Preview: {uploaded.name}")
st.caption(f"{len(df)} rows, {len(df.columns)} columns")
st.dataframe(df.head(10), use_container_width=True)
except Exception as e:
st.error(f"Failed to read file: {e}")
if uploaded is None:
st.info("Upload a CSV, TSV, or Excel file to begin.")
st.stop()
# ---------------------------------------------------------------------------
# Placeholder options
# ---------------------------------------------------------------------------
st.subheader("Operations")
@st.cache_data(show_spinner=False)
def _read_uploaded(name: str, data: bytes) -> pd.DataFrame:
"""Read the uploaded bytes into a DataFrame, treating all cells as strings."""
suffix = Path(name).suffix.lower()
bio = io.BytesIO(data)
if suffix in (".xlsx", ".xls"):
return pd.read_excel(bio, dtype=str, keep_default_na=False)
# CSV / TSV — try utf-8 then utf-8-sig then latin-1 as a fallback
for enc in ("utf-8", "utf-8-sig", "latin-1"):
try:
bio.seek(0)
sep = "\t" if suffix == ".tsv" else ","
return pd.read_csv(
bio, dtype=str, keep_default_na=False,
encoding=enc, sep=sep, on_bad_lines="warn",
)
except UnicodeDecodeError:
continue
bio.seek(0)
return pd.read_csv(bio, dtype=str, keep_default_na=False, encoding="latin-1")
st.checkbox("Trim whitespace", value=True, disabled=True)
st.checkbox("Collapse multiple spaces", value=True, disabled=True)
st.checkbox("Unicode normalization (NFC)", value=False, disabled=True)
st.checkbox("Strip non-printable characters", value=False, disabled=True)
st.checkbox("Remove BOM", value=False, disabled=True)
st.checkbox("Normalize line endings", value=False, disabled=True)
st.selectbox("Case conversion", ["None", "UPPER", "lower", "Title Case", "Sentence case"], disabled=True)
try:
df = _read_uploaded(uploaded.name, uploaded.getvalue())
except Exception as e:
st.error(f"Failed to read file: {e}")
st.stop()
st.subheader(f"Preview: {uploaded.name}")
st.caption(f"{len(df)} rows, {len(df.columns)} columns")
st.dataframe(df.head(10), use_container_width=True)
st.divider()
st.button("Clean Text", type="primary", use_container_width=True, disabled=True)
# ---------------------------------------------------------------------------
# Footer
# Options
# ---------------------------------------------------------------------------
st.divider()
st.caption(
"Runs locally. Your data never leaves this computer. "
"| DataTools v3.0"
st.subheader("Options")
preset_label = st.radio(
"Preset",
["excel-hygiene (recommended)", "minimal", "paranoid"],
index=0,
horizontal=True,
help=(
"excel-hygiene: trim, collapse whitespace, fold smart quotes, strip "
"invisible chars, normalize line endings, NFC. "
"minimal: only trim and collapse. "
"paranoid: everything including NFKC compat fold (lossy)."
),
)
preset_key = preset_label.split(" ", 1)[0]
options = CleanOptions.from_preset(preset_key)
with st.expander("Advanced options"):
col_a, col_b = st.columns(2)
with col_a:
options.trim = st.checkbox("Trim leading/trailing whitespace", value=options.trim)
options.collapse_whitespace = st.checkbox(
"Collapse internal whitespace", value=options.collapse_whitespace,
)
options.normalize_line_endings = st.checkbox(
"Normalize line endings (\\r\\n → \\n)", value=options.normalize_line_endings,
)
options.strip_control = st.checkbox(
"Strip control characters", value=options.strip_control,
)
options.strip_bom = st.checkbox("Strip BOM", value=options.strip_bom)
with col_b:
options.fold_smart_chars = st.checkbox(
"Fold smart characters (curly quotes, em-dash, NBSP)",
value=options.fold_smart_chars,
)
options.strip_zero_width = st.checkbox(
"Strip zero-width / invisible characters", value=options.strip_zero_width,
)
options.nfc = st.checkbox("Unicode NFC normalization", value=options.nfc)
options.nfkc = st.checkbox(
"Unicode NFKC compat fold (lossy: ① → 1, ﬁ → fi)",
value=options.nfkc,
)
st.markdown("**Scope**")
string_cols = [
c for c in df.columns
if pd.api.types.is_object_dtype(df[c]) or pd.api.types.is_string_dtype(df[c])
]
selected_cols = st.multiselect(
"Columns to clean (default: all string columns)",
options=list(df.columns),
default=string_cols,
)
skip_cols = st.multiselect(
"Columns to skip even if they look like text",
options=list(df.columns),
default=[],
)
options.columns = selected_cols if selected_cols else None
options.skip_columns = list(skip_cols)
st.markdown("**Case conversion**")
case_global = st.selectbox(
"Apply case conversion to selected columns",
["None", "UPPER", "lower", "Title", "Sentence"],
index=0,
)
case_map = {
"UPPER": "upper", "lower": "lower",
"Title": "title", "Sentence": "sentence",
}
if case_global != "None":
options.case = case_map[case_global] # type: ignore[assignment]
# ---------------------------------------------------------------------------
# Run
# ---------------------------------------------------------------------------
st.divider()
if st.button("Clean Text", type="primary", use_container_width=True):
with st.spinner("Cleaning..."):
try:
result = clean_dataframe(df, options)
except ValueError as e:
st.error(str(e))
st.stop()
st.session_state["textclean_result"] = result
st.session_state["textclean_input_name"] = uploaded.name
result = st.session_state.get("textclean_result")
if result is None:
st.stop()
# ---------------------------------------------------------------------------
# Results
# ---------------------------------------------------------------------------
st.subheader("Results")
pct = (result.cells_changed / result.cells_total * 100.0) if result.cells_total else 0.0
m1, m2, m3, m4 = st.columns(4)
m1.metric("Cells scanned", result.cells_total)
m2.metric("Cells changed", result.cells_changed)
m3.metric("% changed", f"{pct:.1f}%")
m4.metric("Columns processed", len(result.columns_processed))
if result.cells_changed:
counts = result.changes["column"].value_counts()
st.markdown("**Changes by column**")
st.dataframe(
counts.rename("cells_changed").to_frame(),
use_container_width=True,
)
st.markdown("**Examples (first 25 changes)**")
examples = result.changes.head(25).copy()
examples["row"] = examples["row"] + 1
st.dataframe(examples, use_container_width=True, hide_index=True)
st.markdown("**Cleaned preview (first 10 rows)**")
st.dataframe(result.cleaned_df.head(10), use_container_width=True)
# ---------------------------------------------------------------------------
# Downloads
# ---------------------------------------------------------------------------
st.divider()
stem = Path(st.session_state.get("textclean_input_name", "input")).stem
dl_a, dl_b, dl_c = st.columns(3)
with dl_a:
cleaned_bytes = result.cleaned_df.to_csv(index=False).encode("utf-8-sig")
st.download_button(
"Download cleaned CSV",
data=cleaned_bytes,
file_name=f"{stem}_cleaned.csv",
mime="text/csv",
)
with dl_b:
if not result.changes.empty:
changes_bytes = result.changes.to_csv(index=False).encode("utf-8-sig")
st.download_button(
"Download changes audit",
data=changes_bytes,
file_name=f"{stem}_changes.csv",
mime="text/csv",
)
with dl_c:
config_bytes = json.dumps(options.to_dict(), indent=2).encode("utf-8")
st.download_button(
"Download config JSON",
data=config_bytes,
file_name="text_clean_config.json",
mime="application/json",
)
st.divider()
st.caption("Runs locally. Your data never leaves this computer. | DataTools v3.0")
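
The encoding fallback used by `_read_uploaded` can be looked at in isolation. The sketch below uses a hypothetical `decode_with_fallback` helper on raw bytes rather than `pd.read_csv`. It also shows one consequence of trying `utf-8` before `utf-8-sig`: a BOM-prefixed file decodes successfully as `utf-8`, so the BOM survives as `U+FEFF` in the first header cell and has to be dealt with after reading.

```python
def decode_with_fallback(
    data: bytes,
    encodings: tuple[str, ...] = ("utf-8", "utf-8-sig", "latin-1"),
) -> tuple[str, str]:
    """Return (text, encoding_used) for the first encoding that decodes cleanly."""
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Not reachable with the default list, since latin-1 accepts any byte.
    raise ValueError("none of the candidate encodings could decode the data")


plain = decode_with_fallback("café".encode("utf-8"))
legacy = decode_with_fallback("café".encode("latin-1"))
with_bom = decode_with_fallback(b"\xef\xbb\xbfid,name")
sig_first = decode_with_fallback(b"\xef\xbb\xbfid,name", ("utf-8-sig", "utf-8"))
print(plain, legacy, with_bom, sig_first)
```

Trying `utf-8-sig` first would drop the BOM at read time for both BOM and non-BOM UTF-8 input; whether that is preferable here depends on whether the BOM should appear in the changes audit.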


@@ -0,0 +1,8 @@
id,address
1,"123 Main St
Apt 4B
NYC NY 10001"
2,"456 Oak Ave
Suite 200
LA CA 90001"
3,"789 Pine Rd
1 id address
2 1 123 Main St Apt 4B NYC NY 10001
3 2 456 Oak Ave Suite 200 LA CA 90001
4 3 789 Pine Rd Unit 5 SF CA 94101

Binary file not shown.
1 name note
2 Alice normal
3 Bob tab here
4 Charlie null�byte


@@ -0,0 +1,5 @@
name,translation
Café,Cafe
éclair,eclair
你好,Hello (CN)
שלום,Hello (HE)

@@ -0,0 +1,5 @@
x,y,z
1,1.1,10
2,2.2,20
3,3.3,30
4,4.4,40

@@ -0,0 +1,6 @@
field
single curly
“double curly”
low-9 x high-reversed-9
em — en minus horizontal ―
ellipsis… narrownbsp 
1 field
2 ‘single curly’
3 “double curly”
4 low-9 ‘x’ high-reversed-9
5 em — en – minus − horizontal ―
6 ellipsis… narrow nbsp 


@@ -0,0 +1,5 @@
first_name,last_name,phone
John ,Smith,555-1234
Jane,Doe ,555-5678
 Bob,Jones,555-9012
Alice,Brown,555-3456
1 first_name last_name phone
2 John  Smith 555-1234
3 Jane Doe  555-5678
4 Bob Jones 555-9012
5 Alice Brown 555-3456


@@ -0,0 +1,4 @@
sku,title,description
DOG-001,“Best Dog Collar”,High quality…
CAT-002,Cat Toy — Premium,It’s the best
FISH-003,Fish Food – Tropical,Use don’t overfeed

@@ -0,0 +1,4 @@
customer_id,name,amount
1001,Alice,100.0
1002,Bob,200.0
1003,Charlie,300.0

@@ -0,0 +1,4 @@
sku,qty
ABC-123,10
XYZ-456,20
QQQ-789,30
1 sku qty
2 ABC​-123 10
3 XYZ-456​ 20
4 QQQ-789 30


@@ -0,0 +1,7 @@
date,amount,memo
2024-01-15,-1500.0,"Payment
Monthly recurring
Net 30"
2024-01-16,-250.0,Single line memo
2024-01-17,-89.99,"Standard
purchase"
1 date amount memo
2 2024-01-15 -1500.0 Payment Monthly recurring Net 30
3 2024-01-16 -250.0 Single line memo
4 2024-01-17 -89.99 Standard purchase


@@ -0,0 +1,6 @@
vendor,ein
ACME Corp ,12-3456789
ACME Corp,12-3456789
ACME Corp ,12-3456789
Globex Inc,98-7654321
Globex Inc ,98-7654321
1 vendor ein
2 ACME Corp 12-3456789
3 ACME Corp 12-3456789
4 ACME Corp 12-3456789
5 Globex Inc 98-7654321
6 Globex Inc 98-7654321
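
The vendor fixture above targets a common symptom: values that look identical but differ in stray whitespace, so they count as distinct. A small sketch of the effect follows. The vendor strings are hypothetical stand-ins for the fixture's cells (the exact whitespace in the file is not recoverable from this view), and `clean` is a local helper that combines whitespace collapse and trim.

```python
import re

# Hypothetical values: trailing spaces and a doubled internal space.
vendors = ["ACME Corp ", "ACME Corp", "ACME  Corp", "Globex Inc", "Globex Inc "]


def clean(s: str) -> str:
    """Collapse horizontal whitespace runs, then strip both ends."""
    return re.sub(r"[^\S\n\r]+", " ", s).strip()


distinct_before = len(set(vendors))
distinct_after = len({clean(v) for v in vendors})
print(distinct_before, distinct_after)
```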


@@ -0,0 +1,4 @@
company,city
Café Roma,Boston
Très Belle,Montréal
Naïve Studios,São Paulo

@@ -0,0 +1,4 @@
task,owner
Phase 1 — Discovery,Alice
Phase 2 — Design,Bob
Q1 – Q2,Charlie

@@ -0,0 +1,6 @@
response_id,agreement,category
1,YES,Tech
2,yes,TECH
3,Yes,tech
4,yEs,Tech
5,yes, Tech
1 response_id agreement category
2 1 YES Tech
3 2 yes TECH
4 3 Yes tech
5 4 yEs Tech
6 5 yes Tech

View File

@@ -0,0 +1,4 @@
email,source
alice@test.com,Facebook
bob@test.com,Google
charlie@test.com,Organic
1 email source
2 alice​@test.com Facebook
3 bob@test‎.com Google
4 charlie@test.com Organic


@@ -0,0 +1,6 @@
email,platform
alice@a.com,FB
"alice@a.com
",Google
"alice@a.com
",Organic
1 email platform
2 alice@a.com FB
3 alice@a.com Google
4 alice@a.com Organic
5 bob@a.com FB


@@ -0,0 +1,158 @@
"""Integration tests for the text-cleaner CLI."""

from __future__ import annotations

from pathlib import Path

import pandas as pd
import pytest
from typer.testing import CliRunner

from src.cli_text_clean import app

runner = CliRunner()


@pytest.fixture
def messy_csv(tmp_path):
    df = pd.DataFrame({
        "name": [" Alice ", "“Bob”", "Charlie"],
        "city": ["NYC", " LA ", "SF"],
        "qty": [1, 2, 3],
    })
    path = tmp_path / "messy.csv"
    df.to_csv(path, index=False)
    return path


class TestPreview:
    def test_default_is_preview(self, messy_csv):
        result = runner.invoke(app, [str(messy_csv)])
        assert result.exit_code == 0
        assert "preview" in result.output.lower()
        assert "Cells changed" in result.output

    def test_no_files_written_in_preview(self, messy_csv):
        result = runner.invoke(app, [str(messy_csv)])
        assert result.exit_code == 0
        assert not (messy_csv.parent / f"{messy_csv.stem}_cleaned.csv").exists()

    def test_file_not_found(self):
        result = runner.invoke(app, ["/tmp/does_not_exist_xyz.csv"])
        assert result.exit_code != 0
        assert "not found" in result.output.lower()


class TestApply:
    def test_apply_writes_cleaned_file(self, messy_csv):  # E47
        result = runner.invoke(app, [str(messy_csv), "--apply"])
        assert result.exit_code == 0
        cleaned = messy_csv.parent / f"{messy_csv.stem}_cleaned.csv"
        assert cleaned.exists()
        df = pd.read_csv(cleaned)
        assert df["name"].iloc[0] == "Alice"

    def test_apply_writes_changes_audit(self, messy_csv):
        result = runner.invoke(app, [str(messy_csv), "--apply"])
        assert result.exit_code == 0
        changes = messy_csv.parent / f"{messy_csv.stem}_changes.csv"
        assert changes.exists()

    def test_no_audit_when_no_changes(self, tmp_path):
        clean = tmp_path / "clean.csv"
        pd.DataFrame({"a": ["x", "y"]}).to_csv(clean, index=False)
        result = runner.invoke(app, [str(clean), "--apply"])
        assert result.exit_code == 0
        assert not (tmp_path / "clean_changes.csv").exists()

    def test_custom_output_path(self, messy_csv, tmp_path):
        out = tmp_path / "renamed.csv"
        result = runner.invoke(app, [str(messy_csv), "--apply", "-o", str(out)])
        assert result.exit_code == 0
        assert out.exists()


class TestPresets:
    def test_minimal_does_not_fold_smart_chars(self, messy_csv):
        result = runner.invoke(app, [str(messy_csv), "--apply", "--preset", "minimal"])
        assert result.exit_code == 0
        cleaned = messy_csv.parent / f"{messy_csv.stem}_cleaned.csv"
        df = pd.read_csv(cleaned)
        # Smart quotes (U+201C / U+201D) preserved under minimal preset
        assert "\u201c" in df["name"].iloc[1] or "\u201d" in df["name"].iloc[1]

    def test_excel_hygiene_default_folds_smart_chars(self, messy_csv):
        result = runner.invoke(app, [str(messy_csv), "--apply"])
        assert result.exit_code == 0
        cleaned = messy_csv.parent / f"{messy_csv.stem}_cleaned.csv"
        df = pd.read_csv(cleaned)
        assert df["name"].iloc[1] == '"Bob"'

    def test_unknown_preset_errors(self, messy_csv):
        result = runner.invoke(app, [str(messy_csv), "--preset", "weird"])
        assert result.exit_code != 0
        assert "Unknown preset" in result.output


class TestColumnSelection:
    def test_columns_flag(self, messy_csv):
        result = runner.invoke(
            app, [str(messy_csv), "--apply", "--columns", "name"],
        )
        assert result.exit_code == 0
        cleaned = messy_csv.parent / f"{messy_csv.stem}_cleaned.csv"
        df = pd.read_csv(cleaned)
        assert df["name"].iloc[0] == "Alice"
        # city should be untouched (still has spaces)
        assert df["city"].iloc[1] == " LA "

    def test_skip_flag(self, messy_csv):
        result = runner.invoke(
            app, [str(messy_csv), "--apply", "--skip", "name"],
        )
        assert result.exit_code == 0
        cleaned = messy_csv.parent / f"{messy_csv.stem}_cleaned.csv"
        df = pd.read_csv(cleaned)
        # name should still have spaces
        assert df["name"].iloc[0].startswith(" ")


class TestCaseFlag:
    def test_bare_case_applies_to_all(self, tmp_path):
        path = tmp_path / "names.csv"
        pd.DataFrame({"a": ["alice"], "b": ["bob"]}).to_csv(path, index=False)
        result = runner.invoke(app, [str(path), "--apply", "--case", "upper"])
        assert result.exit_code == 0
        df = pd.read_csv(tmp_path / "names_cleaned.csv")
        assert df["a"].iloc[0] == "ALICE"
        assert df["b"].iloc[0] == "BOB"

    def test_per_column_case(self, tmp_path):
        path = tmp_path / "names.csv"
        pd.DataFrame({"name": ["alice"], "code": ["abc"]}).to_csv(path, index=False)
        result = runner.invoke(
            app, [str(path), "--apply", "--case", "title:name,upper:code"],
        )
        assert result.exit_code == 0
        df = pd.read_csv(tmp_path / "names_cleaned.csv")
        assert df["name"].iloc[0] == "Alice"
        assert df["code"].iloc[0] == "ABC"


class TestConfigRoundTrip:
    def test_save_and_load(self, messy_csv, tmp_path):
        cfg = tmp_path / "opts.json"
        result1 = runner.invoke(
            app,
            [str(messy_csv), "--save-config", str(cfg), "--preset", "minimal", "--no-trim"],
        )
        assert result1.exit_code == 0
        assert cfg.exists()
        # Reload and apply
        result2 = runner.invoke(app, [str(messy_csv), "--apply", "--config", str(cfg)])
        assert result2.exit_code == 0
        cleaned = messy_csv.parent / f"{messy_csv.stem}_cleaned.csv"
        df = pd.read_csv(cleaned)
        # With --no-trim, leading spaces survive
        assert df["name"].iloc[0].startswith(" ")
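
The `--case` tests above exercise two spellings: a bare mode (`upper`) that applies to every selected column, and comma-separated `mode:column` pairs (`title:name,upper:code`). The CLI source is not shown in this part of the diff, so the following is only a sketch of how such a value could be parsed. `parse_case_spec` and `VALID_MODES` are hypothetical names, and the mode list is taken from the modes that the `apply_case` tests use.

```python
from __future__ import annotations

# Assumption: these four modes mirror the ones exercised by the apply_case tests.
VALID_MODES = {"upper", "lower", "title", "sentence"}


def parse_case_spec(spec: str) -> tuple[str | None, dict[str, str]]:
    """Split a --case value into (global_mode, per_column_modes)."""
    global_mode = None
    per_column = {}
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        # "title:name" -> mode "title", column "name"; "upper" -> mode only.
        mode, sep, column = part.partition(":")
        if mode not in VALID_MODES:
            raise ValueError(f"Unknown case mode: {mode!r}")
        if sep:
            if not column:
                raise ValueError(f"Missing column name in {part!r}")
            per_column[column] = mode
        else:
            global_mode = mode
    return global_mode, per_column


print(parse_case_spec("upper"))
print(parse_case_spec("title:name,upper:code"))
```

Under these assumptions the first call yields a global mode with no per-column entries, and the second yields only per-column entries, which lines up with the `case` and `case_columns` options that the core tests pass to `CleanOptions`.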

tests/test_text_clean.py Normal file

@@ -0,0 +1,482 @@
"""Tests for src/core/text_clean.py.
Covers edge cases E1-E50 from TECHNICAL.md Section 10.2 plan.
"""

from __future__ import annotations

import json

import numpy as np
import pandas as pd
import pytest

from src.core.text_clean import (
    CleanOptions,
    PRESETS,
    apply_case,
    clean_dataframe,
    clean_value,
    collapse_whitespace,
    fold_smart_chars,
    normalize_line_endings,
    sentence_case,
    smart_title_case,
    strip_bom,
    strip_control,
    strip_zero_width,
    to_nfc,
    to_nfkc,
    trim,
)


# ---------------------------------------------------------------------------
# Per-string helpers
# ---------------------------------------------------------------------------


class TestTrim:
    def test_strips_leading_and_trailing(self):
        assert trim(" hello ") == "hello"

    def test_preserves_internal_spaces(self):
        assert trim(" a b ") == "a b"

    def test_empty_string(self):
        assert trim("") == ""

    def test_idempotent(self):
        assert trim(trim(" x ")) == trim(" x ")


class TestCollapseWhitespace:
    def test_multiple_spaces(self):
        # Input holds a run of spaces; a single-space input would test nothing.
        assert collapse_whitespace("a    b") == "a b"

    def test_tab_inside_cell(self):  # E2
        assert collapse_whitespace("a\tb") == "a b"

    def test_mixed_tabs_and_spaces(self):  # E3
        assert collapse_whitespace("a \t \t b") == "a b"

    def test_idempotent(self):
        assert collapse_whitespace(collapse_whitespace("a    b")) == collapse_whitespace("a    b")


class TestNFC:
    def test_combining_acute(self):  # E6
        decomposed = "e\u0301"  # e + combining acute
        composed = "\u00e9"  # é
        assert to_nfc(decomposed) == composed

    def test_idempotent(self):
        s = "café"
        assert to_nfc(to_nfc(s)) == to_nfc(s)


class TestNFKC:
    def test_circled_digit(self):  # E7
        assert to_nfkc("\u2460") == "1"  # circled digit one

    def test_ligature(self):  # E7
        assert to_nfkc("\ufb01") == "fi"  # Latin small ligature fi

    def test_idempotent(self):
        assert to_nfkc(to_nfkc("①fi")) == to_nfkc("①fi")


class TestSmartChars:
    def test_curly_quotes(self):  # E11
        assert fold_smart_chars("\u2018hi\u2019") == "'hi'"  # curly single quotes
        assert fold_smart_chars("“hi”") == '"hi"'

    def test_dashes(self):  # E12
        assert fold_smart_chars("a—b") == "a-b"
        assert fold_smart_chars("a\u2013b") == "a-b"  # en dash

    def test_ellipsis(self):  # E13
        assert fold_smart_chars("wait…") == "wait..."

    def test_nbsp(self):  # E14
        assert fold_smart_chars("a\u00a0b") == "a b"  # no-break space

    def test_idempotent(self):
        s = "“hi” — a b"
        assert fold_smart_chars(fold_smart_chars(s)) == fold_smart_chars(s)


class TestZeroWidth:
    def test_zwsp_midword(self):  # E16
        assert strip_zero_width("foo\u200bbar") == "foobar"

    def test_bidi_marks_stripped(self):  # E17
        # LRM (U+200E) and RLM (U+200F)
        assert strip_zero_width("a\u200eb\u200fc") == "abc"

    def test_word_joiner(self):  # E18
        assert strip_zero_width("a\u2060b") == "ab"

    def test_mid_string_feff(self):  # E22
        assert strip_zero_width("foo\ufeffbar") == "foobar"


class TestStripBOM:
    def test_leading_bom(self):
        assert strip_bom("\ufeffhello") == "hello"

    def test_no_bom(self):
        assert strip_bom("hello") == "hello"

    def test_idempotent(self):
        assert strip_bom(strip_bom("x")) == strip_bom("x")


class TestStripControl:
    def test_null_byte(self):  # E20
        assert strip_control("a\x00b") == "ab"

    def test_preserves_tab_newline_cr(self):  # E19
        assert strip_control("a\tb\nc\rd") == "a\tb\nc\rd"

    def test_strips_other_control(self):
        # Of the C0 controls (0x01-0x1F), only \t, \n, and \r are kept.
        assert strip_control("a\x01b\x07c\x1fd") == "abcd"

    def test_strips_del(self):
        assert strip_control("a\x7fb") == "ab"


class TestLineEndings:
    def test_crlf(self):  # E23
        assert normalize_line_endings("a\r\nb") == "a\nb"

    def test_bare_cr(self):  # E24
        assert normalize_line_endings("a\rb") == "a\nb"

    def test_idempotent(self):
        assert (
            normalize_line_endings(normalize_line_endings("a\r\nb\rc"))
            == normalize_line_endings("a\r\nb\rc")
        )


class TestSmartTitleCase:
    def test_preserves_acronym(self):  # E26
        assert smart_title_case("USA report") == "USA Report"
        assert smart_title_case("nasa launch") == "Nasa Launch"  # already lower
        assert smart_title_case("NASA launch") == "NASA Launch"

    def test_lowercases_particles_midstring(self):  # E27
        assert smart_title_case("the lord of the rings") == "The Lord of the Rings"
        assert smart_title_case("a tale of two cities") == "A Tale of Two Cities"

    def test_keeps_first_and_last_capitalized(self):
        # "of" at the end stays capitalized
        result = smart_title_case("kingdom of")
        assert result == "Kingdom Of"

    def test_apostrophe(self):
        assert smart_title_case("o'neil") == "O'neil"


class TestSentenceCase:
    def test_basic(self):  # E28
        assert sentence_case("hello. how are you? fine!") == "Hello. How are you? Fine!"

    def test_preserves_punctuation(self):
        assert sentence_case("WHAT? OK.") == "What? Ok."


class TestApplyCase:
    def test_modes(self):
        assert apply_case("Hello World", "upper") == "HELLO WORLD"
        assert apply_case("Hello World", "lower") == "hello world"
        assert apply_case("hello world", "title") == "Hello World"
        assert apply_case("hello. world.", "sentence") == "Hello. World."

    def test_unknown_mode_raises(self):
        with pytest.raises(ValueError):
            apply_case("x", "weird")  # type: ignore[arg-type]


# ---------------------------------------------------------------------------
# clean_value composition
# ---------------------------------------------------------------------------


class TestCleanValue:
    def test_default_excel_hygiene(self):
        opts = CleanOptions()
        out, ops = clean_value("“Hello world” ", opts)
        assert out == '"Hello world"'
        assert "fold_smart_chars" in ops
        assert "trim" in ops

    def test_pure_whitespace_to_empty(self):  # E1
        opts = CleanOptions()
        out, ops = clean_value(" ", opts)
        assert out == ""

    def test_nbsp_only_cell(self):  # E5
        opts = CleanOptions()
        out, _ = clean_value("\u00a0", opts)  # cell holds only a no-break space
        assert out == ""

    def test_non_string_passthrough(self):  # E32
        opts = CleanOptions()
        for val in (None, 42, 3.14, True, np.nan):
            out, ops = clean_value(val, opts)
            # NaN compares unequal to itself; check pd.isna for that case
            if isinstance(val, float) and pd.isna(val):
                assert pd.isna(out)
            else:
                assert out == val
            assert ops == []

    def test_empty_string(self):
        opts = CleanOptions()
        out, ops = clean_value("", opts)
        assert out == ""
        assert ops == []

    def test_only_unchanged_ops_not_logged(self):
        opts = CleanOptions(
            trim=True, collapse_whitespace=True, nfc=False, nfkc=False,
            fold_smart_chars=False, strip_zero_width=False,
            strip_bom=False, strip_control=False,
            normalize_line_endings=False,
        )
        out, ops = clean_value("hello", opts)
        assert out == "hello"
        assert ops == []


class TestIdempotency:
    """E40 — applying the pipeline twice yields the same result as once."""

    @pytest.mark.parametrize("preset", list(PRESETS.keys()))
    def test_preset_idempotent(self, preset):
        opts = CleanOptions.from_preset(preset)
        cases = [
            "“Hello world” ",
            " \t multi space \r\n ",
            "café",
            "éclair",
            "\ufeffleading-bom",
            "USA and the Rings",
            "a\x00b\x01c",
            "",
            " ",
        ]
        for s in cases:
            once, _ = clean_value(s, opts)
            twice, _ = clean_value(once, opts)
            assert once == twice, f"not idempotent on {s!r} (preset {preset})"


# ---------------------------------------------------------------------------
# clean_dataframe
# ---------------------------------------------------------------------------


class TestCleanDataframe:
    def test_only_string_columns_touched(self):  # E31, E33, E35
        df = pd.DataFrame({
            "name": [" Alice ", "Bob"],
            "age": [30, 25],
            "joined": pd.to_datetime(["2024-01-01", "2024-02-01"]),
            "active": [True, False],
        })
        result = clean_dataframe(df)
        assert result.cleaned_df["name"].tolist() == ["Alice", "Bob"]
        assert result.cleaned_df["age"].tolist() == [30, 25]
        assert result.cleaned_df["active"].tolist() == [True, False]
        assert "name" in result.columns_processed
        assert "age" not in result.columns_processed

    def test_explicit_columns(self):  # E41
        df = pd.DataFrame({"a": [" x "], "b": [" y "]})
        result = clean_dataframe(df, CleanOptions(columns=["a"]))
        assert result.cleaned_df["a"].iloc[0] == "x"
        assert result.cleaned_df["b"].iloc[0] == " y "
        assert result.columns_processed == ["a"]

    def test_skip_columns(self):  # E42
        df = pd.DataFrame({"name": [" A "], "notes": [" free text "]})
        result = clean_dataframe(df, CleanOptions(skip_columns=["notes"]))
        assert result.cleaned_df["name"].iloc[0] == "A"
        assert result.cleaned_df["notes"].iloc[0] == " free text "

    def test_unknown_column_raises(self):
        df = pd.DataFrame({"a": ["x"]})
        with pytest.raises(ValueError):
            clean_dataframe(df, CleanOptions(columns=["missing"]))

    def test_empty_dataframe(self):  # E43
        df = pd.DataFrame()
        result = clean_dataframe(df)
        assert result.cells_changed == 0
        assert result.cells_total == 0
        assert result.cleaned_df.empty

    def test_single_column_file(self):  # E44
        df = pd.DataFrame({"only": [" hello "]})
        result = clean_dataframe(df)
        assert result.cleaned_df["only"].iloc[0] == "hello"

    def test_all_numeric_no_op(self):  # E45
        df = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]})
        result = clean_dataframe(df)
        assert result.columns_processed == []
        assert result.cells_changed == 0

    def test_mixed_object_column_strings_only(self):  # E34
        df = pd.DataFrame({"mix": [" hello ", 42, None]})
        result = clean_dataframe(df)
        assert result.cleaned_df["mix"].iloc[0] == "hello"
        assert result.cleaned_df["mix"].iloc[1] == 42
        assert result.cleaned_df["mix"].iloc[2] is None

    def test_nan_preserved(self):  # E32
        df = pd.DataFrame({"a": [" x ", np.nan]})
        result = clean_dataframe(df)
        assert result.cleaned_df["a"].iloc[0] == "x"
        assert pd.isna(result.cleaned_df["a"].iloc[1])

    def test_changes_audit_count(self):  # E48
        df = pd.DataFrame({"a": [" x ", "y", " z"]})
        result = clean_dataframe(df)
        assert result.cells_changed == 2
        assert len(result.changes) == 2
        assert set(result.changes["row"].tolist()) == {0, 2}

    def test_does_not_mutate_input(self):
        df = pd.DataFrame({"a": [" x "]})
        original = df.copy()
        clean_dataframe(df)
        assert df.equals(original)

    def test_per_column_case_via_case_columns(self):
        df = pd.DataFrame({"name": ["alice"], "code": ["abc"]})
        result = clean_dataframe(df, CleanOptions(case_columns={"code": "upper"}))
        assert result.cleaned_df["name"].iloc[0] == "alice"
        assert result.cleaned_df["code"].iloc[0] == "ABC"

    def test_global_case_applied_to_selected_only(self):
        df = pd.DataFrame({"name": ["alice"], "notes": ["bob"]})
        result = clean_dataframe(
            df, CleanOptions(columns=["name"], case="upper"),
        )
        assert result.cleaned_df["name"].iloc[0] == "ALICE"
        assert result.cleaned_df["notes"].iloc[0] == "bob"


# ---------------------------------------------------------------------------
# Presets and config round-trip
# ---------------------------------------------------------------------------


class TestPresets:
    def test_minimal_only_trim_collapse(self):
        opts = CleanOptions.from_preset("minimal")
        assert opts.trim is True
        assert opts.collapse_whitespace is True
        assert opts.nfc is False
        assert opts.fold_smart_chars is False

    def test_excel_hygiene_smart_chars_on_nfkc_off(self):
        opts = CleanOptions.from_preset("excel-hygiene")
        assert opts.fold_smart_chars is True
        assert opts.nfc is True
        assert opts.nfkc is False

    def test_paranoid_includes_nfkc(self):
        opts = CleanOptions.from_preset("paranoid")
        assert opts.nfkc is True

    def test_unknown_preset_raises(self):
        with pytest.raises(ValueError):
            CleanOptions.from_preset("does-not-exist")


class TestConfigRoundTrip:
    def test_dict_roundtrip(self):  # E49
        opts = CleanOptions(
            trim=False, nfc=True, columns=["a", "b"], skip_columns=["c"],
            case="upper",
        )
        recovered = CleanOptions.from_dict(opts.to_dict())
        assert recovered == opts

    def test_file_roundtrip(self, tmp_path):
        path = tmp_path / "opts.json"
        opts = CleanOptions(case_columns={"code": "upper"}, fold_smart_chars=False)
        opts.to_file(path)
        loaded = CleanOptions.from_file(path)
        assert loaded == opts

    def test_unknown_keys_ignored(self):  # E50
        data = {"trim": True, "totally_made_up_key": 42}
        opts = CleanOptions.from_dict(data)
        assert opts.trim is True


# ---------------------------------------------------------------------------
# Use-case smoke tests (whole-pipeline)
# ---------------------------------------------------------------------------


class TestUseCases:
    def test_excel_save_as_csv_utf8_bom(self):
        # UC3: BOM at start of first cell
        df = pd.DataFrame({"name": ["\ufeffAlice", "Bob"], "city": ["NYC", "LA"]})
        result = clean_dataframe(df)
        assert result.cleaned_df["name"].iloc[0] == "Alice"

    def test_word_smart_quotes_in_product_titles(self):
        # UC2
        df = pd.DataFrame({"title": ["“Best Dog Collar”", "Cat Toy — Red"]})
        result = clean_dataframe(df)
        assert result.cleaned_df["title"].iloc[0] == '"Best Dog Collar"'
        assert result.cleaned_df["title"].iloc[1] == "Cat Toy - Red"

    def test_nbsp_in_email_field(self):
        # UC10: invisible Unicode hiding in emails
        df = pd.DataFrame({"email": ["alice\u200b@test.com", "bob\u00a0@test.com"]})
        result = clean_dataframe(df)
        # The ZWSP is stripped. The NBSP is folded to a plain space, and since
        # trim only touches the ends, "bob @test.com" keeps its internal space.
        # That is correct: the cleaner does not know the value is an email
        # (script 03 owns email format). Just confirm the invisible characters
        # are gone.
        assert "\u200b" not in result.cleaned_df["email"].iloc[0]
        assert "\u00a0" not in result.cleaned_df["email"].iloc[1]

    def test_quickbooks_trailing_spaces(self):
        # UC6: VLOOKUP fails because of trailing spaces
        df = pd.DataFrame({"vendor": ["ACME Corp ", "ACME Corp"]})
        result = clean_dataframe(df)
        assert result.cleaned_df["vendor"].iloc[0] == result.cleaned_df["vendor"].iloc[1]

    def test_bank_export_crlf_in_memo(self):
        # UC5: \r\n inside multi-line memo cells
        df = pd.DataFrame({"memo": ["line one\r\nline two\r\nline three"]})
        result = clean_dataframe(df)
        assert "\r" not in result.cleaned_df["memo"].iloc[0]
        assert result.cleaned_df["memo"].iloc[0].count("\n") == 2


# ---------------------------------------------------------------------------
# Reporting / dtype edge cases
# ---------------------------------------------------------------------------


class TestReporting:
    def test_changes_columns_present(self):
        df = pd.DataFrame({"a": [" x "]})
        result = clean_dataframe(df)
        assert list(result.changes.columns) == [
            "row", "column", "old", "new", "ops_applied",
        ]

    def test_changes_empty_when_no_changes(self):
        df = pd.DataFrame({"a": ["x", "y"]})
        result = clean_dataframe(df)
        assert result.cells_changed == 0
        assert result.changes.empty

    def test_cells_total_counts_only_processed_columns(self):
        df = pd.DataFrame({"a": ["x", "y", "z"], "n": [1, 2, 3]})
        result = clean_dataframe(df)
        assert result.cells_total == 3  # only "a" is processed
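
For readers without `src/core/text_clean.py` at hand, the sketch below shows one way the behaviour pinned down by the NFC, NFKC, and zero-width tests could be satisfied using only the standard library. It is not the module's implementation. The `*_sketch` names and `is_idempotent` are hypothetical, and the exact set of code points in `ZERO_WIDTH` is an assumption based on the characters named in the tests (ZWSP, bidi marks, word joiner, mid-string U+FEFF).

```python
import re
import unicodedata

# Assumption: the real helper may strip more or fewer code points than these.
ZERO_WIDTH = re.compile("[\u200b\u200c\u200d\u200e\u200f\u2060\ufeff]")


def strip_zero_width_sketch(text: str) -> str:
    """Remove zero-width and bidi-mark characters anywhere in the string."""
    return ZERO_WIDTH.sub("", text)


def to_nfc_sketch(text: str) -> str:
    """Canonical composition: 'e' + combining acute becomes a single 'é'."""
    return unicodedata.normalize("NFC", text)


def to_nfkc_sketch(text: str) -> str:
    """Compatibility composition: also folds ligatures and circled digits."""
    return unicodedata.normalize("NFKC", text)


def is_idempotent(func, sample: str) -> bool:
    """True when applying func twice gives the same result as applying it once."""
    once = func(sample)
    return func(once) == once


print(to_nfc_sketch("e\u0301") == "\u00e9")   # True
print(to_nfkc_sketch("\ufb01"))               # fi
print(to_nfc_sketch("\ufb01") == "\ufb01")    # True: NFC leaves the ligature alone
```

The last line illustrates why the `excel-hygiene` preset can enable NFC while leaving NFKC to `paranoid`: NFC only recombines canonically equivalent sequences, whereas NFKC rewrites characters such as ligatures into different text.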