feat: implement text cleaner (script 02) with CLI, GUI, and tests
Builds 02_text_cleaner.py from stub to working: character-level hygiene for CSV/Excel inputs covering trim, whitespace collapse, smart-character folding, Unicode NFC/NFKC, BOM strip, zero-width strip, control-char strip, line-ending normalization, and per-column case conversion. Three presets (minimal/excel-hygiene/paranoid) keep the buyer surface small.

- src/core/text_clean.py: pure helpers + CleanOptions/CleanResult + clean_dataframe with dtype-safe column selection
- src/cli_text_clean.py: Typer CLI mirroring the dedup CLI shape (dry-run by default, --apply writes cleaned + changes audit, JSON config save/load)
- src/gui/pages/2_Text_Cleaner.py: real Streamlit page with preset picker, advanced toggles, preview, before/after metrics, and three download buttons
- tests/test_text_clean.py + test_cli_text_clean.py: 92 new tests covering edge cases E1-E50 from the spec
- samples/messy_text.csv: demo dataset surfacing UC1, UC3, UC6, UC10 in 10 rows
- test-cases/uc16-uc26 + ec05-ec09: per-use-case and per-edge-case fixtures

Docs: TECHNICAL.md §10.2 (full Tier 1/2/3 spec), DECISIONS.md v1.7 entry locking the spec, CLI-REFERENCE.md gains the text cleaner section, README.md gains a top-level Text Cleaner block, USER-GUIDE.md status row 02 promoted Skeleton -> Working.

200/200 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
46
README.md
@@ -1,6 +1,13 @@
# DataTools Deduplicator
# DataTools

Find and remove duplicate rows in CSV, delimited text, and Excel files — with fuzzy matching, smart normalization, and interactive review.
A bundle of Python data-cleaning tools for CSV and Excel files. Two scripts ship today; more are in build.

| # | Tool | What it does |
|---|---|---|
| 01 | **Deduplicator** | Find and remove duplicate rows with exact + fuzzy matching, smart normalization, and interactive review. |
| 02 | **Text Cleaner** | Trim whitespace, fold smart quotes, strip invisible / control characters, normalize Unicode, normalize line endings, optional case conversion. |

## Deduplicator

## Features
@@ -107,6 +114,41 @@ When `--apply` is used, three files are produced:
| `{input}_removed.csv` | Rows that were removed |
| `{input}_match_groups.csv` | Audit trail: group ID, confidence, matched columns, survivor flag |

## Text Cleaner

Character-level hygiene for messy CSV / Excel input. Solves the dirty-data failure modes that silently break VLOOKUPs, dedup runs, and downstream imports:

- Trailing / leading whitespace and tabs in cells
- Non-breaking spaces (`U+00A0`) hiding inside text where regular spaces should be
- Smart quotes pasted from Word (`“` `”` `‘` `’` → `"` `"` `'` `'`)
- Em / en dashes, ellipsis, other typographic Unicode
- Zero-width and bidi-mark characters (`U+200B`, `U+200C`, `U+200D`, etc.)
- BOMs from Excel "Save As CSV UTF-8"
- Mixed line endings (`\r\n`, bare `\r`) inside multi-line cells
- Control characters (`U+0000`-`U+001F` minus `\t \n \r`)
- Optional Unicode NFC / NFKC normalization
- Optional per-column case conversion (UPPER / lower / smart Title / Sentence)

```bash
# Preview what would change (dry-run)
python -m src.cli_text_clean samples/messy_text.csv

# Apply the safe defaults
python -m src.cli_text_clean samples/messy_text.csv --apply

# Title-case the name column, upper-case the SKU column
python -m src.cli_text_clean products.csv --case title:name,upper:sku --apply

# Just trim and collapse — nothing fancy
python -m src.cli_text_clean messy.csv --preset minimal --apply
```

Three presets: `minimal` (trim + collapse only), `excel-hygiene` (default; everything safe ON), `paranoid` (adds lossy NFKC fold).

Outputs `{input}_cleaned.csv` plus a per-cell `{input}_changes.csv` audit (row, column, old, new, ops applied).

See [docs/CLI-REFERENCE.md](docs/CLI-REFERENCE.md#text-cleaner-cli) for every flag.
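
As a rough illustration of what the operations listed in this section involve, the sketch below chains a few of them using only the standard library. It is not the code in `src/core/text_clean.py`: the name `clean_cell_sketch`, the character table, and the order of operations are all assumptions made for this example.

```python
import re

# Hypothetical fold table for illustration; the shipped table may differ.
SMART_CHARS = {
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u2013": "-", "\u2014": "-",   # en dash, em dash
    "\u2026": "...",                # ellipsis
    "\u00a0": " ",                  # non-breaking space
}
# Zero-width space / non-joiner / joiner, word joiner, and stray BOM.
ZERO_WIDTH = re.compile(r"[\u200b\u200c\u200d\u2060\ufeff]")
# C0 controls and DEL, keeping tab (\x09), LF (\x0a), and CR (\x0d).
CONTROL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def clean_cell_sketch(value: str) -> str:
    """Apply a fixed, assumed sequence of character-level fixes to one cell."""
    text = value.replace("\r\n", "\n").replace("\r", "\n")  # line endings
    for src, dst in SMART_CHARS.items():                    # smart-char fold
        text = text.replace(src, dst)
    text = ZERO_WIDTH.sub("", text)                         # invisible chars
    text = CONTROL.sub("", text)                            # control chars
    text = re.sub(r"[ \t]+", " ", text)                     # collapse spaces/tabs
    return text.strip()                                     # trim


print(clean_cell_sketch("  \u201cBest\u00a0Pet  Supplies\u201d "))
```

The sketch collapses only spaces and tabs, so newlines inside multi-line cells survive the whitespace step.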

## Documentation

- [CLI Reference](docs/CLI-REFERENCE.md) — every flag with examples and recipe sections
@@ -1,6 +1,17 @@
# CLI Reference

Complete command-line reference for the DataTools Deduplicator.
Complete command-line reference for the DataTools bundle.

DataTools ships two CLI modules so each script can be invoked independently:

| Module | Command | Purpose |
|---|---|---|
| `src.cli` | `python -m src.cli INPUT_FILE [OPTIONS]` | Deduplicator (script 01) |
| `src.cli_text_clean` | `python -m src.cli_text_clean INPUT_FILE [OPTIONS]` | Text cleaner (script 02) |

The deduplicator section is below; the text cleaner reference is in [Section: Text Cleaner CLI](#text-cleaner-cli).

## Deduplicator

```
python -m src.cli INPUT_FILE [OPTIONS]
@@ -282,3 +293,122 @@ When `--apply` is set, three files are written:
## Logging

Every run writes a timestamped log to `logs/dedup_YYYYMMDD_HHMMSS.log` with full debug-level details: strategies used, pair comparisons, survivor decisions, and merge actions.

---

# Text Cleaner CLI

Character-level hygiene for CSV / Excel files: whitespace trim and collapse, smart-character folding, Unicode normalization, BOM strip, control-char strip, line-ending normalization, optional case conversion. See TECHNICAL.md Section 10.2 for the full functional spec.

```
python -m src.cli_text_clean INPUT_FILE [OPTIONS]
```

## Arguments

| Argument | Required | Description |
|----------|----------|-------------|
| `INPUT_FILE` | Yes | Path to the CSV, TSV, or Excel file to clean |

## Options

### Core

| Flag | Short | Default | Description |
|------|-------|---------|-------------|
| `--apply` | | `false` | Write output files. Without this flag, only a preview is shown. |
| `--output` | `-o` | `{input}_cleaned.csv` | Output file path. |
| `--preset` | | `excel-hygiene` | Preset bundle of safe defaults. See [Presets](#presets). |

### Scope

| Flag | Default | Description |
|------|---------|-------------|
| `--columns` | all string columns | Comma-separated columns to clean. |
| `--skip` | none | Comma-separated columns to skip even if they look like text. Useful for free-text notes columns you don't want touched. |

### Per-operation toggles

These override the active preset.

| Flag | Effect |
|------|--------|
| `--no-trim` | Disable leading/trailing whitespace strip |
| `--no-collapse` | Disable internal whitespace collapse |
| `--no-nfc` | Disable Unicode NFC normalization |
| `--nfkc` | Enable NFKC compatibility fold (lossy: `①` → `1`, `ﬁ` → `fi`) |
| `--no-smart-chars` | Disable smart-character folding (curly quotes, em/en-dash, NBSP, ellipsis) |
| `--no-zero-width` | Disable zero-width / invisible character strip |
| `--no-bom` | Disable leading BOM strip |
| `--no-control` | Disable control-character strip |
| `--no-line-endings` | Disable line-ending normalization |

### Case conversion

| Flag | Forms | Description |
|------|-------|-------------|
| `--case` | `upper`, `lower`, `title`, `sentence` | Apply this case to every selected column |
| `--case` | `mode:col[,mode:col]` | Per-column case (e.g., `--case title:name,upper:code`) |

Title case preserves all-caps tokens (`USA` stays `USA`) and lowercases mid-string particles (`of`, `and`, `the`, etc.).
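
The title-case rule can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the shipped implementation: `smart_title_sketch` and the `PARTICLES` set are hypothetical names, and the real particle list is not shown in this reference.

```python
# Assumed particle list for illustration only.
PARTICLES = {"of", "and", "the", "in", "for", "a", "an", "to"}


def smart_title_sketch(text: str) -> str:
    """Title-case words, keeping all-caps tokens and lowercasing particles."""
    out = []
    for index, word in enumerate(text.split(" ")):
        if len(word) > 1 and word.isupper():
            out.append(word)                      # keep acronyms such as USA
        elif index > 0 and word.lower() in PARTICLES:
            out.append(word.lower())              # mid-string particle
        else:
            out.append(word[:1].upper() + word[1:].lower())
    return " ".join(out)


print(smart_title_sketch("bank of the USA"))
```

One consequence of preserving all-caps tokens is that a fully upper-case value such as `GRACE HOPPER` passes through this sketch unchanged, because it cannot tell an acronym from shouted text.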

### Audit and config

| Flag | Default | Description |
|------|---------|-------------|
| `--full-changelog` | `false` | Write every cell change to the audit CSV (default caps to first 1000). |
| `--config` | none | Load options from a saved JSON config file. |
| `--save-config` | none | Save the current options to a JSON config file. |

### File format / encoding

| Flag | Default | Description |
|------|---------|-------------|
| `--sheet` | `0` | Excel sheet name or 0-based index. |
| `--encoding` | auto-detect | Override auto-detected file encoding. |
| `--header-row` | auto-detect | 0-based row index for the header. |

## Presets

| Preset | What it does |
|---|---|
| `minimal` | Trim + collapse whitespace only. Nothing else. |
| `excel-hygiene` (default) | Trim, collapse, NFC, smart-character fold, zero-width strip, BOM strip, control strip, line-ending normalize. NFKC off. |
| `paranoid` | All of `excel-hygiene` plus NFKC compatibility fold (lossy). |
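
One way to think about how a preset and the per-operation toggles combine is "start from the preset, then apply explicit overrides". The sketch below models that with a frozen dataclass. The names `OptionsSketch`, `PRESETS`, and `resolve_options` are hypothetical; the actual `CleanOptions` fields in `src/core/text_clean.py` are not shown here and may differ.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class OptionsSketch:
    """Assumed option set; defaults mirror the excel-hygiene row above."""
    trim: bool = True
    collapse: bool = True
    nfc: bool = True
    nfkc: bool = False
    smart_chars: bool = True
    zero_width: bool = True
    bom: bool = True
    control: bool = True
    line_endings: bool = True


PRESETS = {
    "minimal": OptionsSketch(
        nfc=False, smart_chars=False, zero_width=False,
        bom=False, control=False, line_endings=False,
    ),
    "excel-hygiene": OptionsSketch(),
    "paranoid": OptionsSketch(nfkc=True),
}


def resolve_options(preset: str = "excel-hygiene", **overrides: bool) -> OptionsSketch:
    """Start from the preset, then let explicit toggles win."""
    return replace(PRESETS[preset], **overrides)


# Roughly what `--preset minimal --nfkc` would mean under this model.
print(resolve_options("minimal", nfkc=True))
```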

## Output Files

When `--apply` is set:

| File | Description |
|------|-------------|
| `{stem}_cleaned.csv` | Cleaned DataFrame |
| `{stem}_changes.csv` | Per-cell audit: `row`, `column`, `old`, `new`, `ops_applied` (capped to 1000 rows by default; use `--full-changelog` for all) |

A timestamped log is always written to `logs/text_clean_YYYYMMDD_HHMMSS.log`.
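
The capped audit file can be sketched with the standard `csv` module. This is an illustration only: `write_changes_sketch` is a hypothetical helper, and passing `cap=None` is an assumed way to model `--full-changelog`.

```python
import csv
import io
from itertools import islice
from typing import Iterable, Optional, TextIO


def write_changes_sketch(
    changes: Iterable[tuple], handle: TextIO, cap: Optional[int] = 1000
) -> int:
    """Write the header plus at most `cap` change rows; return rows written."""
    writer = csv.writer(handle)
    writer.writerow(["row", "column", "old", "new", "ops_applied"])
    rows = changes if cap is None else islice(changes, cap)
    written = 0
    for change in rows:
        writer.writerow(change)
        written += 1
    return written


# 1500 synthetic changes; only the first 1000 reach the capped file.
demo_changes = [(i, "vendor", "ACME Corp ", "ACME Corp", "trim") for i in range(1500)]
capped = io.StringIO()
print(write_changes_sketch(demo_changes, capped))
```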

## Recipes

```bash
# Preview what would change with the safe defaults
python -m src.cli_text_clean messy.csv

# Apply the safe defaults
python -m src.cli_text_clean messy.csv --apply

# Just the basics — only trim and collapse, leave Unicode/quotes alone
python -m src.cli_text_clean messy.csv --preset minimal --apply

# Title-case the name column, upper-case the SKU column, leave others alone for case
python -m src.cli_text_clean people.csv --case title:name,upper:sku --apply

# Clean only specific columns
python -m src.cli_text_clean orders.csv --columns vendor,product --apply

# Skip a free-text notes column from cleaning
python -m src.cli_text_clean tickets.csv --skip notes --apply

# Save the current settings as a profile and reload it later
python -m src.cli_text_clean messy.csv --preset minimal --case upper --save-config my.json
python -m src.cli_text_clean other.csv --config my.json --apply
```
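
The save-then-reload recipe amounts to a JSON round trip. The sketch below shows the general idea with the standard library; the key names, the `DEFAULTS` dictionary, and the choice to ignore unknown keys are assumptions for illustration, not the documented config schema.

```python
import json
import tempfile
from pathlib import Path

# Assumed option keys; the real config file may use different names.
DEFAULTS = {"preset": "excel-hygiene", "case": None, "trim": True}


def save_config_sketch(options: dict, path: Path) -> None:
    """Write options as stable, human-readable JSON."""
    path.write_text(json.dumps(options, indent=2, sort_keys=True), encoding="utf-8")


def load_config_sketch(path: Path, defaults: dict) -> dict:
    """Load options, filling gaps from defaults and ignoring unknown keys."""
    loaded = json.loads(path.read_text(encoding="utf-8"))
    merged = dict(defaults)
    merged.update({key: value for key, value in loaded.items() if key in defaults})
    return merged


with tempfile.TemporaryDirectory() as tmp:
    profile = Path(tmp) / "my.json"
    save_config_sketch({"preset": "minimal", "case": "upper", "unknown_key": 1}, profile)
    restored = load_config_sketch(profile, DEFAULTS)

print(restored)
```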
@@ -250,6 +250,7 @@ Own-domain SEO is treated as a long-term compounding asset (6-18 months to tract
| April 28, 2026 (v1.3) | **Add hosted browser demo as secondary distribution surface and conversion lever** | Direct consequence of Streamlit choice. See Section 5 and BUSINESS.md Section 7. |
| April 28, 2026 (v1.4) | **Re-apply 03/05 script boundary work dropped during v1.3 merge (silent drift recovery)** | Stream B v1.2 content (sharpened 03/05 descriptions in USER-GUIDE, run-order rule, TECHNICAL.md Section 9 boundary spec, RECOVERY.md pointer) was overwritten when Stream A's parallel v1.3 Streamlit work was saved to project. Restoring per the doc's own no-silent-drift rule. 03 owns "what's not there" (missing values, sentinel codes, imputation), 05 owns "what shouldn't be there" (statistical outliers, domain rules, winsorization). 03 runs before 05 because outlier statistics on data containing NaN or sentinel codes are mathematically poisoned. See TECHNICAL.md Section 9. |
| April 28, 2026 (v1.5) | **Add `02_text_cleaner.py` as new script; renumber 02-08 → 03-09** | Audit revealed character-level hygiene (whitespace trimming, multi-space collapse, Unicode normalization, BOM handling, line-ending normalization, special-character handling) had no clear owner. Was implicitly scattered: `01_deduplicator` normalizes internally for matching only (doesn't write back), `02_format_standardizer` (now 03) implies it but its named scope is dates/currencies/names/phones/addresses, `03_missing_value_handler` (now 04) only handles whitespace-only as disguised null. A buyer with trailing-space pollution had no obvious script to run. Per Section 4a (functional scope principle: one-stop shopping for the workflow), this was a real gap. Added as 02 because text cleaning is a pre-processing step that should run before format standardization, missing-value handling, and outlier detection. Kept 01 (deduplicator) at position 1 as the lead/working/marketing-flagship script; numbering does not strictly equal pipeline order, the orchestrator manages execution order. Renumber consequence: TECHNICAL.md Section 9 boundary references updated 03→04, 05→06; orchestrator references updated 08→09. New contested case documented in Section 9.3: whitespace-only cells (02 trims first, leaving empty string; 04 then detects empty strings as disguised null). Master orchestrator now 09. |
| April 29, 2026 (v1.7) | **Adopt `02_text_cleaner.py` Tier 1/2/3 functional spec; lock `excel-hygiene` as default preset** | Promotes character-level hygiene from a stub to a buildable v1 target. Strategic framing: Excel/Power Query/OpenRefine fail this category for non-technical buyers; the gap is "one-click correctness for dirty-CSV failure modes that cause silent VLOOKUP misses." Spec covers 10 toggleable ops (trim, collapse, NFC, smart-char fold, zero-width strip, BOM strip, control strip, line-ending normalize, NFKC opt-in, per-column case), per-column scope control, dry-run-by-default, per-cell change audit, idempotency, three presets (`minimal`/`excel-hygiene`/`paranoid`), and JSON config save/load. Output shape mirrors deduplicator: `{input}_cleaned.csv`, `{input}_changes.csv`, `logs/text_clean_{ts}.log`. Boundary with adjacent scripts re-asserted: 02 trims whitespace-only cells to empty (04 then detects empty as null per Section 9.3); 02 is *write-time* and stays distinct from `01_deduplicator`'s match-time `normalize_string` helper. Smart-character fold defaults ON in `excel-hygiene` because demo value is highest there and dry-run preview makes the change visible before commit. NFKC stays opt-in (lossy). `ftfy` mojibake repair deferred to Tier 2 to avoid the 5MB dep without buyer demand. CLI ships as separate `src/cli_text_clean.py` module per the one-CLI-per-script pattern in TECHNICAL Section 3.2. Full spec in TECHNICAL.md Section 10.2. |
| April 28, 2026 (v1.6) | **Fold conversation-history content into docs: deduplicator functional spec, lead bundle use cases, competitive landscape, full GUI framework comparison matrix, concrete 04/06 boundary examples, expanded Streamlit-to-SaaS reasoning** | None of this represents new decisions; all of it represents prior analysis that lived only in chat history and was at risk of evaporating. Per the doc's own no-silent-drift rule (Section 8) and the v1.4 recovery story, valuable analysis must be promoted to docs to survive. Specifically: TECHNICAL.md gains Section 10 (per-script functional specs, starting with the deduplicator's 36-item tiered spec) which is the buildable target for the v1 launch GUI port; this also makes the gap between "currently working" (exact + basic fuzzy) and "v1 launch best-of-class" (Tier 1) explicit so the docs don't quietly overstate where the code is. Section 9.3 gains three concrete distinguishing examples (bank-export blank fees / $1M outlier / "999=refused") that prove 04 and 06 are distinct concerns. BUSINESS.md gains Section 4a (Lead Bundle Deep Dive: 15 use cases by persona, 6-row competitive landscape table, market gap statement) which feeds landing page copy and demo design. Section 4c gains a 10-dimension scored framework matrix and per-option summaries (locks the rejection reasoning against re-litigation), plus expanded point 4 on Streamlit-to-SaaS migration cost. RECOVERY.md updated to reference Section 10 in rebuild and priority steps. No structural decisions changed; this is pure capture work. |
---
@@ -430,6 +430,81 @@ This section captures the full functional spec for each script, beyond the one-l
35. Schedule / cron integration.
36. Direct Shopify / Klaviyo / Mailchimp API integration to dedupe in place. This would be a real differentiator for the Shopify niche specifically and is probably the right v2 direction if early sales validate the niche.

### 10.2 - 10.9 (Future)
### 10.2 `02_text_cleaner.py` - Character-level hygiene

Functional specs for scripts 02 through 09 to be added when each script enters active build. The deduplicator spec is the template; specs for other scripts should follow the same Tier 1 / Tier 2 / Tier 3 structure with explicit strategic framing (what's the market gap this script fills, given that some of its functionality is available free elsewhere).
**Current implementation status**: Stub only. `src/gui/pages/2_Text_Cleaner.py` is a placeholder UI with disabled controls. No `src/core/text_clean.py`, no CLI, no tests. Tier 1 below is the v1 launch target; nothing in this section is built yet.

**Strategic framing**: Excel and the OS provide effectively nothing here. Find/Replace fixes one character at a time. Power Query's "Clean" strips control chars but ignores BOMs, smart quotes, NBSPs, and zero-width chars. OpenRefine has the operations buried under "Common transforms" where the buyer never finds them. Pandas users `df.applymap(str.strip)` and miss everything else.

The market gap this script fills: **one-click correctness for the dirty-CSV failure modes that cause "why won't this VLOOKUP match?"** Trailing spaces, NBSP-in-place-of-space, smart quotes pasted from Word, mojibake, BOMs from Excel's "Save As CSV UTF-8". The buyer doesn't know they need this script until it fixes a problem they have spent two hours on. Demo value is high: the before/after diff sells itself.

**Boundary clarification** (cross-references Section 9):
- 02 owns whitespace, Unicode normalization, smart-character folding, BOM strip, line-ending normalization, zero-width strip, control-char strip, case ops. Writes cleaned values back to disk.
- 03 (format standardizer) owns dates, currencies, names, phones, addresses.
- 04 (missing values) owns disguised nulls (`N/A`, `-`, `unknown`, sentinel codes). Whitespace-only cells: 02 trims first to empty string; 04 then detects empty as null (per Section 9.3).
- 01 (deduplicator) has its own `normalize_string` helper for *match-time* case-folding. That is a match-time policy and stays distinct from 02's *write-time* policy. The two will not be merged; 02 may use lower-level helpers but does not aggressively case-fold cleaned output by default.

#### Tier 1: Must-ship for v1 to be best-of-class

**Operations** (each independently toggleable; defaults given for the `excel-hygiene` preset)

1. Whitespace trim - leading/trailing on every cell. Default ON.
2. Internal whitespace collapse - multi-space and tabs-in-cells to single space. Default ON.
3. Unicode NFC normalization - combining-character forms folded to canonical (e.g., `e + U+0301` to single `é`). Default ON.
4. Unicode NFKC normalization - compat fold (`①` to `1`, `ﬁ` to `fi`). Default OFF, lossy, opt-in only. Not part of any preset other than `paranoid`.
5. Smart-character folding - curly quotes to ASCII, em/en-dash to hyphen, ellipsis `…` to `...`, NBSP `U+00A0` to space. Default ON.
6. Zero-width / invisible character strip - `U+200B`, `U+200C`, `U+200D`, `U+2060`, mid-string `U+FEFF`. Default ON.
7. BOM strip - `U+FEFF` at the start of the first cell of the first column (covers the case where the I/O layer didn't catch it). Default ON.
8. Control character strip - `U+0000`-`U+001F` and `U+007F`, *except* preserve `\t`, `\n`, `\r`. Default ON.
9. Line-ending normalization - within multi-line cells, `\r\n` and bare `\r` to `\n`. Default ON.
10. Case conversion - UPPER / lower / Title / Sentence. Default OFF, per-column. Title case is "smart": preserves all-caps tokens (`USA`, `NASA`) and lowercases mid-string particles (`of`, `and`, `the`).
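
The difference between items 3 and 4 can be checked directly with Python's standard `unicodedata` module. The snippet below is only a demonstration of the two normalization forms, not an excerpt from the cleaner; the variable names are illustrative.

```python
import unicodedata

# NFC: a base letter plus combining accent folds to one precomposed code point.
decomposed = "Jose\u0301"            # "e" followed by U+0301 COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed))

# NFKC: compatibility characters are replaced by plain equivalents (lossy).
circled_one = "\u2460"               # CIRCLED DIGIT ONE
fi_ligature = "\ufb01"               # LATIN SMALL LIGATURE FI
print(unicodedata.normalize("NFKC", circled_one))
print(unicodedata.normalize("NFKC", fi_ligature))

# NFC alone leaves compatibility characters untouched, which is why it is safe by default.
print(unicodedata.normalize("NFC", circled_one) == circled_one)
```

NFKC is lossy in the sense that, once `①` has become `1`, nothing in the output records that the original was a circled digit.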

**Scope control**

11. Per-column selection - by default operate on string-typed columns only; numeric / datetime columns pass through untouched. User can pick columns explicitly via `--columns`.
12. Skip-list - exclude specific columns via `--skip` even if they match the string-dtype filter (e.g., free-text notes columns).

**Trust and audit**

13. Dry-run preview by default. Output shows N cells that would change in column X. `--apply` writes. Non-negotiable for trust. Same standard as the deduplicator.
14. Per-cell change log: `{input}_changes.csv` with (row, column, old, new, ops_applied). Capped to first N rows by default to avoid 50MB audit files; `--full-changelog` removes the cap.
15. Three output files on `--apply`: `{input}_cleaned.csv`, `{input}_changes.csv`, `logs/text_clean_{ts}.log`. Mirrors the deduplicator output shape.
16. Original input file is never modified.
17. Idempotency: `clean(clean(x)) == clean(x)` for every individual op and every preset. Asserted as a property test.
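
The idempotency requirement in item 17 can be illustrated with a small check. The sketch below uses a toy `clean_sketch` (trim plus collapse only) and a hand-picked sample list; both are assumptions for this example, and the project's real property test is not reproduced here.

```python
import re
from typing import Callable, Iterable


def clean_sketch(text: str) -> str:
    """Toy cleaner: collapse runs of spaces/tabs, then trim."""
    return re.sub(r"[ \t]+", " ", text).strip()


def is_idempotent(fn: Callable[[str], str], samples: Iterable[str]) -> bool:
    """True when applying fn twice equals applying it once for every sample."""
    return all(fn(fn(sample)) == fn(sample) for sample in samples)


SAMPLES = ["", "  ACME Corp  ", "a \t  b", "Line 1\nLine 2", "   "]


def naive_collapse(text: str) -> str:
    """Counter-example: a single replace pass leaves work for a second pass."""
    return text.replace("  ", " ")


print(is_idempotent(clean_sketch, SAMPLES))
print(is_idempotent(naive_collapse, ["a    b"]))
```

The counter-example shows why the property is worth asserting: a single `str.replace` pass turns four spaces into two, so a second run changes the value again.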

**Configuration**

18. Presets: `--preset excel-hygiene` (everything safe ON, NFKC OFF, case OFF), `--preset minimal` (only trim + collapse), `--preset paranoid` (everything including NFKC). Buyers should not have to learn 9 flags. Default preset when no flag given: `excel-hygiene`.
19. Save / load JSON config. Same shape and reuse pattern as `DeduplicationConfig`.

**UX**

20. `--help` written for non-technical users with concrete examples, not a flag dump. Per DECISIONS.md Section 4b.
21. Progress bar for files over ~10K rows.
22. Error messages name the row, column, and value that caused the problem. No raw stack traces.
23. Sample data (`samples/messy_text.csv`) demonstrates: smart quotes from Excel, NBSP-vs-space, BOM, mixed line endings, zero-width chars. The before/after diff is the demo.

#### Tier 2: Worth-considering for v1.1

24. Custom regex find/replace - power-user escape hatch, per-column.
25. Diacritic strip (`José` to `Jose`). Lossy; opt-in only.
26. Mojibake auto-repair - detect `Ã©` to `é` patterns (UTF-8 read as Latin-1 then re-encoded) and fix. Standard tool: `ftfy`. Promote to Tier 1 if early buyers report this.
27. Punctuation normalization - all Unicode dash/quote/space variants folded; runs of punctuation collapsed.
28. Profile detector - scan a file and recommend which ops to enable based on what's actually present. Lowers config friction further.
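
The mojibake pattern in item 26 (UTF-8 bytes misread as Latin-1) can be reversed for the simplest cases with the standard library alone. The sketch below is a stand-in for illustration, not `ftfy` and not a planned implementation; `repair_mojibake_sketch` is a hypothetical name, and it handles only the single Latin-1 round trip described above.

```python
def repair_mojibake_sketch(text: str) -> str:
    """Undo one UTF-8-read-as-Latin-1 round trip; leave other text untouched."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text holds characters outside Latin-1, or the bytes are
        # not valid UTF-8, so it was probably not mojibake of this kind.
        return text


print(repair_mojibake_sketch("Jos\u00c3\u00a9"))
```

Correctly encoded text such as `José` fails the UTF-8 decode step and is returned unchanged, which keeps this sketch from damaging clean input. Broader cases (for example Windows-1252 misreads) are outside what it attempts.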

#### Tier 3: Optional / later

29. Locale-aware case conversion (Turkish dotted/dotless `i`, German `ß`).
30. Custom character-class strip rules (regex-class).
31. Streaming / chunked write for very large files (defer until a buyer reports it).

#### Open decisions captured at spec time

- Smart-character folding default ON in `excel-hygiene` accepted as the right tradeoff: highest-impact use case, dry-run preview makes the change visible before commit.
- NFKC stays Tier 1 but OFF by default and excluded from `excel-hygiene`. Lossy by design.
- CLI surface: separate `src/cli_text_clean.py` module, matching the "one CLI binary per script on PATH" pattern in Section 3.2. Not a subcommand on the existing dedup Typer app.
- `ftfy` dependency deferred to Tier 2 (~5MB). Revisit if mojibake reports come in.

### 10.3 - 10.9 (Future)

Functional specs for scripts 03 through 09 to be added when each script enters active build. The deduplicator (10.1) and text cleaner (10.2) specs are the template; specs for other scripts should follow the same Tier 1 / Tier 2 / Tier 3 structure with explicit strategic framing (what's the market gap this script fills, given that some of its functionality is available free elsewhere).
@@ -63,7 +63,7 @@ If you prefer the command line, every script also ships as a CLI tool. See Secti
| # | Script | Purpose | Status |
|---|---|---|---|
| 01 | `01_deduplicator.py` | Smart duplicate removal: exact match + basic fuzzy, configurable subset columns, full logs | Working |
| 02 | `02_text_cleaner.py` | Character-level hygiene: trim leading/trailing whitespace, collapse internal multi-spaces, strip non-printable characters, Unicode normalization (smart quotes, em-dashes, accents), remove zero-width characters, BOM handling, line-ending normalization, case operations | Skeleton |
| 02 | `02_text_cleaner.py` | Character-level hygiene: trim leading/trailing whitespace, collapse internal multi-spaces, strip non-printable characters, Unicode normalization (smart quotes, em-dashes, accents), remove zero-width characters, BOM handling, line-ending normalization, case operations | Working |
| 03 | `03_format_standardizer.py` | Standardize dates, currencies, names, phone numbers, addresses | Skeleton |
| 04 | `04_missing_value_handler.py` | Detect and handle missing values: disguised nulls (`N/A`, `-`, blanks, sentinel codes), imputation (mean/median/mode/forward-fill), required-field enforcement, drop-by-threshold | Skeleton |
| 05 | `05_column_mapper_enforcer.py` | Rename columns, enforce a target schema | Skeleton |
13
samples/messy_text.csv
Normal file
@@ -0,0 +1,13 @@
customer_name,email,vendor,memo
Alice Johnson,alice@example.com,ACME Corp ,Welcome aboard
Bob Smith,bob@example.com,ACME Corp,Returning customer
Charlie Brown,charlie@example.com,Globex,Net 30
Diana Prince,diana@example.com,Globex,VIP
Edward Norton,ed@example.com,“Best Pet Supplies”,Order#42 - rush
Frank Castle,frank@example.com,Stark—Industries,"Line 1
Line 2
Line 3"
grace HOPPER ,grace@example.com,Globex,Loves long memos…
Henry Ford,henry@example.com,Ford Motor,Industrial
Iris West,iris@example.com,S.T.A.R. Labs,Notewith-bell
Jane Doe,jane@example.com,Acme,Standard
373
src/cli_text_clean.py
Normal file
@@ -0,0 +1,373 @@
"""CLI for the DataTools text cleaner (script 02).

Usage:
    python -m src.cli_text_clean input.csv                    # dry-run preview
    python -m src.cli_text_clean input.csv --apply            # write cleaned file
    python -m src.cli_text_clean input.csv --preset minimal --apply
    python -m src.cli_text_clean input.csv --case upper:name --apply
    python -m src.cli_text_clean --help                       # full help
"""

from __future__ import annotations

import sys
from datetime import datetime
from pathlib import Path
from typing import Optional

import typer
from loguru import logger

app = typer.Typer(
    name="text-clean",
    help=(
        "Clean and normalize text content in CSV and Excel files.\n\n"
        "By default, runs in preview mode — shows what would change without "
        "modifying anything. Add --apply to write the output.\n\n"
        "Examples:\n\n"
        " # Preview what would change\n"
        " python -m src.cli_text_clean messy.csv\n\n"
        " # Apply the safe defaults (excel-hygiene preset)\n"
        " python -m src.cli_text_clean messy.csv --apply\n\n"
        " # Minimal: only trim and collapse whitespace\n"
        " python -m src.cli_text_clean messy.csv --preset minimal --apply\n\n"
        " # Title-case the 'name' column, leave others alone for case\n"
        " python -m src.cli_text_clean people.csv --case title:name --apply\n\n"
        " # Clean only specific columns\n"
        " python -m src.cli_text_clean orders.csv --columns vendor,product --apply\n\n"
        " # Skip a free-text column from cleaning\n"
        " python -m src.cli_text_clean tickets.csv --skip notes --apply\n"
    ),
    add_completion=False,
    no_args_is_help=True,
)


# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------

def _setup_logging(log_dir: Path) -> Path:
    """Configure loguru to write a timestamped log file. Returns the log path."""
    log_dir.mkdir(parents=True, exist_ok=True)
    ts = datetime.now().strftime("%Y%m%d_%H%M%S")
    log_path = log_dir / f"text_clean_{ts}.log"
    logger.remove()
    logger.add(sys.stderr, level="WARNING", format="{message}")
    logger.add(
        str(log_path),
        level="DEBUG",
        format="{time:YYYY-MM-DD HH:mm:ss} | {level:<8} | {message}",
    )
    return log_path


def _parse_case(raw: Optional[str]) -> tuple[Optional[str], dict[str, str]]:
    """Parse --case argument.

    Forms:
        --case upper                  -> ("upper", {})   (apply to all selected)
        --case title:name             -> (None, {"name": "title"})
        --case upper:code,title:name  -> (None, {...})
    """
    if not raw:
        return None, {}
    if ":" not in raw:
        # Bare mode applies to all selected columns
        return raw.strip(), {}
    per_col: dict[str, str] = {}
    for piece in raw.split(","):
        piece = piece.strip()
        if not piece:
            continue
        if ":" not in piece:
            raise typer.BadParameter(
                f"Invalid --case piece: '{piece}'. "
                f"Expected 'mode' or 'mode:col[,mode:col...]' "
                f"(e.g., 'upper' or 'title:name,upper:code')."
            )
        mode, col = piece.split(":", 1)
        per_col[col.strip()] = mode.strip()
    return None, per_col


def _split_csv_arg(raw: Optional[str]) -> Optional[list[str]]:
    if raw is None:
        return None
    return [c.strip() for c in raw.split(",") if c.strip()]
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Main command
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
@app.command()
|
||||
def clean(
|
||||
input_file: str = typer.Argument(
|
||||
...,
|
||||
help="Path to the CSV or Excel file to clean.",
|
||||
),
|
||||
output: Optional[str] = typer.Option(
|
||||
None, "--output", "-o",
|
||||
help="Output file path. Default: {input}_cleaned.csv",
|
||||
),
|
||||
apply: bool = typer.Option(
|
||||
False, "--apply",
|
||||
help="Write the output files. Without this flag, only a preview is shown.",
|
||||
),
|
||||
preset: str = typer.Option(
|
||||
"excel-hygiene", "--preset",
|
||||
help="Preset: minimal, excel-hygiene, or paranoid.",
|
||||
),
|
||||
columns: Optional[str] = typer.Option(
|
||||
None, "--columns",
|
||||
help="Comma-separated columns to clean (default: all string columns).",
|
||||
),
|
||||
skip: Optional[str] = typer.Option(
|
||||
None, "--skip",
|
||||
help="Comma-separated columns to skip even if they look like text.",
|
||||
),
|
||||
case: Optional[str] = typer.Option(
|
||||
None, "--case",
|
||||
help=(
|
||||
"Case conversion. Bare mode 'upper'|'lower'|'title'|'sentence' applies to "
|
||||
"all selected columns. Per-column form: 'mode:col[,mode:col]' "
|
||||
"(e.g., 'title:name,upper:code')."
|
||||
),
|
||||
),
|
||||
no_trim: bool = typer.Option(False, "--no-trim", help="Disable whitespace trim."),
|
||||
no_collapse: bool = typer.Option(
|
||||
False, "--no-collapse", help="Disable internal whitespace collapse.",
|
||||
),
|
||||
no_nfc: bool = typer.Option(False, "--no-nfc", help="Disable Unicode NFC normalization."),
|
||||
nfkc: bool = typer.Option(
|
||||
False, "--nfkc",
|
||||
help="Enable NFKC compat fold (lossy: ① → 1, ﬁ → fi). Default off.",
|
||||
),
|
||||
no_smart_chars: bool = typer.Option(
|
||||
False, "--no-smart-chars",
|
||||
help="Disable smart-character folding (curly quotes, em/en-dash, NBSP).",
|
||||
),
|
||||
no_zero_width: bool = typer.Option(
|
||||
False, "--no-zero-width", help="Disable zero-width / invisible char strip.",
|
||||
),
|
||||
no_bom: bool = typer.Option(False, "--no-bom", help="Disable BOM strip."),
|
||||
no_control: bool = typer.Option(
|
||||
False, "--no-control", help="Disable control-character strip.",
|
||||
),
|
||||
no_line_endings: bool = typer.Option(
|
||||
False, "--no-line-endings", help="Disable line-ending normalization.",
|
||||
),
|
||||
full_changelog: bool = typer.Option(
|
||||
False, "--full-changelog",
|
||||
help="Write every cell change to the audit CSV (default caps to first 1000).",
|
||||
),
|
||||
config: Optional[str] = typer.Option(
|
||||
None, "--config",
|
||||
help="Load options from a saved JSON config file.",
|
||||
),
|
||||
save_config: Optional[str] = typer.Option(
|
||||
None, "--save-config",
|
||||
help="Save current options to a JSON config file.",
|
||||
),
|
||||
sheet: Optional[str] = typer.Option(
|
||||
None, "--sheet",
|
||||
help="Excel sheet name or index (default: first sheet).",
|
||||
),
|
||||
encoding_override: Optional[str] = typer.Option(
|
||||
None, "--encoding",
|
||||
help="Override auto-detected file encoding.",
|
||||
),
|
||||
header_row: Optional[int] = typer.Option(
|
||||
None, "--header-row",
|
||||
help="0-based row index for the header (default: auto-detect).",
|
||||
),
|
||||
):
|
||||
"""Clean and normalize text in a CSV or Excel file."""
|
||||
from src.core.io import read_file, write_file
|
||||
from src.core.text_clean import (
|
||||
CleanOptions,
|
||||
PRESETS,
|
||||
clean_dataframe,
|
||||
)
|
||||
import pandas as pd
|
||||
|
||||
# ------------------------------------------------------------------
|
||||
# Validate inputs
|
||||
# ------------------------------------------------------------------
|
||||
input_path = Path(input_file)
|
||||
if not input_path.exists():
|
||||
typer.echo(f"Error: File not found: {input_path}", err=True)
|
||||
raise typer.Exit(1)
|
||||
|
||||
if preset not in PRESETS:
|
||||
typer.echo(
|
||||
f"Error: Unknown preset '{preset}'. "
|
||||
f"Choose from: {', '.join(sorted(PRESETS))}.",
|
||||
err=True,
|
||||
)
|
||||
raise typer.Exit(1)
|
||||
|
||||
log_path = _setup_logging(Path("logs"))
|
||||
|
||||
# ------------------------------------------------------------------
|
||||
# Build CleanOptions
|
||||
# ------------------------------------------------------------------
|
||||
if config:
|
||||
cfg_path = Path(config)
|
||||
if not cfg_path.exists():
|
||||
typer.echo(f"Error: Config file not found: {cfg_path}", err=True)
|
||||
raise typer.Exit(1)
|
||||
options = CleanOptions.from_file(cfg_path)
|
||||
logger.info("Loaded config from {}", cfg_path)
|
||||
else:
|
||||
options = CleanOptions.from_preset(preset)
|
||||
|
||||
# CLI overrides on top of preset/config
|
||||
if no_trim:
|
||||
options.trim = False
|
||||
if no_collapse:
|
||||
options.collapse_whitespace = False
|
||||
if no_nfc:
|
||||
options.nfc = False
|
||||
if nfkc:
|
||||
options.nfkc = True
|
||||
if no_smart_chars:
|
||||
options.fold_smart_chars = False
|
||||
if no_zero_width:
|
||||
options.strip_zero_width = False
|
||||
if no_bom:
|
||||
options.strip_bom = False
|
||||
if no_control:
|
||||
options.strip_control = False
|
||||
if no_line_endings:
|
||||
options.normalize_line_endings = False
|
||||
|
||||
cols_list = _split_csv_arg(columns)
|
||||
if cols_list is not None:
|
||||
options.columns = cols_list
|
||||
skip_list = _split_csv_arg(skip)
|
||||
if skip_list:
|
||||
options.skip_columns = skip_list
|
||||
|
||||
bare_case, per_col_case = _parse_case(case)
|
||||
if bare_case:
|
||||
options.case = bare_case # type: ignore[assignment]
|
||||
if per_col_case:
|
||||
options.case_columns = {**options.case_columns, **per_col_case} # type: ignore[dict-item]
|
||||
|
||||
# ------------------------------------------------------------------
|
||||
# Save config if requested (after CLI merge so the file reflects intent)
|
||||
# ------------------------------------------------------------------
|
||||
if save_config:
|
||||
saved = options.to_file(save_config)
|
||||
typer.echo(f"Config saved to {saved}")
|
||||
|
||||
# ------------------------------------------------------------------
|
||||
# Read input
|
||||
# ------------------------------------------------------------------
|
||||
typer.echo(f"Reading {input_path.name}...")
|
||||
try:
|
||||
sheet_arg: str | int | None = None
|
||||
if sheet is not None:
|
||||
try:
|
||||
sheet_arg = int(sheet)
|
||||
except ValueError:
|
||||
sheet_arg = sheet
|
||||
|
||||
df = read_file(
|
||||
input_path,
|
||||
encoding=encoding_override,
|
||||
header_row=header_row,
|
||||
sheet_name=sheet_arg if sheet_arg is not None else 0,
|
||||
)
|
||||
if not isinstance(df, pd.DataFrame):
|
||||
df = pd.concat(list(df), ignore_index=True)
|
||||
except Exception as e:
|
||||
typer.echo(f"Error reading file: {e}", err=True)
|
||||
raise typer.Exit(1)
|
||||
|
||||
typer.echo(f" {len(df)} rows, {len(df.columns)} columns")
|
||||
|
||||
# ------------------------------------------------------------------
|
||||
# Run pipeline
|
||||
# ------------------------------------------------------------------
|
||||
typer.echo("Cleaning text...")
|
||||
try:
|
||||
result = clean_dataframe(df, options)
|
||||
except ValueError as e:
|
||||
typer.echo(f"Error: {e}", err=True)
|
||||
raise typer.Exit(1)
|
||||
|
||||
_print_results(result, input_path, options)
|
||||
|
||||
# ------------------------------------------------------------------
|
||||
# Write output
|
||||
# ------------------------------------------------------------------
|
||||
if apply:
|
||||
stem = input_path.stem
|
||||
out_path = Path(output) if output else input_path.parent / f"{stem}_cleaned.csv"
|
||||
write_file(result.cleaned_df, out_path)
|
||||
typer.echo(f"\nCleaned file: {out_path}")
|
||||
|
||||
if not result.changes.empty:
|
||||
changes_path = input_path.parent / f"{stem}_changes.csv"
|
||||
audit_df = result.changes
|
||||
cap = 1000
|
||||
if not full_changelog and len(audit_df) > cap:
|
||||
typer.echo(
|
||||
f"Note: changelog capped at {cap} rows. "
|
||||
f"Use --full-changelog to write all {len(audit_df)} changes."
|
||||
)
|
||||
audit_df = audit_df.head(cap)
|
||||
write_file(audit_df, changes_path)
|
||||
typer.echo(f"Changes audit: {changes_path}")
|
||||
else:
|
||||
typer.echo("\nThis was a preview. Add --apply to write the output files.")
|
||||
|
||||
typer.echo(f"Log: {log_path}")
|
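The output-path logic in the `--apply` branch can be sketched on its own. This is an illustration under assumptions: `derive_paths` is a hypothetical helper, not part of the CLI. It mirrors what the code above appears to do, including one detail worth knowing: the changes audit is placed next to the input file even when `--output` points somewhere else.

```python
from __future__ import annotations

from pathlib import Path
from typing import Optional


def derive_paths(input_file: str, output: Optional[str] = None) -> tuple[Path, Path]:
    """Return (cleaned_path, changes_path) the way the --apply branch builds them."""
    input_path = Path(input_file)
    stem = input_path.stem
    # The cleaned file honors --output, else falls back to <stem>_cleaned.csv.
    cleaned = Path(output) if output else input_path.parent / f"{stem}_cleaned.csv"
    # The audit file always sits beside the input.
    changes = input_path.parent / f"{stem}_changes.csv"
    return cleaned, changes
```

Note that an `.xlsx` input still produces a `.csv` default output name, since only the stem is reused.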
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Output formatting
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
def _print_results(result, input_path: Path, options) -> None:
|
||||
pct = (result.cells_changed / result.cells_total * 100.0) if result.cells_total else 0.0
|
||||
typer.echo(f"\n{'─'*50}")
|
||||
typer.echo(f" File: {input_path.name}")
|
||||
typer.echo(f" Columns processed: {len(result.columns_processed)}")
|
||||
typer.echo(f" Cells scanned: {result.cells_total}")
|
||||
typer.echo(f" Cells changed: {result.cells_changed} ({pct:.1f}%)")
|
||||
typer.echo(f"{'─'*50}")
|
||||
|
||||
if result.cells_changed and not result.changes.empty:
|
||||
# Per-column change counts
|
||||
counts = result.changes["column"].value_counts()
|
||||
typer.echo("\nChanges by column:")
|
||||
for col, n in counts.head(10).items():
|
||||
typer.echo(f" {col}: {n} cell(s)")
|
||||
if len(counts) > 10:
|
||||
typer.echo(f" ... and {len(counts) - 10} more columns")
|
||||
|
||||
# Show first few examples
|
||||
typer.echo("\nFirst examples:")
|
||||
for _, row in result.changes.head(5).iterrows():
|
||||
old = repr(row["old"])[:40]
|
||||
new = repr(row["new"])[:40]
|
||||
typer.echo(
|
||||
f" Row {row['row'] + 1}, {row['column']}: {old} → {new} "
|
||||
f"[{row['ops_applied']}]"
|
||||
)
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# __main__
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
def main():
|
||||
app()
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
@@ -59,6 +59,25 @@ from .config import (
|
||||
DeduplicationConfig,
|
||||
StrategyConfig,
|
||||
)
|
||||
from .text_clean import (
|
||||
CleanOptions,
|
||||
CleanResult,
|
||||
PRESETS,
|
||||
apply_case,
|
||||
clean_dataframe,
|
||||
clean_value,
|
||||
collapse_whitespace,
|
||||
fold_smart_chars,
|
||||
normalize_line_endings,
|
||||
sentence_case,
|
||||
smart_title_case,
|
||||
strip_bom,
|
||||
strip_control,
|
||||
strip_zero_width,
|
||||
to_nfc,
|
||||
to_nfkc,
|
||||
trim,
|
||||
)
|
||||
|
||||
__all__ = [
|
||||
# Core
|
||||
@@ -90,4 +109,22 @@ __all__ = [
|
||||
"DeduplicationConfig",
|
||||
"StrategyConfig",
|
||||
"ColumnStrategyConfig",
|
||||
# Text cleaning
|
||||
"CleanOptions",
|
||||
"CleanResult",
|
||||
"PRESETS",
|
||||
"clean_dataframe",
|
||||
"clean_value",
|
||||
"trim",
|
||||
"collapse_whitespace",
|
||||
"to_nfc",
|
||||
"to_nfkc",
|
||||
"fold_smart_chars",
|
||||
"strip_zero_width",
|
||||
"strip_bom",
|
||||
"strip_control",
|
||||
"normalize_line_endings",
|
||||
"smart_title_case",
|
||||
"sentence_case",
|
||||
"apply_case",
|
||||
]
|
||||
|
||||
489
src/core/text_clean.py
Normal file
@@ -0,0 +1,489 @@
|
||||
"""Character-level text hygiene for DataFrames.
|
||||
|
||||
Operations are independently toggleable, idempotent, and safe to compose.
|
||||
Each per-string helper is ``str -> str``. Numeric, datetime, and boolean
|
||||
columns pass through ``clean_dataframe`` untouched; only string cells are
|
||||
modified.
|
||||
|
||||
See TECHNICAL.md Section 10.2 for the full functional spec.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
import re
|
||||
import unicodedata
|
||||
from dataclasses import asdict, dataclass, field
|
||||
from pathlib import Path
|
||||
from typing import Any, Callable, Iterable, Literal, Optional
|
||||
|
||||
import pandas as pd
|
||||
from pandas.api import types as pdtypes
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Per-string helpers
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
# Smart-character map (curly quotes, dashes, ellipsis, NBSP, narrow NBSP).
|
||||
_SMART_CHARS: dict[str, str] = {
    "‘": "'",  # LEFT SINGLE QUOTATION MARK
    "’": "'",  # RIGHT SINGLE QUOTATION MARK
    "‚": "'",  # SINGLE LOW-9 QUOTATION MARK
    "‛": "'",  # SINGLE HIGH-REVERSED-9 QUOTATION MARK
    "“": '"',  # LEFT DOUBLE QUOTATION MARK
    "”": '"',  # RIGHT DOUBLE QUOTATION MARK
    "„": '"',  # DOUBLE LOW-9 QUOTATION MARK
    "‟": '"',  # DOUBLE HIGH-REVERSED-9 QUOTATION MARK
    "–": "-",  # EN DASH
    "—": "-",  # EM DASH
    "―": "-",  # HORIZONTAL BAR
    "−": "-",  # MINUS SIGN
    "…": "...",  # HORIZONTAL ELLIPSIS
    "\u00a0": " ",  # NO-BREAK SPACE
    "\u202f": " ",  # NARROW NO-BREAK SPACE
    "\u2009": " ",  # THIN SPACE
    "\u200a": " ",  # HAIR SPACE
    "\u2002": " ",  # EN SPACE
    "\u2003": " ",  # EM SPACE
    "\u3000": " ",  # IDEOGRAPHIC SPACE
}
|
||||
|
||||
_SMART_TRANS = str.maketrans(_SMART_CHARS)
|
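A short sketch of the translate-table technique used for smart-character folding. The map here is a trimmed, hypothetical subset of `_SMART_CHARS`, written with escapes so the invisible characters are unambiguous; `SMART_SUBSET` and `fold` are illustrative names only.

```python
# Hypothetical subset of the smart-character map, for illustration.
SMART_SUBSET = {
    "\u2018": "'",    # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",    # RIGHT SINGLE QUOTATION MARK
    "\u201c": '"',    # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',    # RIGHT DOUBLE QUOTATION MARK
    "\u2014": "-",    # EM DASH
    "\u2026": "...",  # HORIZONTAL ELLIPSIS (one char becomes three)
    "\u00a0": " ",    # NO-BREAK SPACE
}

# str.maketrans accepts single-character keys and replacement strings of any length.
TABLE = str.maketrans(SMART_SUBSET)


def fold(s: str) -> str:
    """Fold the mapped characters to ASCII in a single pass."""
    return s.translate(TABLE)


sample = "\u201cHello\u201d \u2014 it\u2019s\u00a0here\u2026"
folded = fold(sample)
```

A translate table does the whole replacement in one pass over the string, which tends to be simpler and faster than chaining many `str.replace` calls.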
||||
|
||||
# Zero-width / invisible characters. ``U+FEFF`` (BOM/ZWNBSP) is included; if
|
||||
# it appears at the *very start* of the first cell of the first column, the
|
||||
# BOM-strip op handles it; elsewhere it is treated as a zero-width char.
|
||||
_ZERO_WIDTH = (
    "\u200b"  # ZERO WIDTH SPACE
    "\u200c"  # ZERO WIDTH NON-JOINER
    "\u200d"  # ZERO WIDTH JOINER
    "\u2060"  # WORD JOINER
    "\u200e"  # LEFT-TO-RIGHT MARK
    "\u200f"  # RIGHT-TO-LEFT MARK
    "\ufeff"  # ZERO WIDTH NO-BREAK SPACE / BOM
)
_ZERO_WIDTH_RE = re.compile(f"[{_ZERO_WIDTH}]")
|
||||
|
||||
# Control characters: U+0000-U+001F and U+007F, but preserve \t \n \r.
|
||||
_CONTROL_RE = re.compile(r"[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]")
|
||||
|
||||
# Any run of *horizontal* whitespace (spaces, tabs, form/vertical feeds).
|
||||
# Newlines and carriage returns are excluded so multi-line cells keep their
|
||||
# line structure; the line-ending op normalizes the actual line terminators.
|
||||
_WHITESPACE_RUN_RE = re.compile(r"[^\S\n\r]+")
|
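The three character-class patterns above are easiest to understand on a concrete string. The sketch below rebuilds them under local, illustrative names (`ZERO_WIDTH_RE`, `CONTROL_RE`, `HSPACE_RUN_RE`) and applies them one after another; the step order here is chosen for readability and is not meant to restate the pipeline order.

```python
import re

ZERO_WIDTH_RE = re.compile("[\u200b\u200c\u200d\u2060\u200e\u200f\ufeff]")
CONTROL_RE = re.compile(r"[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]")
# "Whitespace that is not \n or \r": a double negative inside one class.
HSPACE_RUN_RE = re.compile(r"[^\S\n\r]+")

raw = "a\u200bb\x00c \t d\n  e"

no_zero_width = ZERO_WIDTH_RE.sub("", raw)          # drops the ZERO WIDTH SPACE
no_control = CONTROL_RE.sub("", no_zero_width)      # drops NUL, keeps \t and \n
collapsed = HSPACE_RUN_RE.sub(" ", no_control)      # each run becomes one space
```

The newline survives every step, so a multi-line cell keeps its line structure while the indentation on its second line shrinks to a single space.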
||||
|
||||
|
||||
def trim(s: str) -> str:
|
||||
"""Strip leading/trailing whitespace."""
|
||||
if not isinstance(s, str):
|
||||
return s
|
||||
return s.strip()
|
||||
|
||||
|
||||
def collapse_whitespace(s: str) -> str:
    """Collapse each run of spaces, tabs, and other non-newline whitespace to one space.

    Newlines and carriage returns are left in place so multi-line cells keep
    their line structure. Leading and trailing whitespace is collapsed but
    not removed (use ``trim`` for that).
    """
    if not isinstance(s, str):
        return s
    return _WHITESPACE_RUN_RE.sub(" ", s)
|
||||
|
||||
|
||||
def to_nfc(s: str) -> str:
|
||||
"""Apply Unicode NFC (canonical composition)."""
|
||||
if not isinstance(s, str):
|
||||
return s
|
||||
return unicodedata.normalize("NFC", s)
|
||||
|
||||
|
||||
def to_nfkc(s: str) -> str:
|
||||
"""Apply Unicode NFKC (compatibility composition). Lossy."""
|
||||
if not isinstance(s, str):
|
||||
return s
|
||||
return unicodedata.normalize("NFKC", s)
|
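The difference between the two normalization forms is worth seeing side by side, since it is the reason `nfkc` is off by default. A small sketch using only the standard library; the sample strings are made up for illustration.

```python
import unicodedata

# 'e' followed by COMBINING ACUTE ACCENT: five code points, renders as "Café".
decomposed = "Cafe\u0301"
composed = unicodedata.normalize("NFC", decomposed)

# CIRCLED DIGIT ONE and the LATIN SMALL LIGATURE FI are compatibility characters.
compat = "\u2460 \ufb01le"
after_nfc = unicodedata.normalize("NFC", compat)    # left untouched
after_nfkc = unicodedata.normalize("NFKC", compat)  # folded, information lost

# Running NFC first does not change what NFKC produces.
same_result = unicodedata.normalize("NFKC", composed) == unicodedata.normalize("NFKC", decomposed)
```

NFC only changes how a character is encoded, so two visually identical strings compare equal afterwards. NFKC also replaces characters with look-alike equivalents, which helps matching but cannot be undone.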
||||
|
||||
|
||||
def fold_smart_chars(s: str) -> str:
|
||||
"""Fold curly quotes, em/en-dashes, ellipsis, NBSP variants to ASCII."""
|
||||
if not isinstance(s, str):
|
||||
return s
|
||||
return s.translate(_SMART_TRANS)
|
||||
|
||||
|
||||
def strip_zero_width(s: str) -> str:
|
||||
"""Remove zero-width and bidi-mark characters."""
|
||||
if not isinstance(s, str):
|
||||
return s
|
||||
return _ZERO_WIDTH_RE.sub("", s)
|
||||
|
||||
|
||||
def strip_bom(s: str) -> str:
    """Remove a leading ``U+FEFF`` (BOM) from the start of the string."""
    if not isinstance(s, str):
        return s
    return s.lstrip("\ufeff")
|
||||
|
||||
|
||||
def strip_control(s: str) -> str:
|
||||
"""Remove control characters except ``\\t \\n \\r``."""
|
||||
if not isinstance(s, str):
|
||||
return s
|
||||
return _CONTROL_RE.sub("", s)
|
||||
|
||||
|
||||
def normalize_line_endings(s: str) -> str:
|
||||
"""Normalize ``\\r\\n`` and bare ``\\r`` to ``\\n``."""
|
||||
if not isinstance(s, str):
|
||||
return s
|
||||
return s.replace("\r\n", "\n").replace("\r", "\n")
|
||||
|
||||
|
||||
# Smart title-case helpers
|
||||
_TITLE_LOWERCASE_PARTICLES = {
|
||||
"a", "an", "and", "as", "at", "but", "by", "en", "for", "if", "in", "nor",
|
||||
"of", "on", "or", "per", "the", "to", "v", "v.", "vs", "vs.", "via",
|
||||
}
|
||||
|
||||
|
||||
def _is_all_caps_token(token: str) -> bool:
    """A token is all-caps when it has two or more characters, at least one letter, and no lowercase letters."""
    has_letter = any(c.isalpha() for c in token)
    has_lower = any(c.islower() for c in token)
    return has_letter and not has_lower and len(token) >= 2
|
||||
|
||||
|
||||
def smart_title_case(s: str) -> str:
|
||||
"""Title-case that preserves all-caps tokens and lowercases mid-string particles.
|
||||
|
||||
- ``USA`` stays ``USA``.
|
||||
- ``of``, ``and``, ``the``, etc. stay lowercase except as the first/last word.
|
||||
- Apostrophes inside words don't restart capitalization (``o'neil`` becomes ``O'neil``).
|
||||
"""
|
||||
if not isinstance(s, str) or not s:
|
||||
return s
|
||||
tokens = s.split(" ")
|
||||
out: list[str] = []
|
||||
last_idx = len(tokens) - 1
|
||||
for i, tok in enumerate(tokens):
|
||||
if not tok:
|
||||
out.append(tok)
|
||||
continue
|
||||
if _is_all_caps_token(tok):
|
||||
out.append(tok)
|
||||
continue
|
||||
lowered = tok.lower()
|
||||
if 0 < i < last_idx and lowered in _TITLE_LOWERCASE_PARTICLES:
|
||||
out.append(lowered)
|
||||
continue
|
||||
# Capitalize first cased character; preserve apostrophes/hyphens
|
||||
chars = list(tok)
|
||||
capitalized = False
|
||||
for j, c in enumerate(chars):
|
||||
if c.isalpha():
|
||||
if not capitalized:
|
||||
chars[j] = c.upper()
|
||||
capitalized = True
|
||||
else:
|
||||
chars[j] = c.lower()
|
||||
out.append("".join(chars))
|
||||
return " ".join(out)
|
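The title-case rules are easier to check against a few inputs. The following is a condensed sketch of the same rules, not the function above: `title_sketch`, `is_all_caps`, and the shortened `PARTICLES` set are hypothetical, and the per-token capitalization is simplified to "lowercase the token, then uppercase its first letter".

```python
PARTICLES = {"a", "an", "and", "in", "of", "on", "or", "the", "to"}  # subset, for illustration


def is_all_caps(token: str) -> bool:
    """Two or more characters, at least one letter, no lowercase letters."""
    has_letter = any(c.isalpha() for c in token)
    has_lower = any(c.islower() for c in token)
    return len(token) >= 2 and has_letter and not has_lower


def title_sketch(s: str) -> str:
    tokens = s.split(" ")
    last = len(tokens) - 1
    out = []
    for i, tok in enumerate(tokens):
        if not tok or is_all_caps(tok):
            out.append(tok)                  # empty tokens and acronyms pass through
        elif 0 < i < last and tok.lower() in PARTICLES:
            out.append(tok.lower())          # mid-string particle stays lowercase
        else:
            chars = list(tok.lower())
            for j, c in enumerate(chars):
                if c.isalpha():
                    chars[j] = c.upper()     # only the first letter is raised
                    break
            out.append("".join(chars))
    return " ".join(out)
```

One consequence of preserving all-caps tokens: a fully upper-case value such as `JOHN SMITH` comes back unchanged, so a column that arrives shouting may need `lower` applied before `title`.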
||||
|
||||
|
||||
def sentence_case(s: str) -> str:
    """Lowercase, then capitalize the first letter of the string and the first letter after each ``. ! ?``."""
    if not isinstance(s, str) or not s:
        return s
    chars = list(s.lower())
    capitalize_next = True
    for i, c in enumerate(chars):
        if c in ".!?":
            capitalize_next = True
            continue
        # Digits, quotes, parentheses, and whitespace do not consume the
        # trigger; only the next letter does.
        if capitalize_next and c.isalpha():
            chars[i] = c.upper()
            capitalize_next = False
    return "".join(chars)
|
||||
|
||||
|
||||
CaseMode = Literal["upper", "lower", "title", "sentence"]
|
||||
|
||||
|
||||
def apply_case(s: str, mode: CaseMode) -> str:
|
||||
if not isinstance(s, str):
|
||||
return s
|
||||
if mode == "upper":
|
||||
return s.upper()
|
||||
if mode == "lower":
|
||||
return s.lower()
|
||||
if mode == "title":
|
||||
return smart_title_case(s)
|
||||
if mode == "sentence":
|
||||
return sentence_case(s)
|
||||
raise ValueError(f"Unknown case mode: {mode}")
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Options / result dataclasses
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
PRESETS: dict[str, dict[str, Any]] = {
|
||||
"minimal": {
|
||||
"trim": True,
|
||||
"collapse_whitespace": True,
|
||||
"nfc": False,
|
||||
"nfkc": False,
|
||||
"fold_smart_chars": False,
|
||||
"strip_zero_width": False,
|
||||
"strip_bom": False,
|
||||
"strip_control": False,
|
||||
"normalize_line_endings": False,
|
||||
},
|
||||
"excel-hygiene": {
|
||||
"trim": True,
|
||||
"collapse_whitespace": True,
|
||||
"nfc": True,
|
||||
"nfkc": False,
|
||||
"fold_smart_chars": True,
|
||||
"strip_zero_width": True,
|
||||
"strip_bom": True,
|
||||
"strip_control": True,
|
||||
"normalize_line_endings": True,
|
||||
},
|
||||
"paranoid": {
|
||||
"trim": True,
|
||||
"collapse_whitespace": True,
|
||||
"nfc": True,
|
||||
"nfkc": True,
|
||||
"fold_smart_chars": True,
|
||||
"strip_zero_width": True,
|
||||
"strip_bom": True,
|
||||
"strip_control": True,
|
||||
"normalize_line_endings": True,
|
||||
},
|
||||
}
|
||||
|
||||
|
||||
@dataclass
|
||||
class CleanOptions:
|
||||
"""Toggles for character-level cleaning operations.
|
||||
|
||||
Defaults match the ``excel-hygiene`` preset.
|
||||
"""
|
||||
|
||||
# Operations
|
||||
trim: bool = True
|
||||
collapse_whitespace: bool = True
|
||||
nfc: bool = True
|
||||
nfkc: bool = False
|
||||
fold_smart_chars: bool = True
|
||||
strip_zero_width: bool = True
|
||||
strip_bom: bool = True
|
||||
strip_control: bool = True
|
||||
normalize_line_endings: bool = True
|
||||
|
||||
# Case conversion: either a single mode applied to all selected columns,
|
||||
# or a dict mapping column name -> mode for per-column control.
|
||||
case: Optional[CaseMode] = None
|
||||
case_columns: dict[str, CaseMode] = field(default_factory=dict)
|
||||
|
||||
# Scope control
|
||||
columns: Optional[list[str]] = None # None = all string-typed columns
|
||||
skip_columns: list[str] = field(default_factory=list)
|
||||
|
||||
@classmethod
|
||||
def from_preset(cls, name: str) -> CleanOptions:
|
||||
if name not in PRESETS:
|
||||
raise ValueError(
|
||||
f"Unknown preset '{name}'. "
|
||||
f"Available: {', '.join(sorted(PRESETS))}."
|
||||
)
|
||||
return cls(**PRESETS[name])
|
||||
|
||||
@classmethod
|
||||
def from_dict(cls, data: dict) -> CleanOptions:
|
||||
known = {f for f in cls.__dataclass_fields__}
|
||||
kwargs = {k: v for k, v in data.items() if k in known}
|
||||
return cls(**kwargs)
|
||||
|
||||
def to_dict(self) -> dict:
|
||||
return asdict(self)
|
||||
|
||||
def to_file(self, path: str | Path) -> Path:
|
||||
out = Path(path)
|
||||
out.write_text(json.dumps(self.to_dict(), indent=2))
|
||||
return out
|
||||
|
||||
@classmethod
|
||||
def from_file(cls, path: str | Path) -> CleanOptions:
|
||||
return cls.from_dict(json.loads(Path(path).read_text()))
|
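The config save/load round trip relies on two things: `asdict` for serialization and key filtering on the way back in. A minimal sketch with a cut-down, hypothetical `MiniOptions` dataclass (three fields instead of the full set), writing to a temporary directory so it leaves nothing behind.

```python
from __future__ import annotations

import json
import tempfile
from dataclasses import asdict, dataclass, field, fields
from pathlib import Path


@dataclass
class MiniOptions:
    """Hypothetical, trimmed stand-in for CleanOptions."""

    trim: bool = True
    nfkc: bool = False
    skip_columns: list[str] = field(default_factory=list)

    @classmethod
    def from_dict(cls, data: dict) -> MiniOptions:
        # Keep only keys that are real fields; anything else is dropped silently.
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in data.items() if k in known})


with tempfile.TemporaryDirectory() as tmp:
    cfg = Path(tmp) / "clean.json"
    payload = {**asdict(MiniOptions(nfkc=True)), "future_flag": 1}
    cfg.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    loaded = MiniOptions.from_dict(json.loads(cfg.read_text(encoding="utf-8")))
```

Ignoring unknown keys keeps older builds able to read newer config files. The trade-off is that a misspelled key is dropped without warning and the default value applies.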
||||
|
||||
|
||||
@dataclass
|
||||
class CleanResult:
|
||||
"""Output of ``clean_dataframe``."""
|
||||
|
||||
cleaned_df: pd.DataFrame
|
||||
changes: pd.DataFrame # cols: row, column, old, new, ops_applied
|
||||
cells_changed: int
|
||||
cells_total: int
|
||||
columns_processed: list[str]
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Cell-level pipeline
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
def _build_pipeline(options: CleanOptions) -> list[tuple[str, Callable[[str], str]]]:
|
||||
"""Return ordered (op_name, fn) pairs for the cell-level pipeline.
|
||||
|
||||
Order is meaningful:
|
||||
1. BOM strip first so a leading FEFF doesn't survive into other ops.
|
||||
2. Line-ending normalize before whitespace ops so \\r\\n collapses cleanly.
|
||||
3. Control-char strip before whitespace ops.
|
||||
4. Smart-char fold before NFC/NFKC (folded ASCII is already NFC-stable).
|
||||
5. NFC then NFKC (NFKC subsumes NFC if both set; we still run NFC first
|
||||
so the result is identical to NFKC alone — kept explicit for logging).
|
||||
6. Zero-width strip after Unicode normalization (NFKC can introduce
|
||||
decomposed forms whose combining marks must not be stripped).
|
||||
7. Whitespace collapse, then trim, last.
|
||||
"""
|
||||
ops: list[tuple[str, Callable[[str], str]]] = []
|
||||
if options.strip_bom:
|
||||
ops.append(("strip_bom", strip_bom))
|
||||
if options.normalize_line_endings:
|
||||
ops.append(("normalize_line_endings", normalize_line_endings))
|
||||
if options.strip_control:
|
||||
ops.append(("strip_control", strip_control))
|
||||
if options.fold_smart_chars:
|
||||
ops.append(("fold_smart_chars", fold_smart_chars))
|
||||
if options.nfc:
|
||||
ops.append(("nfc", to_nfc))
|
||||
if options.nfkc:
|
||||
ops.append(("nfkc", to_nfkc))
|
||||
if options.strip_zero_width:
|
||||
ops.append(("strip_zero_width", strip_zero_width))
|
||||
if options.collapse_whitespace:
|
||||
ops.append(("collapse_whitespace", collapse_whitespace))
|
||||
if options.trim:
|
||||
ops.append(("trim", trim))
|
||||
return ops
|
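To make the ordering concrete, here is a reduced sketch of the same idea: an ordered list of `(name, fn)` pairs, applied in sequence, recording only the steps that changed the value. The lambdas are simplified stand-ins for the real helpers (the smart-character step handles just two characters), and `PIPELINE` and `run` are hypothetical names.

```python
import re
import unicodedata

PIPELINE = [
    ("strip_bom", lambda s: s.lstrip("\ufeff")),
    ("normalize_line_endings", lambda s: s.replace("\r\n", "\n").replace("\r", "\n")),
    ("fold_smart_chars", lambda s: s.replace("\u00a0", " ").replace("\u2019", "'")),
    ("nfc", lambda s: unicodedata.normalize("NFC", s)),
    ("collapse_whitespace", lambda s: re.sub(r"[^\S\n\r]+", " ", s)),
    ("trim", str.strip),
]


def run(value: str) -> tuple[str, list[str]]:
    """Apply each step in order and record the ones that changed the value."""
    cur = value
    applied: list[str] = []
    for name, fn in PIPELINE:
        new = fn(cur)
        if new != cur:
            applied.append(name)
        cur = new
    return cur, applied


# BOM, two leading spaces, curly apostrophe, two NBSPs, decomposed accent, CRLF.
dirty = "\ufeff  Joe\u2019s\u00a0\u00a0Cafe\u0301 \r\n"
cleaned, ops = run(dirty)
```

Because a step is logged only when it changes the value, the `ops_applied` column in the audit reflects what each cell actually needed rather than what was enabled.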
||||
|
||||
|
||||
def clean_value(value: Any, options: CleanOptions) -> tuple[Any, list[str]]:
|
||||
"""Apply the configured pipeline to a single cell.
|
||||
|
||||
Returns ``(cleaned_value, ops_applied)``. Non-strings and missing values
|
||||
pass through unchanged with an empty ``ops_applied`` list.
|
||||
"""
|
||||
if value is None or (isinstance(value, float) and pd.isna(value)):
|
||||
return value, []
|
||||
if not isinstance(value, str):
|
||||
return value, []
|
||||
|
||||
pipeline = _build_pipeline(options)
|
||||
cur = value
|
||||
applied: list[str] = []
|
||||
for name, fn in pipeline:
|
||||
new = fn(cur)
|
||||
if new != cur:
|
||||
applied.append(name)
|
||||
cur = new
|
||||
return cur, applied
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# DataFrame-level entry point
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
def _select_columns(df: pd.DataFrame, options: CleanOptions) -> list[str]:
|
||||
"""Pick the columns the pipeline should operate on.
|
||||
|
||||
- If ``options.columns`` is explicit, use it (after validating).
|
||||
- Otherwise default to columns whose pandas dtype is object/string.
|
||||
- Always exclude ``options.skip_columns``.
|
||||
"""
|
||||
if options.columns is not None:
|
||||
missing = [c for c in options.columns if c not in df.columns]
|
||||
if missing:
|
||||
raise ValueError(
|
||||
f"Columns not found in input: {missing}. "
|
||||
f"Available: {list(df.columns)}"
|
||||
)
|
||||
chosen: Iterable[str] = options.columns
|
||||
else:
|
||||
chosen = [
|
||||
c for c in df.columns
|
||||
if pdtypes.is_object_dtype(df[c]) or pdtypes.is_string_dtype(df[c])
|
||||
]
|
||||
|
||||
skip = set(options.skip_columns)
|
||||
return [c for c in chosen if c not in skip]
|
||||
|
||||
|
||||
def clean_dataframe(df: pd.DataFrame, options: Optional[CleanOptions] = None) -> CleanResult:
|
||||
"""Apply text-cleaning ops to selected columns of *df*.
|
||||
|
||||
Numeric, datetime, and boolean columns are skipped by default. The input
|
||||
DataFrame is not mutated; a copy is returned in ``CleanResult.cleaned_df``.
|
||||
"""
|
||||
options = options or CleanOptions()
|
||||
out = df.copy()
|
||||
columns = _select_columns(out, options)
|
||||
|
||||
case_per_col: dict[str, CaseMode] = dict(options.case_columns)
|
||||
if options.case is not None:
|
||||
for c in columns:
|
||||
case_per_col.setdefault(c, options.case)
|
||||
|
||||
change_records: list[dict[str, Any]] = []
|
||||
cells_changed = 0
|
||||
cells_total = 0
|
||||
|
||||
for col in columns:
|
||||
series = out[col]
|
||||
new_values: list[Any] = []
|
||||
col_case = case_per_col.get(col)
|
||||
for row_idx, original in enumerate(series.tolist()):
|
||||
cells_total += 1
|
||||
cleaned, ops_applied = clean_value(original, options)
|
||||
|
||||
if col_case is not None and isinstance(cleaned, str):
|
||||
cased = apply_case(cleaned, col_case)
|
||||
if cased != cleaned:
|
||||
ops_applied.append(f"case:{col_case}")
|
||||
cleaned = cased
|
||||
|
||||
if ops_applied and cleaned != original:
|
||||
cells_changed += 1
|
||||
change_records.append({
|
||||
"row": row_idx,
|
||||
"column": col,
|
||||
"old": original,
|
||||
"new": cleaned,
|
||||
"ops_applied": ",".join(ops_applied),
|
||||
})
|
||||
new_values.append(cleaned)
|
||||
out[col] = new_values
|
||||
|
||||
changes_df = pd.DataFrame(
|
||||
change_records,
|
||||
columns=["row", "column", "old", "new", "ops_applied"],
|
||||
)
|
||||
|
||||
return CleanResult(
|
||||
cleaned_df=out,
|
||||
changes=changes_df,
|
||||
cells_changed=cells_changed,
|
||||
cells_total=cells_total,
|
||||
columns_processed=columns,
|
||||
)
|
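The way `clean_dataframe` combines the bare `case` mode with `case_columns` can be sketched without pandas. `resolve_case` is a hypothetical helper that mirrors the `setdefault` loop above: per-column entries win, and the bare mode fills in the remaining selected columns.

```python
from __future__ import annotations

from typing import Optional


def resolve_case(
    selected: list[str],
    bare: Optional[str],
    per_column: dict[str, str],
) -> dict[str, str]:
    """Return the effective case mode per column."""
    resolved = dict(per_column)           # copy so the options object is not mutated
    if bare is not None:
        for col in selected:
            resolved.setdefault(col, bare)  # only fills columns with no explicit mode
    return resolved


modes = resolve_case(["name", "code", "notes"], "lower", {"code": "upper"})
# A per-column entry for a column outside the selection stays in the dict...
stray = resolve_case(["name"], None, {"missing": "title"})
# ...but the cleaning loop only looks up selected columns, so it has no effect.
effective = {col: stray.get(col) for col in ["name"]}
```

If that reading is right, a `--case title:some_typo` argument would be accepted and then ignored, so validating per-column names against the input may be worth considering.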
||||
@@ -1,10 +1,13 @@
|
||||
"""DataTools Text Cleaner — stub page."""
|
||||
"""DataTools Text Cleaner — Streamlit page."""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import io
|
||||
import json
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
import pandas as pd
|
||||
import streamlit as st
|
||||
|
||||
_project_root = Path(__file__).resolve().parent.parent.parent.parent
|
||||
@@ -12,82 +15,236 @@ if str(_project_root) not in sys.path:
|
||||
sys.path.insert(0, str(_project_root))
|
||||
|
||||
from src.gui.components import hide_streamlit_chrome
|
||||
from src.core.text_clean import (
|
||||
PRESETS,
|
||||
CleanOptions,
|
||||
clean_dataframe,
|
||||
)
|
||||
|
||||
hide_streamlit_chrome()
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Header
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
st.title("✂️ Text Cleaner")
|
||||
st.caption("Clean and normalize text content across your data.")
|
||||
|
||||
st.info("This tool is under development.")
|
||||
st.caption(
|
||||
"Trim whitespace, fold smart quotes, strip invisible characters, and "
|
||||
"normalize line endings. Runs locally — your data never leaves this computer."
|
||||
)
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# What this tool will do
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
st.markdown("""
|
||||
**Features:**
|
||||
- Trim leading/trailing whitespace
|
||||
- Collapse multiple spaces into one
|
||||
- Unicode normalization (NFC/NFKC)
|
||||
- Strip non-printable / control characters
|
||||
- Remove BOM (byte order mark)
|
||||
- Normalize line endings (CRLF → LF)
|
||||
- Case conversion (upper, lower, title, sentence)
|
||||
""")
|
||||
|
||||
st.divider()
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# File upload (functional)
|
||||
# File upload
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
uploaded = st.file_uploader(
|
||||
"Upload CSV or Excel file",
|
||||
type=["csv", "tsv", "xlsx", "xls"],
|
||||
help="Upload a file to preview. Processing is not yet available.",
|
||||
key="textclean_file_upload",
|
||||
)
|
||||
|
||||
if uploaded is not None:
|
||||
import pandas as pd
|
||||
try:
|
||||
if uploaded.name.endswith((".xlsx", ".xls")):
|
||||
df = pd.read_excel(uploaded)
|
||||
else:
|
||||
df = pd.read_csv(uploaded)
|
||||
st.subheader(f"Preview: {uploaded.name}")
|
||||
st.caption(f"{len(df)} rows, {len(df.columns)} columns")
|
||||
st.dataframe(df.head(10), use_container_width=True)
|
||||
except Exception as e:
|
||||
st.error(f"Failed to read file: {e}")
|
||||
if uploaded is None:
|
||||
st.info("Upload a CSV, TSV, or Excel file to begin.")
|
||||
st.stop()
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Placeholder options
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
st.subheader("Operations")
|
||||
@st.cache_data(show_spinner=False)
|
||||
def _read_uploaded(name: str, data: bytes) -> pd.DataFrame:
|
||||
"""Read the uploaded bytes into a DataFrame, treating all cells as strings."""
|
||||
suffix = Path(name).suffix.lower()
|
||||
bio = io.BytesIO(data)
|
||||
if suffix in (".xlsx", ".xls"):
|
||||
return pd.read_excel(bio, dtype=str, keep_default_na=False)
|
||||
# CSV / TSV — try utf-8 then utf-8-sig then latin-1 as a fallback
|
||||
for enc in ("utf-8", "utf-8-sig", "latin-1"):
|
||||
try:
|
||||
bio.seek(0)
|
||||
sep = "\t" if suffix == ".tsv" else ","
|
||||
return pd.read_csv(
|
||||
bio, dtype=str, keep_default_na=False,
|
||||
encoding=enc, sep=sep, on_bad_lines="warn",
|
||||
)
|
||||
except UnicodeDecodeError:
|
||||
continue
|
||||
bio.seek(0)
|
||||
return pd.read_csv(bio, dtype=str, keep_default_na=False, encoding="latin-1")
|
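A note on the encoding fallback order, shown with plain `bytes.decode` rather than pandas. As far as I can tell, any byte string that `utf-8-sig` can decode is also valid `utf-8`, so with `utf-8` listed first the `utf-8-sig` attempt is never reached and a BOM would stay attached to the first header. The sketch below uses a hypothetical `decode_with_fallback` helper that tries `utf-8-sig` first; whether the page should adopt that order is a judgment call for the author.

```python
def decode_with_fallback(
    data: bytes,
    encodings: tuple = ("utf-8-sig", "utf-8", "latin-1"),
) -> tuple:
    """Return (text, encoding_used) for the first encoding that decodes cleanly."""
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding could decode the data")


bom_csv = b"\xef\xbb\xbfname,city\n"   # UTF-8 file saved with a BOM
latin_csv = b"caf\xe9\n"               # not valid UTF-8

kept_bom = bom_csv.decode("utf-8")     # plain utf-8 keeps U+FEFF in the text
text, used = decode_with_fallback(bom_csv)
legacy_text, legacy_used = decode_with_fallback(latin_csv)
```

Since `latin-1` maps every byte to a character, it never raises `UnicodeDecodeError`, so it works as a last resort but can silently mis-decode files in other legacy encodings.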
||||
|
||||
st.checkbox("Trim whitespace", value=True, disabled=True)
|
||||
st.checkbox("Collapse multiple spaces", value=True, disabled=True)
|
||||
st.checkbox("Unicode normalization (NFC)", value=False, disabled=True)
|
||||
st.checkbox("Strip non-printable characters", value=False, disabled=True)
|
||||
st.checkbox("Remove BOM", value=False, disabled=True)
|
||||
st.checkbox("Normalize line endings", value=False, disabled=True)
|
||||
st.selectbox("Case conversion", ["None", "UPPER", "lower", "Title Case", "Sentence case"], disabled=True)
|
||||
|
||||
try:
|
||||
df = _read_uploaded(uploaded.name, uploaded.getvalue())
|
||||
except Exception as e:
|
||||
st.error(f"Failed to read file: {e}")
|
||||
st.stop()
|
||||
|
||||
st.subheader(f"Preview: {uploaded.name}")
|
||||
st.caption(f"{len(df)} rows, {len(df.columns)} columns")
|
||||
st.dataframe(df.head(10), use_container_width=True)
|
||||
|
||||
st.divider()
|
||||
st.button("Clean Text", type="primary", use_container_width=True, disabled=True)
|
||||
|
||||
# ---------------------------------------------------------------------------
# Options
# ---------------------------------------------------------------------------

st.divider()
st.subheader("Options")
|
||||
|
||||
preset_label = st.radio(
|
||||
"Preset",
|
||||
["excel-hygiene (recommended)", "minimal", "paranoid"],
|
||||
index=0,
|
||||
horizontal=True,
|
||||
help=(
|
||||
"excel-hygiene: trim, collapse whitespace, fold smart quotes, strip "
|
||||
"invisible chars, normalize line endings, NFC. "
|
||||
"minimal: only trim and collapse. "
|
||||
"paranoid: everything including NFKC compat fold (lossy)."
|
||||
),
|
||||
)
|
||||
preset_key = preset_label.split(" ", 1)[0]
|
||||
options = CleanOptions.from_preset(preset_key)
|
||||
|
||||
with st.expander("Advanced options"):
|
||||
col_a, col_b = st.columns(2)
|
||||
with col_a:
|
||||
options.trim = st.checkbox("Trim leading/trailing whitespace", value=options.trim)
|
||||
options.collapse_whitespace = st.checkbox(
|
||||
"Collapse internal whitespace", value=options.collapse_whitespace,
|
||||
)
|
||||
options.normalize_line_endings = st.checkbox(
|
||||
"Normalize line endings (\\r\\n → \\n)", value=options.normalize_line_endings,
|
||||
)
|
||||
options.strip_control = st.checkbox(
|
||||
"Strip control characters", value=options.strip_control,
|
||||
)
|
||||
options.strip_bom = st.checkbox("Strip BOM", value=options.strip_bom)
|
||||
with col_b:
|
||||
options.fold_smart_chars = st.checkbox(
|
||||
"Fold smart characters (curly quotes, em-dash, NBSP)",
|
||||
value=options.fold_smart_chars,
|
||||
)
|
||||
options.strip_zero_width = st.checkbox(
|
||||
"Strip zero-width / invisible characters", value=options.strip_zero_width,
|
||||
)
|
||||
options.nfc = st.checkbox("Unicode NFC normalization", value=options.nfc)
|
||||
options.nfkc = st.checkbox(
|
||||
"Unicode NFKC compat fold (lossy: ① → 1, \ufb01 → fi)",
|
||||
value=options.nfkc,
|
||||
)
|
||||
|
||||
st.markdown("**Scope**")
|
||||
string_cols = [
|
||||
c for c in df.columns
|
||||
if pd.api.types.is_object_dtype(df[c]) or pd.api.types.is_string_dtype(df[c])
|
||||
]
|
||||
selected_cols = st.multiselect(
|
||||
"Columns to clean (default: all string columns)",
|
||||
options=list(df.columns),
|
||||
default=string_cols,
|
||||
)
|
||||
skip_cols = st.multiselect(
|
||||
"Columns to skip even if they look like text",
|
||||
options=list(df.columns),
|
||||
default=[],
|
||||
)
|
||||
options.columns = selected_cols if selected_cols else None
|
||||
options.skip_columns = list(skip_cols)
|
||||
|
||||
st.markdown("**Case conversion**")
|
||||
case_global = st.selectbox(
|
||||
"Apply case conversion to selected columns",
|
||||
["None", "UPPER", "lower", "Title", "Sentence"],
|
||||
index=0,
|
||||
)
|
||||
case_map = {
|
||||
"UPPER": "upper", "lower": "lower",
|
||||
"Title": "title", "Sentence": "sentence",
|
||||
}
|
||||
if case_global != "None":
|
||||
options.case = case_map[case_global] # type: ignore[assignment]
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Run
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
st.divider()
|
||||
|
||||
if st.button("Clean Text", type="primary", use_container_width=True):
|
||||
with st.spinner("Cleaning..."):
|
||||
try:
|
||||
result = clean_dataframe(df, options)
|
||||
except ValueError as e:
|
||||
st.error(str(e))
|
||||
st.stop()
|
||||
st.session_state["textclean_result"] = result
|
||||
st.session_state["textclean_input_name"] = uploaded.name
|
||||
|
||||
result = st.session_state.get("textclean_result")
|
||||
if result is None:
|
||||
st.stop()
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Results
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
st.subheader("Results")
|
||||
|
||||
pct = (result.cells_changed / result.cells_total * 100.0) if result.cells_total else 0.0
|
||||
m1, m2, m3, m4 = st.columns(4)
|
||||
m1.metric("Cells scanned", result.cells_total)
|
||||
m2.metric("Cells changed", result.cells_changed)
|
||||
m3.metric("% changed", f"{pct:.1f}%")
|
||||
m4.metric("Columns processed", len(result.columns_processed))
|
||||
|
||||
if result.cells_changed:
|
||||
counts = result.changes["column"].value_counts()
|
||||
st.markdown("**Changes by column**")
|
||||
st.dataframe(
|
||||
counts.rename("cells_changed").to_frame(),
|
||||
use_container_width=True,
|
||||
)
|
||||
|
||||
st.markdown("**Examples (first 25 changes)**")
|
||||
examples = result.changes.head(25).copy()
|
||||
examples["row"] = examples["row"] + 1
|
||||
st.dataframe(examples, use_container_width=True, hide_index=True)
|
||||
|
||||
st.markdown("**Cleaned preview (first 10 rows)**")
|
||||
st.dataframe(result.cleaned_df.head(10), use_container_width=True)
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Downloads
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
st.divider()
|
||||
stem = Path(st.session_state.get("textclean_input_name", "input")).stem
|
||||
|
||||
dl_a, dl_b, dl_c = st.columns(3)
|
||||
with dl_a:
|
||||
cleaned_bytes = result.cleaned_df.to_csv(index=False).encode("utf-8-sig")
|
||||
st.download_button(
|
||||
"Download cleaned CSV",
|
||||
data=cleaned_bytes,
|
||||
file_name=f"{stem}_cleaned.csv",
|
||||
mime="text/csv",
|
||||
)
|
||||
with dl_b:
|
||||
if not result.changes.empty:
|
||||
changes_bytes = result.changes.to_csv(index=False).encode("utf-8-sig")
|
||||
st.download_button(
|
||||
"Download changes audit",
|
||||
data=changes_bytes,
|
||||
file_name=f"{stem}_changes.csv",
|
||||
mime="text/csv",
|
||||
)
|
||||
with dl_c:
|
||||
config_bytes = json.dumps(options.to_dict(), indent=2).encode("utf-8")
|
||||
st.download_button(
|
||||
"Download config JSON",
|
||||
data=config_bytes,
|
||||
file_name="text_clean_config.json",
|
||||
mime="application/json",
|
||||
)
|
||||
|
||||
st.divider()
|
||||
st.caption("Runs locally. Your data never leaves this computer. | DataTools v3.0")
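
The encoding fallback in `_read_uploaded` can be tried in isolation. A minimal sketch, assuming a hypothetical helper `decode_with_fallback` that is not part of this commit and works on raw bytes rather than through pandas. One deliberate difference from the page code: the sketch tries `utf-8-sig` first, because with plain `bytes.decode` the `utf-8` codec succeeds on BOM-prefixed input and leaves U+FEFF in the text.

```python
from __future__ import annotations


def decode_with_fallback(
    data: bytes,
    encodings: tuple[str, ...] = ("utf-8-sig", "latin-1"),
) -> tuple[str, str]:
    """Return (text, encoding_used), trying each encoding in order.

    Hypothetical helper for illustration only.
    """
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte value, so this line is only reached when the
    # caller passed a list of encodings that does not include it.
    return data.decode("latin-1"), "latin-1"


text, used = decode_with_fallback(b"\xef\xbb\xbfname,city")
print(repr(text), used)
```

Because `latin-1` never raises `UnicodeDecodeError`, it belongs last in the list, and anything it decodes is best treated as a guess rather than a confirmed encoding.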
|
||||
|
||||
8
test-cases/ec05_multiline_cells.csv
Normal file
@@ -0,0 +1,8 @@
|
||||
id,address
|
||||
1,"123 Main St
|
||||
Apt 4B
|
||||
NYC NY 10001"
|
||||
2,"456 Oak Ave
|
||||
Suite 200
|
||||
LA CA 90001"
|
||||
3,"789 Pine Rd
|
||||
|
BIN
test-cases/ec06_control_characters.csv
Normal file
Binary file not shown.
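
Since the control-character fixture is binary and not rendered here, a small sketch of the behavior the tests describe may help: keep `\t`, `\n`, and `\r`, drop the other C0 controls and DEL. The name `strip_control_sketch` and the exact character class are assumptions for illustration; the real `strip_control` in `src/core/text_clean.py` may cover more (for example C1 controls).

```python
import re

# Assumption: C0 controls except \t (0x09), \n (0x0A), \r (0x0D), plus DEL (0x7F).
_CONTROL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def strip_control_sketch(s: str) -> str:
    """Remove control characters while keeping tab, newline, and carriage return."""
    return _CONTROL.sub("", s)


print(repr(strip_control_sketch("a\x00b\tc\x07d")))
```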
|
5
test-cases/ec07_unicode_decomposed.csv
Normal file
@@ -0,0 +1,5 @@
|
||||
name,translation
|
||||
Café,Cafe
|
||||
éclair,eclair
|
||||
你好,Hello (CN)
|
||||
שלום,Hello (HE)
|
||||
|
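
The decomposed-accent fixture is hard to read in a diff because both forms render identically. A short sketch using only the standard library `unicodedata` module shows the difference with explicit escapes; the sample strings are illustrative and not taken from the fixture bytes.

```python
import unicodedata

decomposed = "Cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)

# Same rendering, different code points and lengths.
print(len(decomposed), len(composed))
print(composed == "Caf\u00e9")

# NFKC additionally folds compatibility characters, which is lossy.
folded = unicodedata.normalize("NFKC", "\u2460\ufb01")  # circled one + fi ligature
print(folded)
```

NFC only recombines canonically equivalent sequences, so it is safe for lookups and joins; NFKC changes what the text says, which matches its "paranoid" placement in the presets.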
5
test-cases/ec08_all_numeric.csv
Normal file
@@ -0,0 +1,5 @@
|
||||
x,y,z
|
||||
1,1.1,10
|
||||
2,2.2,20
|
||||
3,3.3,30
|
||||
4,4.4,40
|
||||
|
6
test-cases/ec09_smart_chars_full.csv
Normal file
@@ -0,0 +1,6 @@
|
||||
field
|
||||
‘single curly’
|
||||
“double curly”
|
||||
low-9 ‘x’ high-reversed-9
|
||||
em — en – minus − horizontal ―
|
||||
ellipsis… narrow nbsp
|
||||
|
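
One way to fold the characters in this fixture is a single `str.translate` table. This is a sketch under assumptions: the mapping below is inferred from the fixture rows and the `TestSmartChars` expectations, and `fold_smart_sketch` is a hypothetical name, not the commit's `fold_smart_chars`.

```python
# Assumed mapping; the real table in src/core/text_clean.py may differ.
SMART_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'", "\u201a": "'", "\u201b": "'",   # single quotes
    "\u201c": '"', "\u201d": '"', "\u201e": '"', "\u201f": '"',   # double quotes
    "\u2013": "-", "\u2014": "-", "\u2212": "-", "\u2015": "-",   # dashes and minus
    "\u2026": "...",                                              # ellipsis
    "\u00a0": " ", "\u202f": " ",                                 # NBSP, narrow NBSP
})


def fold_smart_sketch(s: str) -> str:
    """Replace typographic characters with ASCII equivalents."""
    return s.translate(SMART_MAP)


print(fold_smart_sketch("\u201cBest Dog Collar\u201d \u2014 wait\u2026"))
```

Every replacement is plain ASCII, so a second pass finds nothing left to fold, which is what keeps the operation idempotent.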
5
test-cases/uc16_shopify_nbsp_names.csv
Normal file
@@ -0,0 +1,5 @@
|
||||
first_name,last_name,phone
|
||||
John ,Smith,555-1234
|
||||
Jane,Doe ,555-5678
|
||||
Bob,Jones,555-9012
|
||||
Alice,Brown,555-3456
|
||||
|
4
test-cases/uc17_product_smart_quotes.csv
Normal file
@@ -0,0 +1,4 @@
|
||||
sku,title,description
|
||||
DOG-001,“Best Dog Collar”,High quality…
|
||||
CAT-002,Cat Toy — Premium,It’s the best
|
||||
FISH-003,Fish Food – Tropical,Use don’t overfeed
|
||||
|
4
test-cases/uc18_excel_csv_utf8_bom.csv
Normal file
@@ -0,0 +1,4 @@
|
||||
customer_id,name,amount
|
||||
1001,Alice,100.0
|
||||
1002,Bob,200.0
|
||||
1003,Charlie,300.0
|
||||
|
4
test-cases/uc19_pasted_sku_zerowidth.csv
Normal file
@@ -0,0 +1,4 @@
|
||||
sku,qty
|
||||
ABC-123,10
|
||||
XYZ-456,20
|
||||
QQQ-789,30
|
||||
|
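
The SKUs in this fixture look clean because the problem characters are invisible. A sketch of detecting and stripping them, assuming a character set inferred from the `TestZeroWidth` cases (ZWSP, ZWNJ, ZWJ, bidi marks, word joiner, mid-string U+FEFF); `strip_zero_width_sketch` is a hypothetical name.

```python
import re

# Assumed set: U+200B..U+200F, U+2060 WORD JOINER, U+FEFF ZERO WIDTH NO-BREAK SPACE.
_ZERO_WIDTH = re.compile("[\u200b\u200c\u200d\u200e\u200f\u2060\ufeff]")


def strip_zero_width_sketch(s: str) -> str:
    """Remove zero-width and invisible formatting characters."""
    return _ZERO_WIDTH.sub("", s)


pasted = "ABC\u200b-123"
print(pasted == "ABC-123")                           # False: the lookup key differs
print(strip_zero_width_sketch(pasted) == "ABC-123")  # True after stripping
```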
7
test-cases/uc20_bank_memo_crlf.csv
Normal file
@@ -0,0 +1,7 @@
|
||||
date,amount,memo
|
||||
2024-01-15,-1500.0,"Payment
|
||||
Monthly recurring
|
||||
Net 30"
|
||||
2024-01-16,-250.0,Single line memo
|
||||
2024-01-17,-89.99,"Standard
|
||||
purchase"
|
||||
|
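
Line endings inside quoted cells are not visible in this rendering either. A minimal sketch of the normalization the `TestLineEndings` cases describe, with `normalize_line_endings_sketch` as a hypothetical stand-in for the commit's helper:

```python
def normalize_line_endings_sketch(s: str) -> str:
    """Convert CRLF and bare CR to LF. CRLF must be handled first so it
    does not turn into two newlines."""
    return s.replace("\r\n", "\n").replace("\r", "\n")


memo = "Payment\r\nMonthly recurring\rNet 30"
print(repr(normalize_line_endings_sketch(memo)))
```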
6
test-cases/uc21_quickbooks_trailing_spaces.csv
Normal file
@@ -0,0 +1,6 @@
|
||||
vendor,ein
|
||||
ACME Corp ,12-3456789
|
||||
ACME Corp,12-3456789
|
||||
ACME Corp ,12-3456789
|
||||
Globex Inc,98-7654321
|
||||
Globex Inc ,98-7654321
|
||||
|
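
This fixture exists because trailing and doubled spaces make identical vendors compare unequal. A sketch of trim plus collapse, assuming that collapse only touches spaces and tabs (the CRLF memo test elsewhere in this commit expects newlines to survive the default pipeline); both helper names are hypothetical.

```python
import re


def collapse_ws_sketch(s: str) -> str:
    """Collapse runs of spaces and tabs into one space, leaving newlines alone."""
    return re.sub(r"[ \t]+", " ", s)


def normalize_key_sketch(s: str) -> str:
    """Collapse, then trim, so the result is usable as a join or lookup key."""
    return collapse_ws_sketch(s).strip()


vendors = ["ACME Corp ", "ACME Corp", "ACME  Corp  "]
print({normalize_key_sketch(v) for v in vendors})
```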
4
test-cases/uc22_unicode_accents.csv
Normal file
@@ -0,0 +1,4 @@
|
||||
company,city
|
||||
Café Roma,Boston
|
||||
Très Belle,Montréal
|
||||
Naïve Studios,São Paulo
|
||||
|
4
test-cases/uc23_word_pasted_dashes.csv
Normal file
@@ -0,0 +1,4 @@
|
||||
task,owner
|
||||
Phase 1 — Discovery,Alice
|
||||
Phase 2 — Design,Bob
|
||||
Q1 – Q2,Charlie
|
||||
|
6
test-cases/uc24_survey_case_inconsistent.csv
Normal file
@@ -0,0 +1,6 @@
|
||||
response_id,agreement,category
|
||||
1,YES,Tech
|
||||
2,yes,TECH
|
||||
3,Yes,tech
|
||||
4,yEs,Tech
|
||||
5,yes, Tech
|
||||
|
4
test-cases/uc25_lead_invisible_unicode.csv
Normal file
@@ -0,0 +1,4 @@
|
||||
email,source
|
||||
alice@test.com,Facebook
|
||||
bob@test.com,Google
|
||||
charlie@test.com,Organic
|
||||
|
6
test-cases/uc26_mixed_line_endings.csv
Normal file
@@ -0,0 +1,6 @@
|
||||
email,platform
|
||||
alice@a.com,FB
|
||||
"alice@a.com
|
||||
",Google
|
||||
"alice@a.com
|
||||
",Organic
|
||||
|
158
tests/test_cli_text_clean.py
Normal file
@@ -0,0 +1,158 @@
|
||||
"""Integration tests for the text-cleaner CLI."""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from pathlib import Path
|
||||
|
||||
import pandas as pd
|
||||
import pytest
|
||||
from typer.testing import CliRunner
|
||||
|
||||
from src.cli_text_clean import app
|
||||
|
||||
runner = CliRunner()
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def messy_csv(tmp_path):
|
||||
df = pd.DataFrame({
|
||||
"name": [" Alice ", "“Bob”", "Charlie"],
|
||||
"city": ["NYC", " LA ", "SF"],
|
||||
"qty": [1, 2, 3],
|
||||
})
|
||||
path = tmp_path / "messy.csv"
|
||||
df.to_csv(path, index=False)
|
||||
return path
|
||||
|
||||
|
||||
class TestPreview:
|
||||
def test_default_is_preview(self, messy_csv):
|
||||
result = runner.invoke(app, [str(messy_csv)])
|
||||
assert result.exit_code == 0
|
||||
assert "preview" in result.output.lower()
|
||||
assert "Cells changed" in result.output
|
||||
|
||||
def test_no_files_written_in_preview(self, messy_csv):
|
||||
result = runner.invoke(app, [str(messy_csv)])
|
||||
assert result.exit_code == 0
|
||||
assert not (messy_csv.parent / f"{messy_csv.stem}_cleaned.csv").exists()
|
||||
|
||||
def test_file_not_found(self):
|
||||
result = runner.invoke(app, ["/tmp/does_not_exist_xyz.csv"])
|
||||
assert result.exit_code != 0
|
||||
assert "not found" in result.output.lower()
|
||||
|
||||
|
||||
class TestApply:
|
||||
def test_apply_writes_cleaned_file(self, messy_csv): # E47
|
||||
result = runner.invoke(app, [str(messy_csv), "--apply"])
|
||||
assert result.exit_code == 0
|
||||
cleaned = messy_csv.parent / f"{messy_csv.stem}_cleaned.csv"
|
||||
assert cleaned.exists()
|
||||
df = pd.read_csv(cleaned)
|
||||
assert df["name"].iloc[0] == "Alice"
|
||||
|
||||
def test_apply_writes_changes_audit(self, messy_csv):
|
||||
result = runner.invoke(app, [str(messy_csv), "--apply"])
|
||||
assert result.exit_code == 0
|
||||
changes = messy_csv.parent / f"{messy_csv.stem}_changes.csv"
|
||||
assert changes.exists()
|
||||
|
||||
def test_no_audit_when_no_changes(self, tmp_path):
|
||||
clean = tmp_path / "clean.csv"
|
||||
pd.DataFrame({"a": ["x", "y"]}).to_csv(clean, index=False)
|
||||
result = runner.invoke(app, [str(clean), "--apply"])
|
||||
assert result.exit_code == 0
|
||||
assert not (tmp_path / "clean_changes.csv").exists()
|
||||
|
||||
def test_custom_output_path(self, messy_csv, tmp_path):
|
||||
out = tmp_path / "renamed.csv"
|
||||
result = runner.invoke(app, [str(messy_csv), "--apply", "-o", str(out)])
|
||||
assert result.exit_code == 0
|
||||
assert out.exists()
|
||||
|
||||
|
||||
class TestPresets:
|
||||
def test_minimal_does_not_fold_smart_chars(self, messy_csv):
|
||||
result = runner.invoke(app, [str(messy_csv), "--apply", "--preset", "minimal"])
|
||||
assert result.exit_code == 0
|
||||
cleaned = messy_csv.parent / f"{messy_csv.stem}_cleaned.csv"
|
||||
df = pd.read_csv(cleaned)
|
||||
# Smart quotes preserved under minimal preset
|
||||
assert "“" in df["name"].iloc[1] or "”" in df["name"].iloc[1]
|
||||
|
||||
def test_excel_hygiene_default_folds_smart_chars(self, messy_csv):
|
||||
result = runner.invoke(app, [str(messy_csv), "--apply"])
|
||||
assert result.exit_code == 0
|
||||
cleaned = messy_csv.parent / f"{messy_csv.stem}_cleaned.csv"
|
||||
df = pd.read_csv(cleaned)
|
||||
assert df["name"].iloc[1] == '"Bob"'
|
||||
|
||||
def test_unknown_preset_errors(self, messy_csv):
|
||||
result = runner.invoke(app, [str(messy_csv), "--preset", "weird"])
|
||||
assert result.exit_code != 0
|
||||
assert "Unknown preset" in result.output
|
||||
|
||||
|
||||
class TestColumnSelection:
|
||||
def test_columns_flag(self, messy_csv):
|
||||
result = runner.invoke(
|
||||
app, [str(messy_csv), "--apply", "--columns", "name"],
|
||||
)
|
||||
assert result.exit_code == 0
|
||||
cleaned = messy_csv.parent / f"{messy_csv.stem}_cleaned.csv"
|
||||
df = pd.read_csv(cleaned)
|
||||
assert df["name"].iloc[0] == "Alice"
|
||||
# city should be untouched (still has spaces)
|
||||
assert df["city"].iloc[1] == " LA "
|
||||
|
||||
def test_skip_flag(self, messy_csv):
|
||||
result = runner.invoke(
|
||||
app, [str(messy_csv), "--apply", "--skip", "name"],
|
||||
)
|
||||
assert result.exit_code == 0
|
||||
cleaned = messy_csv.parent / f"{messy_csv.stem}_cleaned.csv"
|
||||
df = pd.read_csv(cleaned)
|
||||
# name should still have spaces
|
||||
assert df["name"].iloc[0].startswith(" ")
|
||||
|
||||
|
||||
class TestCaseFlag:
|
||||
def test_bare_case_applies_to_all(self, tmp_path):
|
||||
path = tmp_path / "names.csv"
|
||||
pd.DataFrame({"a": ["alice"], "b": ["bob"]}).to_csv(path, index=False)
|
||||
result = runner.invoke(app, [str(path), "--apply", "--case", "upper"])
|
||||
assert result.exit_code == 0
|
||||
df = pd.read_csv(tmp_path / "names_cleaned.csv")
|
||||
assert df["a"].iloc[0] == "ALICE"
|
||||
assert df["b"].iloc[0] == "BOB"
|
||||
|
||||
def test_per_column_case(self, tmp_path):
|
||||
path = tmp_path / "names.csv"
|
||||
pd.DataFrame({"name": ["alice"], "code": ["abc"]}).to_csv(path, index=False)
|
||||
result = runner.invoke(
|
||||
app, [str(path), "--apply", "--case", "title:name,upper:code"],
|
||||
)
|
||||
assert result.exit_code == 0
|
||||
df = pd.read_csv(tmp_path / "names_cleaned.csv")
|
||||
assert df["name"].iloc[0] == "Alice"
|
||||
assert df["code"].iloc[0] == "ABC"
|
||||
|
||||
|
||||
class TestConfigRoundTrip:
|
||||
def test_save_and_load(self, messy_csv, tmp_path):
|
||||
cfg = tmp_path / "opts.json"
|
||||
result1 = runner.invoke(
|
||||
app,
|
||||
[str(messy_csv), "--save-config", str(cfg), "--preset", "minimal", "--no-trim"],
|
||||
)
|
||||
assert result1.exit_code == 0
|
||||
assert cfg.exists()
|
||||
|
||||
# Reload and apply
|
||||
result2 = runner.invoke(app, [str(messy_csv), "--apply", "--config", str(cfg)])
|
||||
assert result2.exit_code == 0
|
||||
cleaned = messy_csv.parent / f"{messy_csv.stem}_cleaned.csv"
|
||||
df = pd.read_csv(cleaned)
|
||||
# With --no-trim, leading spaces survive
|
||||
assert df["name"].iloc[0].startswith(" ")
|
||||
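
The `--case` tests above accept both a bare mode (`upper`) and a per-column spec (`title:name,upper:code`). How the CLI parses that value is not shown in this part of the diff; the following is only a sketch of one plausible parser, with `parse_case_spec` and `VALID_MODES` as hypothetical names.

```python
from __future__ import annotations

VALID_MODES = {"upper", "lower", "title", "sentence"}


def parse_case_spec(spec: str) -> tuple[str | None, dict[str, str]]:
    """Split a --case value into (global_mode, {column: mode}).

    Hypothetical helper: 'upper' sets a global mode, 'title:name' targets
    a single column. Column names containing ',' or ':' are not supported.
    """
    global_mode: str | None = None
    per_column: dict[str, str] = {}
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        mode, sep, column = part.partition(":")
        mode = mode.strip().lower()
        if mode not in VALID_MODES:
            raise ValueError(f"Unknown case mode: {mode!r}")
        if sep:
            per_column[column.strip()] = mode
        else:
            global_mode = mode
    return global_mode, per_column


print(parse_case_spec("title:name,upper:code"))
```

The two return values line up with the `case` and `case_columns` fields that the `CleanOptions` tests exercise, which is the reason for splitting them here.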
482
tests/test_text_clean.py
Normal file
@@ -0,0 +1,482 @@
|
||||
"""Tests for src/core/text_clean.py.
|
||||
|
||||
Covers edge cases E1-E50 from TECHNICAL.md Section 10.2 plan.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
|
||||
import numpy as np
|
||||
import pandas as pd
|
||||
import pytest
|
||||
|
||||
from src.core.text_clean import (
|
||||
CleanOptions,
|
||||
PRESETS,
|
||||
apply_case,
|
||||
clean_dataframe,
|
||||
clean_value,
|
||||
collapse_whitespace,
|
||||
fold_smart_chars,
|
||||
normalize_line_endings,
|
||||
sentence_case,
|
||||
smart_title_case,
|
||||
strip_bom,
|
||||
strip_control,
|
||||
strip_zero_width,
|
||||
to_nfc,
|
||||
to_nfkc,
|
||||
trim,
|
||||
)
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Per-string helpers
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
class TestTrim:
|
||||
def test_strips_leading_and_trailing(self):
|
||||
assert trim(" hello ") == "hello"
|
||||
|
||||
def test_preserves_internal_spaces(self):
|
||||
assert trim(" a b ") == "a b"
|
||||
|
||||
def test_empty_string(self):
|
||||
assert trim("") == ""
|
||||
|
||||
def test_idempotent(self):
|
||||
assert trim(trim(" x ")) == trim(" x ")
|
||||
|
||||
|
||||
class TestCollapseWhitespace:
|
||||
def test_multiple_spaces(self):
|
||||
assert collapse_whitespace("a   b") == "a b"
|
||||
|
||||
def test_tab_inside_cell(self): # E2
|
||||
assert collapse_whitespace("a\tb") == "a b"
|
||||
|
||||
def test_mixed_tabs_and_spaces(self): # E3
|
||||
assert collapse_whitespace("a \t \t b") == "a b"
|
||||
|
||||
def test_idempotent(self):
|
||||
assert collapse_whitespace(collapse_whitespace("a   b")) == collapse_whitespace("a   b")
|
||||
|
||||
|
||||
class TestNFC:
|
||||
def test_combining_acute(self): # E6
|
||||
decomposed = "e\u0301"  # e + combining acute accent
|
||||
composed = "\u00e9"  # precomposed é
|
||||
assert to_nfc(decomposed) == composed
|
||||
|
||||
def test_idempotent(self):
|
||||
s = "café"
|
||||
assert to_nfc(to_nfc(s)) == to_nfc(s)
|
||||
|
||||
|
||||
class TestNFKC:
|
||||
def test_circled_digit(self): # E7
|
||||
assert to_nfkc("①") == "1"
|
||||
|
||||
def test_ligature(self): # E7
|
||||
assert to_nfkc("\ufb01") == "fi"  # U+FB01 LATIN SMALL LIGATURE FI
|
||||
|
||||
def test_idempotent(self):
|
||||
assert to_nfkc(to_nfkc("①\ufb01")) == to_nfkc("①\ufb01")
|
||||
|
||||
|
||||
class TestSmartChars:
|
||||
def test_curly_quotes(self): # E11
|
||||
assert fold_smart_chars("‘hi’") == "'hi'"
|
||||
assert fold_smart_chars("“hi”") == '"hi"'
|
||||
|
||||
def test_dashes(self): # E12
|
||||
assert fold_smart_chars("a—b") == "a-b"
|
||||
assert fold_smart_chars("a–b") == "a-b"
|
||||
|
||||
def test_ellipsis(self): # E13
|
||||
assert fold_smart_chars("wait…") == "wait..."
|
||||
|
||||
def test_nbsp(self): # E14
|
||||
assert fold_smart_chars("a\u00a0b") == "a b"
|
||||
|
||||
def test_idempotent(self):
|
||||
s = "“hi” — a b"
|
||||
assert fold_smart_chars(fold_smart_chars(s)) == fold_smart_chars(s)
|
||||
|
||||
|
||||
class TestZeroWidth:
|
||||
def test_zwsp_midword(self): # E16
|
||||
assert strip_zero_width("foo\u200bbar") == "foobar"
|
||||
|
||||
def test_bidi_marks_stripped(self): # E17
|
||||
assert strip_zero_width("a\u200eb\u200fc") == "abc"
|
||||
|
||||
def test_word_joiner(self): # E18
|
||||
assert strip_zero_width("a\u2060b") == "ab"
|
||||
|
||||
def test_mid_string_feff(self): # E22
|
||||
assert strip_zero_width("foo\ufeffbar") == "foobar"
|
||||
|
||||
|
||||
class TestStripBOM:
|
||||
def test_leading_bom(self):
|
||||
assert strip_bom("\ufeffhello") == "hello"
|
||||
|
||||
def test_no_bom(self):
|
||||
assert strip_bom("hello") == "hello"
|
||||
|
||||
def test_idempotent(self):
|
||||
assert strip_bom(strip_bom("\ufeffx")) == strip_bom("\ufeffx")
|
||||
|
||||
|
||||
class TestStripControl:
|
||||
def test_null_byte(self): # E20
|
||||
assert strip_control("a\x00b") == "ab"
|
||||
|
||||
def test_preserves_tab_newline_cr(self): # E19
|
||||
assert strip_control("a\tb\nc\rd") == "a\tb\nc\rd"
|
||||
|
||||
def test_strips_other_control(self):
|
||||
# Everything in 0x01..0x1F except \t, \n, and \r is stripped (VT and FF included).
|
||||
assert strip_control("a\x01b\x07c\x1fd") == "abcd"
|
||||
|
||||
def test_strips_del(self):
|
||||
assert strip_control("a\x7fb") == "ab"
|
||||
|
||||
|
||||
class TestLineEndings:
|
||||
def test_crlf(self): # E23
|
||||
assert normalize_line_endings("a\r\nb") == "a\nb"
|
||||
|
||||
def test_bare_cr(self): # E24
|
||||
assert normalize_line_endings("a\rb") == "a\nb"
|
||||
|
||||
def test_idempotent(self):
|
||||
assert (
|
||||
normalize_line_endings(normalize_line_endings("a\r\nb\rc"))
|
||||
== normalize_line_endings("a\r\nb\rc")
|
||||
)
|
||||
|
||||
|
||||
class TestSmartTitleCase:
|
||||
def test_preserves_acronym(self): # E26
|
||||
assert smart_title_case("USA report") == "USA Report"
|
||||
assert smart_title_case("nasa launch") == "Nasa Launch"  # lowercase input is not treated as an acronym
|
||||
assert smart_title_case("NASA launch") == "NASA Launch"
|
||||
|
||||
def test_lowercases_particles_midstring(self): # E27
|
||||
assert smart_title_case("the lord of the rings") == "The Lord of the Rings"
|
||||
assert smart_title_case("a tale of two cities") == "A Tale of Two Cities"
|
||||
|
||||
def test_keeps_first_and_last_capitalized(self):
|
||||
# "of" at the end stays capitalized
|
||||
result = smart_title_case("kingdom of")
|
||||
assert result == "Kingdom Of"
|
||||
|
||||
def test_apostrophe(self):
|
||||
assert smart_title_case("o'neil") == "O'neil"
|
||||
|
||||
|
||||
class TestSentenceCase:
|
||||
def test_basic(self): # E28
|
||||
assert sentence_case("hello. how are you? fine!") == "Hello. How are you? Fine!"
|
||||
|
||||
def test_preserves_punctuation(self):
|
||||
assert sentence_case("WHAT? OK.") == "What? Ok."
|
||||
|
||||
|
||||
class TestApplyCase:
|
||||
def test_modes(self):
|
||||
assert apply_case("Hello World", "upper") == "HELLO WORLD"
|
||||
assert apply_case("Hello World", "lower") == "hello world"
|
||||
assert apply_case("hello world", "title") == "Hello World"
|
||||
assert apply_case("hello. world.", "sentence") == "Hello. World."
|
||||
|
||||
def test_unknown_mode_raises(self):
|
||||
with pytest.raises(ValueError):
|
||||
apply_case("x", "weird") # type: ignore[arg-type]
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# clean_value composition
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
class TestCleanValue:
|
||||
def test_default_excel_hygiene(self):
|
||||
opts = CleanOptions()
|
||||
out, ops = clean_value("“Hello world” ", opts)
|
||||
assert out == '"Hello world"'
|
||||
assert "fold_smart_chars" in ops
|
||||
assert "trim" in ops
|
||||
|
||||
def test_pure_whitespace_to_empty(self): # E1
|
||||
opts = CleanOptions()
|
||||
out, ops = clean_value(" ", opts)
|
||||
assert out == ""
|
||||
|
||||
def test_nbsp_only_cell(self): # E5
|
||||
opts = CleanOptions()
|
||||
out, _ = clean_value("\u00a0", opts)
|
||||
assert out == ""
|
||||
|
||||
def test_non_string_passthrough(self): # E32
|
||||
opts = CleanOptions()
|
||||
for val in (None, 42, 3.14, True, np.nan):
|
||||
out, ops = clean_value(val, opts)
|
||||
# NaN compares unequal to itself; check pd.isna for that case
|
||||
if isinstance(val, float) and pd.isna(val):
|
||||
assert pd.isna(out)
|
||||
else:
|
||||
assert out == val
|
||||
assert ops == []
|
||||
|
||||
def test_empty_string(self):
|
||||
opts = CleanOptions()
|
||||
out, ops = clean_value("", opts)
|
||||
assert out == ""
|
||||
assert ops == []
|
||||
|
||||
def test_only_unchanged_ops_not_logged(self):
|
||||
opts = CleanOptions(trim=True, collapse_whitespace=True, nfc=False, nfkc=False,
|
||||
fold_smart_chars=False, strip_zero_width=False,
|
||||
strip_bom=False, strip_control=False,
|
||||
normalize_line_endings=False)
|
||||
out, ops = clean_value("hello", opts)
|
||||
assert out == "hello"
|
||||
assert ops == []
|
||||
|
||||
|
||||
class TestIdempotency:
|
||||
"""E40 — applying the pipeline twice yields the same result as once."""
|
||||
|
||||
@pytest.mark.parametrize("preset", list(PRESETS.keys()))
|
||||
def test_preset_idempotent(self, preset):
|
||||
opts = CleanOptions.from_preset(preset)
|
||||
cases = [
|
||||
"“Hello world” ",
|
||||
" \t multi space \r\n ",
|
||||
"café",
|
||||
"éclair",
|
||||
"\ufeffleading-bom",
|
||||
"USA and the Rings",
|
||||
"a\x00b\x01c",
|
||||
"",
|
||||
" ",
|
||||
]
|
||||
for s in cases:
|
||||
once, _ = clean_value(s, opts)
|
||||
twice, _ = clean_value(once, opts)
|
||||
assert once == twice, f"not idempotent on {s!r} (preset {preset})"
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# clean_dataframe
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
class TestCleanDataframe:
|
||||
def test_only_string_columns_touched(self): # E31, E33, E35
|
||||
df = pd.DataFrame({
|
||||
"name": [" Alice ", "Bob"],
|
||||
"age": [30, 25],
|
||||
"joined": pd.to_datetime(["2024-01-01", "2024-02-01"]),
|
||||
"active": [True, False],
|
||||
})
|
||||
result = clean_dataframe(df)
|
||||
assert result.cleaned_df["name"].tolist() == ["Alice", "Bob"]
|
||||
assert result.cleaned_df["age"].tolist() == [30, 25]
|
||||
assert result.cleaned_df["active"].tolist() == [True, False]
|
||||
assert "name" in result.columns_processed
|
||||
assert "age" not in result.columns_processed
|
||||
|
||||
def test_explicit_columns(self): # E41
|
||||
df = pd.DataFrame({"a": [" x "], "b": [" y "]})
|
||||
result = clean_dataframe(df, CleanOptions(columns=["a"]))
|
||||
assert result.cleaned_df["a"].iloc[0] == "x"
|
||||
assert result.cleaned_df["b"].iloc[0] == " y "
|
||||
assert result.columns_processed == ["a"]
|
||||
|
||||
def test_skip_columns(self): # E42
|
||||
df = pd.DataFrame({"name": [" A "], "notes": [" free text "]})
|
||||
result = clean_dataframe(df, CleanOptions(skip_columns=["notes"]))
|
||||
assert result.cleaned_df["name"].iloc[0] == "A"
|
||||
assert result.cleaned_df["notes"].iloc[0] == " free text "
|
||||
|
||||
def test_unknown_column_raises(self):
|
||||
df = pd.DataFrame({"a": ["x"]})
|
||||
with pytest.raises(ValueError):
|
||||
clean_dataframe(df, CleanOptions(columns=["missing"]))
|
||||
|
||||
def test_empty_dataframe(self): # E43
|
||||
df = pd.DataFrame()
|
||||
result = clean_dataframe(df)
|
||||
assert result.cells_changed == 0
|
||||
assert result.cells_total == 0
|
||||
assert result.cleaned_df.empty
|
||||
|
||||
def test_single_column_file(self): # E44
|
||||
df = pd.DataFrame({"only": [" hello "]})
|
||||
result = clean_dataframe(df)
|
||||
assert result.cleaned_df["only"].iloc[0] == "hello"
|
||||
|
||||
def test_all_numeric_no_op(self): # E45
|
||||
df = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]})
|
||||
result = clean_dataframe(df)
|
||||
assert result.columns_processed == []
|
||||
assert result.cells_changed == 0
|
||||
|
||||
def test_mixed_object_column_strings_only(self): # E34
|
||||
df = pd.DataFrame({"mix": [" hello ", 42, None]})
|
||||
result = clean_dataframe(df)
|
||||
assert result.cleaned_df["mix"].iloc[0] == "hello"
|
||||
assert result.cleaned_df["mix"].iloc[1] == 42
|
||||
assert result.cleaned_df["mix"].iloc[2] is None
|
||||
|
||||
def test_nan_preserved(self): # E32
|
||||
df = pd.DataFrame({"a": [" x ", np.nan]})
|
||||
result = clean_dataframe(df)
|
||||
assert result.cleaned_df["a"].iloc[0] == "x"
|
||||
assert pd.isna(result.cleaned_df["a"].iloc[1])
|
||||
|
||||
def test_changes_audit_count(self): # E48
|
||||
df = pd.DataFrame({"a": [" x ", "y", " z"]})
|
||||
result = clean_dataframe(df)
|
||||
assert result.cells_changed == 2
|
||||
assert len(result.changes) == 2
|
||||
assert set(result.changes["row"].tolist()) == {0, 2}
|
||||
|
||||
def test_does_not_mutate_input(self):
|
||||
df = pd.DataFrame({"a": [" x "]})
|
||||
original = df.copy()
|
||||
clean_dataframe(df)
|
||||
assert df.equals(original)
|
||||
|
||||
def test_per_column_case_via_case_columns(self):
|
||||
df = pd.DataFrame({"name": ["alice"], "code": ["abc"]})
|
||||
result = clean_dataframe(df, CleanOptions(case_columns={"code": "upper"}))
|
||||
assert result.cleaned_df["name"].iloc[0] == "alice"
|
||||
assert result.cleaned_df["code"].iloc[0] == "ABC"
|
||||
|
||||
def test_global_case_applied_to_selected_only(self):
|
||||
df = pd.DataFrame({"name": ["alice"], "notes": ["bob"]})
|
||||
result = clean_dataframe(
|
||||
df, CleanOptions(columns=["name"], case="upper"),
|
||||
)
|
||||
assert result.cleaned_df["name"].iloc[0] == "ALICE"
|
||||
assert result.cleaned_df["notes"].iloc[0] == "bob"
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Presets and config round-trip
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
class TestPresets:
|
||||
def test_minimal_only_trim_collapse(self):
|
||||
opts = CleanOptions.from_preset("minimal")
|
||||
assert opts.trim is True
|
||||
assert opts.collapse_whitespace is True
|
||||
assert opts.nfc is False
|
||||
assert opts.fold_smart_chars is False
|
||||
|
||||
def test_excel_hygiene_smart_chars_on_nfkc_off(self):
|
||||
opts = CleanOptions.from_preset("excel-hygiene")
|
||||
assert opts.fold_smart_chars is True
|
||||
assert opts.nfc is True
|
||||
assert opts.nfkc is False
|
||||
|
||||
def test_paranoid_includes_nfkc(self):
|
||||
opts = CleanOptions.from_preset("paranoid")
|
||||
assert opts.nfkc is True
|
||||
|
||||
def test_unknown_preset_raises(self):
|
||||
with pytest.raises(ValueError):
|
||||
CleanOptions.from_preset("does-not-exist")
|
||||
|
||||
|
||||
class TestConfigRoundTrip:
|
||||
def test_dict_roundtrip(self): # E49
|
||||
opts = CleanOptions(
|
||||
trim=False, nfc=True, columns=["a", "b"], skip_columns=["c"],
|
||||
case="upper",
|
||||
)
|
||||
recovered = CleanOptions.from_dict(opts.to_dict())
|
||||
assert recovered == opts
|
||||
|
||||
def test_file_roundtrip(self, tmp_path):
|
||||
path = tmp_path / "opts.json"
|
||||
opts = CleanOptions(case_columns={"code": "upper"}, fold_smart_chars=False)
|
||||
opts.to_file(path)
|
||||
loaded = CleanOptions.from_file(path)
|
||||
assert loaded == opts
|
||||
|
||||
def test_unknown_keys_ignored(self): # E50
|
||||
data = {"trim": True, "totally_made_up_key": 42}
|
||||
opts = CleanOptions.from_dict(data)
|
||||
assert opts.trim is True
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Use-case smoke tests (whole-pipeline)
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
class TestUseCases:
|
||||
def test_excel_save_as_csv_utf8_bom(self):
|
||||
# UC3: BOM at start of first cell
|
||||
df = pd.DataFrame({"name": ["\ufeffAlice", "Bob"], "city": ["NYC", "LA"]})
|
||||
result = clean_dataframe(df)
|
||||
assert result.cleaned_df["name"].iloc[0] == "Alice"
|
||||
|
||||
def test_word_smart_quotes_in_product_titles(self):
|
||||
# UC2
|
||||
df = pd.DataFrame({"title": ["“Best Dog Collar”", "Cat Toy — Red"]})
|
||||
result = clean_dataframe(df)
|
||||
assert result.cleaned_df["title"].iloc[0] == '"Best Dog Collar"'
|
||||
assert result.cleaned_df["title"].iloc[1] == "Cat Toy - Red"
|
||||
|
||||
    def test_nbsp_in_email_field(self):
        # UC10: invisible Unicode hiding in emails
        # Invisible characters are written as escapes so they survive copy/paste;
        # the exact position of the ZWSP in the first value is illustrative.
        df = pd.DataFrame({"email": ["alice\u200b@test.com", "bob\u00a0@test.com"]})
        result = clean_dataframe(df)
        # ZWSP is stripped. NBSP is folded to a regular space and collapsed, but
        # trim does not remove an internal space, so "bob @test.com" remains.
        # That is correct: the cleaner doesn't know the value is an email, and
        # script 03 owns email format. Just confirm the invisible chars are gone.
        assert "\u200b" not in result.cleaned_df["email"].iloc[0]
        assert "\u00a0" not in result.cleaned_df["email"].iloc[1]

    def test_quickbooks_trailing_spaces(self):
        # UC6: VLOOKUP fails because of trailing spaces
        df = pd.DataFrame({"vendor": ["ACME Corp ", "ACME Corp"]})
        result = clean_dataframe(df)
        assert result.cleaned_df["vendor"].iloc[0] == result.cleaned_df["vendor"].iloc[1]

    def test_bank_export_crlf_in_memo(self):
        # UC5: \r\n inside multi-line memo cells
        df = pd.DataFrame({"memo": ["line one\r\nline two\r\nline three"]})
        result = clean_dataframe(df)
        assert "\r" not in result.cleaned_df["memo"].iloc[0]
        assert result.cleaned_df["memo"].iloc[0].count("\n") == 2


# ---------------------------------------------------------------------------
# Reporting / dtype edge cases
# ---------------------------------------------------------------------------


class TestReporting:
    def test_changes_columns_present(self):
        df = pd.DataFrame({"a": [" x "]})
        result = clean_dataframe(df)
        assert list(result.changes.columns) == [
            "row", "column", "old", "new", "ops_applied",
        ]

    def test_changes_empty_when_no_changes(self):
        df = pd.DataFrame({"a": ["x", "y"]})
        result = clean_dataframe(df)
        assert result.cells_changed == 0
        assert result.changes.empty

    def test_cells_total_counts_only_processed_columns(self):
        df = pd.DataFrame({"a": ["x", "y", "z"], "n": [1, 2, 3]})
        result = clean_dataframe(df)
        assert result.cells_total == 3  # only "a" is processed