Michael e9c490ae1b feat(gui): hidden-char-aware preview tables in Text Cleaner
The Text Cleaner had two st.dataframe previews — the initial upload
preview ("Preview: filename") and the post-clean "Cleaned preview"
table — that both rendered cells with the same browser-collapses-
whitespace, hides-invisibles problem the analyzer findings panel had
before commit 1049c03.

components.render_hidden_aware_preview(df, n_rows, caption) renders a
DataFrame as an HTML table where:
  - every cell uses visualize_hidden_html(mark_outer_whitespace=True),
    so leading/trailing ASCII spaces appear as per-character "·" badges
  - white-space: pre-wrap on every cell preserves internal multi-space
    runs and embedded newlines visually
  - headers route through the same visualizer so dirty column names
    (NBSP padding, ZWSP, smart quotes) show their badges too
  - NaN cells render as a faint "NaN" placeholder
  - rows are sticky-headed and scrollable inside a 26rem capped
    container so a 10-row preview doesn't push the rest of the UI off
    screen
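
The rendering rules above can be sketched without Streamlit. This is only an illustration under stated assumptions: `MARKERS`, `badge`, `visualize_cell`, and `render_preview` are hypothetical stand-ins (not the project's `visualize_hidden_html` or `render_hidden_aware_preview`), plain lists stand in for the DataFrame, and the badge labels and CSS are guesses.

```python
import html

# Hypothetical badge labels; the project's visualizer may use different ones.
MARKERS = {
    "\u00a0": "NBSP",
    "\u200b": "ZWSP",
    "\u200c": "ZWNJ",
    "\u200d": "ZWJ",
    "\ufeff": "BOM",
    "\t": "TAB",
}


def badge(label: str) -> str:
    """Wrap a marker label in a span that CSS can style as a badge."""
    return f'<span class="badge">{html.escape(label)}</span>'


def visualize_cell(value, mark_outer_whitespace: bool = True) -> str:
    """Return HTML for one cell with invisible characters shown as badges."""
    if value is None or value != value:  # None or a float NaN
        return '<span class="nan">NaN</span>'
    text = str(value)
    first_inner = len(text) - len(text.lstrip(" "))  # end of the leading spaces
    last_inner = len(text.rstrip(" "))               # start of the trailing spaces
    parts = []
    for i, ch in enumerate(text):
        if ch in MARKERS:
            parts.append(badge(MARKERS[ch]))
        elif ch == " " and mark_outer_whitespace and (i < first_inner or i >= last_inner):
            parts.append(badge("·"))  # one badge per outer ASCII space
        else:
            parts.append(html.escape(ch))
    return "".join(parts)


def render_preview(headers, rows, n_rows: int = 10) -> str:
    """Build a scrollable HTML table; headers go through the same visualizer."""
    cell_style = "white-space: pre-wrap"
    head_style = cell_style + "; position: sticky; top: 0"
    head = "".join(f'<th style="{head_style}">{visualize_cell(h)}</th>' for h in headers)
    body = "".join(
        "<tr>" + "".join(f'<td style="{cell_style}">{visualize_cell(v)}</td>' for v in row) + "</tr>"
        for row in rows[:n_rows]
    )
    return (
        '<div style="max-height: 26rem; overflow: auto">'
        f"<table><thead><tr>{head}</tr></thead><tbody>{body}</tbody></table></div>"
    )
```

Internal spaces are left untouched here because `white-space: pre-wrap` already keeps them visible; only leading and trailing ASCII spaces get a badge.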

2_Text_Cleaner.py wires it into both previews:
  - The upload preview gains its own "Show hidden characters in preview"
    toggle (default on).
  - The cleaned preview reuses the existing show_hidden toggle that
    already governs the Examples changes table, so one switch controls
    the whole results section.

Turning either toggle off falls back to the original st.dataframe view.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 16:26:30 +00:00

DataTools

A bundle of Python data-cleaning tools for CSV and Excel files. Two tools ship today; more are in development.

#   Tool          What it does
01  Deduplicator  Find and remove duplicate rows with exact + fuzzy matching, smart normalization, and interactive review.
02  Text Cleaner  Trim whitespace, fold smart quotes, strip invisible / control characters, normalize Unicode, normalize line endings, optional case conversion.

Deduplicator

Features

  • Zero-config start — auto-detects encoding, delimiters, headers, and match columns
  • Fuzzy matching — Jaro-Winkler, Levenshtein, and token set ratio algorithms
  • 5 built-in normalizers — email (Gmail dot/plus), phone (E.164), name (titles/suffixes), address (USPS), string (whitespace/case)
  • Merge mode — fill missing fields in the surviving row from removed duplicates
  • 4 survivor rules — keep first, last, most complete, or most recent row per group
  • Interactive review — inspect match groups with inline checkboxes and column dropdowns, cherry-pick values, preview surviving rows live
  • Config profiles — save and reload your settings as JSON for repeatable runs
  • Dual interface — full CLI for automation, Streamlit GUI for visual review
  • Dry-run by default — preview what would change before writing anything
  • Audit trail — every run produces a match groups report and timestamped log
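
To make the normalizer, survivor-rule, and merge features concrete, here is a minimal sketch of exact matching on a normalized email key. It is an illustration only: `normalize_email`, `completeness`, and `dedupe` are hypothetical names, rows are plain dicts rather than whatever structure the tool uses internally, and the tie-breaking behaviour is an assumption.

```python
from collections import defaultdict


def normalize_email(email: str) -> str:
    """Lowercase the address; for Gmail, drop dots and any +tag in the local part."""
    local, _, domain = email.strip().lower().partition("@")
    if domain == "gmail.com":
        local = local.split("+", 1)[0].replace(".", "")
    return f"{local}@{domain}"


def completeness(row: dict) -> int:
    """Count the non-empty fields in a row."""
    return sum(1 for value in row.values() if value not in (None, ""))


def dedupe(rows: list[dict], key: str = "email", merge: bool = True) -> list[dict]:
    """Group rows by normalized key, keep the most complete row per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[normalize_email(row[key])].append(row)

    survivors = []
    for members in groups.values():
        # max() returns the first maximal item, so ties go to the earliest row.
        keep = dict(max(members, key=completeness))
        if merge:
            # Fill the survivor's empty fields from the other rows in the group.
            for other in members:
                for col, value in other.items():
                    if keep.get(col) in (None, "") and value not in (None, ""):
                        keep[col] = value
        survivors.append(keep)
    return survivors
```

With this sketch, `J.Doe+news@gmail.com` and `jdoe@gmail.com` fall into the same group, and a name missing from the surviving row is filled in from the removed one.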

Quick Start

Install

pip install -r requirements.txt

CLI

# Preview duplicates (dry run — no files written)
python -m src.cli customers.csv

# Remove duplicates and save the result
python -m src.cli customers.csv --apply

# Fuzzy-match names at 80% similarity, merge missing fields
python -m src.cli customers.csv --fuzzy name --threshold 80 --merge --apply

# Interactively review each match group
python -m src.cli customers.csv --review --apply

GUI

streamlit run src/gui/app.py

Upload a file, click Find Duplicates, review match groups side-by-side, then download the cleaned result.

CLI Usage Summary

python -m src.cli INPUT_FILE [OPTIONS]

Options:
  --apply                  Write output files (default: preview only)
  --output, -o PATH        Output file path
  --subset, -s COLS        Columns to match on (comma-separated)
  --key, -k COLS           Strong-key columns for exact matching
  --fuzzy COLS             Columns to fuzzy-match
  --algorithm, -a ALG      levenshtein | jaro_winkler | token_set_ratio
  --threshold, -t N        Similarity threshold 0-100 (default: 85)
  --normalize COL:TYPE     Per-column normalizers (e.g., email:email,phone:phone)
  --survivor RULE          first | last | most-complete | most-recent
  --merge                  Fill missing fields from removed duplicates
  --review                 Interactively review each match group
  --config PATH            Load settings from a JSON config file
  --save-config PATH       Save current settings to JSON
  --sheet NAME             Excel sheet name or 0-based index
  --encoding ENC           Override auto-detected encoding
  --header-row N           0-based header row index
  --help                   Show full help
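
As a rough guide to what a 0-100 similarity threshold means, the sketch below scores two strings with a plain normalized Levenshtein distance. It is an assumption-laden illustration, not the tool's scorer: `levenshtein`, `similarity`, and `is_match` are hypothetical names, the case-folding step is a guess, and the scores the CLI reports may be computed differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) using two rolling rows."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Scale the distance to 0-100, where 100 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (longest - levenshtein(a, b)) / longest


def is_match(a: str, b: str, threshold: float = 85.0) -> bool:
    """Treat two values as duplicates when their score reaches the threshold."""
    return similarity(a.casefold(), b.casefold()) >= threshold
```

Under this scoring, "Jon Smith" and "John Smith" differ by one edit over ten characters, so they pass the default threshold of 85 but would fail at 95.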

Sample Output

$ python -m src.cli samples/messy_sales.csv

Reading messy_sales.csv...
  50 rows, 8 columns
Finding duplicates...

──────────────────────────────────────────────────
  File:      messy_sales.csv
  Rows in:   50
  Rows out:  28
  Removed:   22
  Groups:    22
──────────────────────────────────────────────────

Match groups:
  Group 1: rows [1, 2] → keep row 1 (confidence: 100.0%, matched on: email)
  Group 2: rows [3, 4] → keep row 3 (confidence: 92.3%, matched on: name, phone)
  ...

This was a preview. Add --apply to write the output files.

Output Files

When --apply is used, three files are produced:

File                      Contents
{input}_deduplicated.csv  Cleaned data with duplicates removed
{input}_removed.csv       Rows that were removed
{input}_match_groups.csv  Audit trail: group ID, confidence, matched columns, survivor flag

Text Cleaner

Character-level hygiene for messy CSV / Excel input. It handles the dirty-data problems that silently break VLOOKUPs, dedup runs, and downstream imports, and offers two optional transforms:

  • Trailing / leading whitespace and tabs in cells
  • Non-breaking spaces (U+00A0) hiding inside text where regular spaces should be
  • Smart quotes pasted from Word (“ ” ‘ ’)
  • Em / en dashes, ellipsis, other typographic Unicode
  • Zero-width and bidi-mark characters (U+200B, U+200C, U+200D, etc.)
  • BOMs from Excel "Save As CSV UTF-8"
  • Mixed line endings (\r\n, bare \r) inside multi-line cells
  • Control characters (U+0000-U+001F minus \t \n \r)
  • Optional Unicode NFC / NFKC normalization
  • Optional per-column case conversion (UPPER / lower / smart Title / Sentence)
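
Most of the fixes in the list above can be approximated with the standard library alone. The sketch below is an illustration under assumptions, not the tool's implementation: `clean_cell` is a hypothetical name, and the exact replacement choices (for example, folding both dash types to a hyphen) and the order of operations are guesses.

```python
import re
import unicodedata

# Assumed ASCII replacements for common typographic characters.
SMART_PUNCT = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # single smart quotes
    "\u201c": '"', "\u201d": '"',   # double smart quotes
    "\u2013": "-", "\u2014": "-",   # en dash, em dash
    "\u2026": "...",                # ellipsis
})
# Zero-width characters, bidi marks, word joiner, and the BOM.
INVISIBLES = re.compile(r"[\u200b\u200c\u200d\u200e\u200f\u2060\ufeff]")
# U+0000-U+001F except tab, line feed, and carriage return.
CONTROLS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def clean_cell(text: str, form: str = "NFC") -> str:
    """Apply a fixed sequence of character-level fixes to one cell value."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # unify line endings
    text = INVISIBLES.sub("", text)
    text = CONTROLS.sub("", text)
    text = text.replace("\u00a0", " ")                     # NBSP to regular space
    text = text.translate(SMART_PUNCT)
    text = unicodedata.normalize(form, text)               # NFC by default; NFKC is lossy
    return text.strip()                                    # trim outer whitespace
```

Passing `form="NFKC"` additionally folds compatibility characters such as full-width letters, which changes the data more aggressively than NFC.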

# Preview what would change (dry run)
python -m src.cli_text_clean samples/messy_text.csv

# Apply the safe defaults
python -m src.cli_text_clean samples/messy_text.csv --apply

# Title-case the name column, upper-case the SKU column
python -m src.cli_text_clean products.csv --case title:name,upper:sku --apply

# Just trim and collapse — nothing fancy
python -m src.cli_text_clean messy.csv --preset minimal --apply

Three presets are available: minimal (trim + collapse only), excel-hygiene (the default; every safe operation on), and paranoid (adds the lossy NFKC fold).

Outputs {input}_cleaned.csv plus a per-cell {input}_changes.csv audit (row, column, old, new, ops applied).
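
A per-cell audit of this kind can be sketched by comparing the table before and after cleaning. This is a minimal illustration with assumptions: `audit_changes` and `write_audit` are hypothetical names, row numbers are 0-based here, the column headers are guesses at the real file's layout, and the "ops applied" field is left out because tracking it depends on how the cleaner records its steps.

```python
import csv
import io


def audit_changes(columns, before, after):
    """Yield one record per cell whose value differs between the two tables."""
    for row_idx, (old_row, new_row) in enumerate(zip(before, after)):
        for col, old, new in zip(columns, old_row, new_row):
            if old != new:
                yield {"row": row_idx, "column": col, "old": old, "new": new}


def write_audit(records, handle) -> None:
    """Write the change records as CSV to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=["row", "column", "old", "new"])
    writer.writeheader()
    writer.writerows(records)


# Usage: two small tables with one trimmed name and one upper-cased SKU.
columns = ["name", "sku"]
before = [[" Ann ", "a1"], ["Bob", "b2"]]
after = [["Ann", "a1"], ["Bob", "B2"]]

buffer = io.StringIO()
write_audit(audit_changes(columns, before, after), buffer)
```

Unchanged cells produce no record, so the audit stays small even for large files with only a few dirty values.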

See docs/CLI-REFERENCE.md for every flag.

Documentation

Requirements

  • Python 3.10+
  • Dependencies: pandas, openpyxl, rapidfuzz, typer, phonenumbers, loguru, tqdm, charset-normalizer

License

Proprietary. All rights reserved.
