Michael 8dfc6ad8ae feat(analyze): add mixed_line_endings + near_duplicate_rows detectors

Two more detectors close the analyzer gap list:

mixed_line_endings (warn, tool=02): scans raw bytes for combinations of
  CRLF / LF / bare CR. Disaster pattern after multi-source concat
  (Windows + macOS + Linux exports stitched together). Operates on raw
  bytes only — DataFrame-mode analyze() skips it because raw bytes
  aren't available. _load_for_analysis now returns the raw bytes
  alongside the DataFrame and repair result so the detector has them.
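
The byte-level scan described above could look roughly like the sketch below. The function names are hypothetical and not taken from the repository, and it assumes an ASCII-compatible encoding (a UTF-16 file would need decoding first). CRLF is matched before bare CR and LF so the two bytes of a CRLF pair are not counted a second time.

```python
import re

# Alternation order matters: try CRLF first, then bare CR, then LF.
_LINE_ENDING = re.compile(rb"\r\n|\r|\n")
_NAMES = {b"\r\n": "CRLF", b"\n": "LF", b"\r": "CR"}


def count_line_endings(raw: bytes) -> dict[str, int]:
    """Count each line-ending style found in the raw file bytes."""
    counts = {"CRLF": 0, "LF": 0, "CR": 0}
    for match in _LINE_ENDING.finditer(raw):
        counts[_NAMES[match.group()]] += 1
    return counts


def has_mixed_line_endings(raw: bytes) -> bool:
    """Return True when more than one line-ending style is present."""
    styles_present = [name for name, n in count_line_endings(raw).items() if n]
    return len(styles_present) > 1
```

Note that this sketch also counts line breaks inside quoted multi-line cells, which may or may not match what the real detector does.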

near_duplicate_rows (info, tool=01): cheap dedup signal — strip and
  lowercase every string column, then count df.duplicated(). Catches the
  most common case (same customer entered twice with subtle formatting
  differences) without paying for fuzzy matching. Anything more
  sophisticated stays in tool 01.
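
A minimal sketch of the strip-and-lowercase signal, assuming pandas is available. The function names are hypothetical; the actual detector may select string columns differently or treat missing values in its own way.

```python
import pandas as pd


def _normalize_cell(value):
    # Only strings are normalized; numbers, dates, and NaN pass through.
    return value.strip().lower() if isinstance(value, str) else value


def count_near_duplicate_rows(df: pd.DataFrame) -> int:
    """Count rows that repeat an earlier row once string cells are
    stripped and lowercased."""
    normalized = df.apply(lambda column: column.map(_normalize_cell))
    return int(normalized.duplicated().sum())
```

With the default `keep="first"`, `duplicated()` flags every repeat after the first occurrence, so two rows that differ only in case or padding contribute a count of 1.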

Six new tests cover both detectors plus the dataframe-mode skip path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-29 16:09:42 +00:00

Name	Last commit	Date
.streamlit	fix: remove headless=true so browser opens on launch	2026-04-29 01:23:07 +00:00
docs	feat: implement text cleaner (script 02) with CLI, GUI, and tests	2026-04-29 15:14:15 +00:00
samples	feat: implement text cleaner (script 02) with CLI, GUI, and tests	2026-04-29 15:14:15 +00:00
src	feat(analyze): add mixed_line_endings + near_duplicate_rows detectors	2026-04-29 16:09:42 +00:00
test-cases	test: add text-cleaner corpus and close gaps surfaced by it	2026-04-29 15:37:35 +00:00
tests	feat(analyze): add mixed_line_endings + near_duplicate_rows detectors	2026-04-29 16:09:42 +00:00
.gitignore	feat: add documentation, Streamlit GUI, and full source tree	2026-04-28 23:06:39 +00:00
pytest.ini	test: single-command runner, cross-platform automation, fixture auto-discovery	2026-04-29 16:01:06 +00:00
README.md	feat: implement text cleaner (script 02) with CLI, GUI, and tests	2026-04-29 15:14:15 +00:00
requirements-dev.txt	feat: add documentation, Streamlit GUI, and full source tree	2026-04-28 23:06:39 +00:00
requirements.txt	feat: add documentation, Streamlit GUI, and full source tree	2026-04-28 23:06:39 +00:00
run_tests.py	test: single-command runner, cross-platform automation, fixture auto-discovery	2026-04-29 16:01:06 +00:00
tox.ini	test: single-command runner, cross-platform automation, fixture auto-discovery	2026-04-29 16:01:06 +00:00

README.md

DataTools

A bundle of Python data-cleaning tools for CSV and Excel files. Two scripts ship today; more are in development.

#	Tool	What it does
01	Deduplicator	Find and remove duplicate rows with exact + fuzzy matching, smart normalization, and interactive review.
02	Text Cleaner	Trim whitespace, fold smart quotes, strip invisible / control characters, normalize Unicode, normalize line endings, optional case conversion.

Deduplicator

Features

Zero-config start — auto-detects encoding, delimiters, headers, and match columns
Fuzzy matching — Jaro-Winkler, Levenshtein, and token set ratio algorithms
5 built-in normalizers — email (Gmail dot/plus), phone (E.164), name (titles/suffixes), address (USPS), string (whitespace/case)
Merge mode — fill missing fields in the surviving row from removed duplicates
4 survivor rules — keep first, last, most complete, or most recent row per group
Interactive review — inspect match groups with inline checkboxes and column dropdowns, cherry-pick values, preview surviving rows live
Config profiles — save and reload your settings as JSON for repeatable runs
Dual interface — full CLI for automation, Streamlit GUI for visual review
Dry-run by default — preview what would change before writing anything
Audit trail — every run produces a match groups report and timestamped log
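
To illustrate what the email normalizer's "Gmail dot/plus" handling might involve, here is a hedged sketch. `normalize_email` is a hypothetical name, and the rules shown (lowercase everything, then drop dots and any `+tag` from the local part for gmail.com only) are an assumption about the behavior, not the tool's actual code.

```python
def normalize_email(address: str) -> str:
    """Fold an email address to a comparison key (illustrative only)."""
    address = address.strip().lower()
    local, sep, domain = address.rpartition("@")
    if not sep:
        # No "@" found: not an email address, so only trim and lowercase.
        return address
    if domain == "gmail.com":
        # Gmail ignores dots and anything after "+" in the local part.
        local = local.split("+", 1)[0].replace(".", "")
    return f"{local}@{domain}"
```

Two addresses that normalize to the same key can then be treated as an exact match even though the raw strings differ.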

Quick Start

Install

pip install -r requirements.txt

CLI

# Preview duplicates (dry run — no files written)
python -m src.cli customers.csv

# Remove duplicates and save the result
python -m src.cli customers.csv --apply

# Fuzzy-match names at 80% similarity, merge missing fields
python -m src.cli customers.csv --fuzzy name --threshold 80 --merge --apply

# Interactively review each match group
python -m src.cli customers.csv --review --apply

GUI

streamlit run src/gui/app.py

Upload a file, click Find Duplicates, review match groups side-by-side, then download the cleaned result.

CLI Usage Summary

python -m src.cli INPUT_FILE [OPTIONS]

Options:
  --apply                  Write output files (default: preview only)
  --output, -o PATH        Output file path
  --subset, -s COLS        Columns to match on (comma-separated)
  --key, -k COLS           Strong-key columns for exact matching
  --fuzzy COLS             Columns to fuzzy-match
  --algorithm, -a ALG      levenshtein | jaro_winkler | token_set_ratio
  --threshold, -t N        Similarity threshold 0-100 (default: 85)
  --normalize COL:TYPE     Per-column normalizers (e.g., email:email,phone:phone)
  --survivor RULE          first | last | most-complete | most-recent
  --merge                  Fill missing fields from removed duplicates
  --review                 Interactively review each match group
  --config PATH            Load settings from a JSON config file
  --save-config PATH       Save current settings to JSON
  --sheet NAME             Excel sheet name or 0-based index
  --encoding ENC           Override auto-detected encoding
  --header-row N           0-based header row index
  --help                   Show full help
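
As a rough illustration of how a 0-100 similarity threshold can be applied, the sketch below computes a Levenshtein-based score in pure Python. The tool lists rapidfuzz as a dependency, and its scores may be normalized differently; the function names here, and the choice of `>=` for the comparison, are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance using the classic two-row dynamic programme."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Scale the edit distance to a 0-100 score (100 means identical)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (longest - levenshtein(a, b)) / longest


def is_match(a: str, b: str, threshold: float = 85.0) -> bool:
    return similarity(a, b) >= threshold
```

Under this scoring, "Jon Smith" and "John Smith" differ by one edit over ten characters, so they score 90 and match at the default threshold of 85.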

Sample Output

$ python -m src.cli samples/messy_sales.csv

Reading messy_sales.csv...
  50 rows, 8 columns
Finding duplicates...

──────────────────────────────────────────────────
  File:      messy_sales.csv
  Rows in:   50
  Rows out:  28
  Removed:   22
  Groups:    22
──────────────────────────────────────────────────

Match groups:
  Group 1: rows [1, 2] → keep row 1 (confidence: 100.0%, matched on: email)
  Group 2: rows [3, 4] → keep row 3 (confidence: 92.3%, matched on: name, phone)
  ...

This was a preview. Add --apply to write the output files.

Output Files

When --apply is used, three files are produced:

File	Contents
`{input}_deduplicated.csv`	Cleaned data with duplicates removed
`{input}_removed.csv`	Rows that were removed
`{input}_match_groups.csv`	Audit trail: group ID, confidence, matched columns, survivor flag
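
The naming pattern in the table can be sketched with `pathlib`. This assumes `{input}` means the input file name without its extension and that the outputs are written next to the input; neither is confirmed here, and `output_paths` is a hypothetical helper.

```python
from pathlib import Path


def output_paths(input_file: str) -> dict[str, Path]:
    """Derive the three output paths from the input file path."""
    source = Path(input_file)
    return {
        kind: source.with_name(f"{source.stem}_{kind}.csv")
        for kind in ("deduplicated", "removed", "match_groups")
    }
```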

Text Cleaner

Character-level hygiene for messy CSV / Excel input. Solves the dirty-data failure modes that silently break VLOOKUPs, dedup runs, and downstream imports:

Trailing / leading whitespace and tabs in cells
Non-breaking spaces (U+00A0) hiding inside text where regular spaces should be
Smart quotes pasted from Word (“ ” ‘ ’ → " " ' ')
Em / en dashes, ellipsis, other typographic Unicode
Zero-width and bidi-mark characters (U+200B, U+200C, U+200D, etc.)
BOMs from Excel "Save As CSV UTF-8"
Mixed line endings (\r\n, bare \r) inside multi-line cells
Control characters (U+0000-U+001F minus \t \n \r)
Optional Unicode NFC / NFKC normalization
Optional per-column case conversion (UPPER / lower / smart Title / Sentence)
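
Several of the fixes listed above can be sketched for a single cell using only the standard library. `clean_cell` and the specific replacement choices (for example, folding both dash types to a plain hyphen) are assumptions for illustration and may differ from what the Text Cleaner actually does.

```python
import re
import unicodedata

# Typographic characters folded to ASCII equivalents (assumed mapping).
_TYPOGRAPHIC = str.maketrans({
    "\u00a0": " ",                  # non-breaking space
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u2013": "-", "\u2014": "-",   # en dash, em dash
    "\u2026": "...",                # ellipsis
})
# Zero-width space / non-joiner / joiner, LRM / RLM bidi marks, and the BOM.
_INVISIBLE = re.compile(r"[\u200b-\u200f\ufeff]")
# C0 control characters except tab, line feed, and carriage return.
_CONTROL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def clean_cell(text: str) -> str:
    text = unicodedata.normalize("NFC", text)
    text = text.translate(_TYPOGRAPHIC)
    text = _INVISIBLE.sub("", text)
    text = _CONTROL.sub("", text)
    # Normalize CRLF and bare CR inside multi-line cells to LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return text.strip()
```

Stripping the zero-width joiner also breaks joined emoji sequences, so a real tool might make that step optional.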

# Preview what would change (dry-run)
python -m src.cli_text_clean samples/messy_text.csv

# Apply the safe defaults
python -m src.cli_text_clean samples/messy_text.csv --apply

# Title-case the name column, upper-case the SKU column
python -m src.cli_text_clean products.csv --case title:name,upper:sku --apply

# Just trim and collapse — nothing fancy
python -m src.cli_text_clean messy.csv --preset minimal --apply

Three presets: minimal (trim + collapse only), excel-hygiene (default; all safe operations enabled), paranoid (adds the lossy NFKC fold).

Outputs {input}_cleaned.csv plus a per-cell {input}_changes.csv audit (row, column, old, new, ops applied).
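
One way a per-cell audit like this can be assembled is sketched below. The record keys mirror the columns named above, but `clean_with_audit`, the 1-based row numbering, and the `|` separator for the ops field are all hypothetical choices.

```python
def clean_with_audit(rows, ops):
    """Apply named cleaning ops to every cell and record what changed.

    rows: list of dicts mapping column name to string value.
    ops:  dict mapping an op name to a str -> str function, applied in order.
    """
    cleaned, changes = [], []
    for row_number, row in enumerate(rows, start=1):
        new_row = {}
        for column, old in row.items():
            new, applied = old, []
            for name, op in ops.items():
                result = op(new)
                if result != new:
                    applied.append(name)
                    new = result
            if applied:
                changes.append({
                    "row": row_number,
                    "column": column,
                    "old": old,
                    "new": new,
                    "ops": "|".join(applied),
                })
            new_row[column] = new
        cleaned.append(new_row)
    return cleaned, changes
```

Only ops that actually altered the cell are listed, so unchanged cells produce no audit record.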

See docs/CLI-REFERENCE.md for every flag.

Documentation

CLI Reference — every flag with examples and recipe sections
Developer Guide — architecture, data flow, how to extend

Requirements

Python 3.10+
Dependencies: pandas, openpyxl, rapidfuzz, typer, phonenumbers, loguru, tqdm, charset-normalizer

License

Languages

Python 87.3%

HTML 10%

CSS 1.8%

Shell 0.4%

JavaScript 0.2%

Other 0.2%