datatools-dev

giteadmin/datatools-dev

Fork 0

Commit Graph

Author	SHA1	Message	Date
Michael	90ceada2d1	feat(text_clean): visualize hidden characters in the cleaner GUI The whole point of the cleaner is to remove characters the user can't see — which makes the "before / after" preview nearly useless by default. A cell with NBSP padding looks identical to a cell with regular spaces. Two new helpers in src.core.text_clean: visualize_hidden_text(s) Plain-text rendering: each invisible/control/smart character is replaced by a glyph + [LABEL] (e.g. "·[NBSP]", "→[TAB]", "∅[ZWSP]", """[L DQUOTE]"). Suitable for terminal output, CSV exports, anywhere HTML is wrong. Unmapped C0 controls render as [U+XXXX]. visualize_hidden_html(s) + hidden_char_css() HTML rendering: every flagged character is wrapped in a <span> with a CSS class and a tooltip showing the codepoint and label. Pair with hidden_char_css() to inject the matching styles. Three colour bands (whitespace, special, control) so the user can scan an audit table and spot what's being changed at a glance. Mapping covers: ASCII tab/LF/CR, every NBSP variant (U+00A0, U+202F, U+2009, …), zero-width family (ZWSP/ZWNJ/ZWJ/WJ/BOM/SHY), bidi marks (LRM/RLM), all smart quotes, en/em dashes, ellipsis, prime/double-prime, and guillemets. ASCII printable text passes through; HTML output also escapes &/</> . GUI wiring (src/gui/pages/2_Text_Cleaner.py) The "Examples" changes table now defaults to a hidden-char-rendered HTML view: every NBSP/ZWSP/smart-quote/control char is shown with its badge and codepoint tooltip. A "Show hidden characters" toggle lets the user fall back to the raw st.dataframe view if they prefer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 16:14:14 +00:00
Michael	54f92ae47e	feat: implement text cleaner (script 02) with CLI, GUI, and tests Builds 02_text_cleaner.py from stub to working: character-level hygiene for CSV/Excel inputs covering trim, whitespace collapse, smart-character folding, Unicode NFC/NFKC, BOM strip, zero-width strip, control-char strip, line-ending normalization, and per-column case conversion. Three presets (minimal/excel-hygiene/paranoid) keep the buyer surface small. - src/core/text_clean.py: pure helpers + CleanOptions/CleanResult + clean_dataframe with dtype-safe column selection - src/cli_text_clean.py: Typer CLI mirroring the dedup CLI shape (dry-run by default, --apply writes cleaned + changes audit, JSON config save/load) - src/gui/pages/2_Text_Cleaner.py: real Streamlit page with preset picker, advanced toggles, preview, before/after metrics, and three download buttons - tests/test_text_clean.py + test_cli_text_clean.py: 92 new tests covering edge cases E1-E50 from the spec - samples/messy_text.csv: demo dataset surfacing UC1, UC3, UC6, UC10 in 10 rows - test-cases/uc16-uc26 + ec05-ec09: per-use-case and per-edge-case fixtures Docs: TECHNICAL.md §10.2 (full Tier 1/2/3 spec), DECISIONS.md v1.7 entry locking the spec, CLI-REFERENCE.md gains the text cleaner section, README.md gains a top-level Text Cleaner block, USER-GUIDE.md status row 02 promoted Skeleton -> Working. 200/200 tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 15:14:15 +00:00

Author

SHA1

Message

Date

Michael

90ceada2d1

feat(text_clean): visualize hidden characters in the cleaner GUI

The whole point of the cleaner is to remove characters the user can't
see — which makes the "before / after" preview nearly useless by default.
A cell with NBSP padding looks identical to a cell with regular spaces.

Two new helpers in src.core.text_clean:

  visualize_hidden_text(s)
    Plain-text rendering: each invisible/control/smart character is
    replaced by a glyph + [LABEL] (e.g. "·[NBSP]", "→[TAB]", "∅[ZWSP]",
    """[L DQUOTE]"). Suitable for terminal output, CSV exports, anywhere
    HTML is wrong. Unmapped C0 controls render as [U+XXXX].

  visualize_hidden_html(s) + hidden_char_css()
    HTML rendering: every flagged character is wrapped in a <span> with
    a CSS class and a tooltip showing the codepoint and label. Pair with
    hidden_char_css() to inject the matching styles. Three colour bands
    (whitespace, special, control) so the user can scan an audit table
    and spot what's being changed at a glance.

Mapping covers: ASCII tab/LF/CR, every NBSP variant (U+00A0, U+202F,
U+2009, …), zero-width family (ZWSP/ZWNJ/ZWJ/WJ/BOM/SHY), bidi marks
(LRM/RLM), all smart quotes, en/em dashes, ellipsis, prime/double-prime,
and guillemets. ASCII printable text passes through; HTML output also
escapes &/</> .

GUI wiring (src/gui/pages/2_Text_Cleaner.py)
  The "Examples" changes table now defaults to a hidden-char-rendered
  HTML view: every NBSP/ZWSP/smart-quote/control char is shown with its
  badge and codepoint tooltip. A "Show hidden characters" toggle lets
  the user fall back to the raw st.dataframe view if they prefer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-29 16:14:14 +00:00

Michael

54f92ae47e

feat: implement text cleaner (script 02) with CLI, GUI, and tests

Builds 02_text_cleaner.py from stub to working: character-level hygiene
for CSV/Excel inputs covering trim, whitespace collapse, smart-character
folding, Unicode NFC/NFKC, BOM strip, zero-width strip, control-char
strip, line-ending normalization, and per-column case conversion. Three
presets (minimal/excel-hygiene/paranoid) keep the buyer surface small.

- src/core/text_clean.py: pure helpers + CleanOptions/CleanResult +
  clean_dataframe with dtype-safe column selection
- src/cli_text_clean.py: Typer CLI mirroring the dedup CLI shape
  (dry-run by default, --apply writes cleaned + changes audit, JSON
  config save/load)
- src/gui/pages/2_Text_Cleaner.py: real Streamlit page with preset
  picker, advanced toggles, preview, before/after metrics, and three
  download buttons
- tests/test_text_clean.py + test_cli_text_clean.py: 92 new tests
  covering edge cases E1-E50 from the spec
- samples/messy_text.csv: demo dataset surfacing UC1, UC3, UC6, UC10
  in 10 rows
- test-cases/uc16-uc26 + ec05-ec09: per-use-case and per-edge-case
  fixtures

Docs: TECHNICAL.md §10.2 (full Tier 1/2/3 spec), DECISIONS.md v1.7
entry locking the spec, CLI-REFERENCE.md gains the text cleaner
section, README.md gains a top-level Text Cleaner block, USER-GUIDE.md
status row 02 promoted Skeleton -> Working.

200/200 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-29 15:14:15 +00:00

2 Commits