The whole point of the cleaner is to remove characters the user can't
see — which makes the "before / after" preview nearly useless by default.
A cell with NBSP padding looks identical to a cell with regular spaces.
Two new helpers in src.core.text_clean:
visualize_hidden_text(s)
Plain-text rendering: each invisible/control/smart character is
replaced by a glyph + [LABEL] (e.g. "·[NBSP]", "→[TAB]", "∅[ZWSP]",
"""[L DQUOTE]"). Suitable for terminal output, CSV exports, anywhere
HTML is wrong. Unmapped C0 controls render as [U+XXXX].
visualize_hidden_html(s) + hidden_char_css()
HTML rendering: every flagged character is wrapped in a <span> with
a CSS class and a tooltip showing the codepoint and label. Pair with
hidden_char_css() to inject the matching styles. Three colour bands
(whitespace, special, control) so the user can scan an audit table
and spot what's being changed at a glance.
Mapping covers: ASCII tab/LF/CR, every NBSP variant (U+00A0, U+202F,
U+2009, …), zero-width family (ZWSP/ZWNJ/ZWJ/WJ/BOM/SHY), bidi marks
(LRM/RLM), all smart quotes, en/em dashes, ellipsis, prime/double-prime,
and guillemets. ASCII printable text passes through; HTML output also
escapes &/</> .
GUI wiring (src/gui/pages/2_Text_Cleaner.py)
The "Examples" changes table now defaults to a hidden-char-rendered
HTML view: every NBSP/ZWSP/smart-quote/control char is shown with its
badge and codepoint tooltip. A "Show hidden characters" toggle lets
the user fall back to the raw st.dataframe view if they prefer.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>