The analyzer's "Run Analysis" panel rendered sample cells via st.dataframe,
which (a) silently collapses leading/trailing ASCII whitespace and (b)
displays NBSP/ZWSP/control chars as nothing. The user couldn't see the
exact pollution they were being told about.
visualize_hidden_html gains a mark_outer_whitespace option (off by
default). When True, it wraps each leading and trailing ASCII space/tab
in its own badge with a "SP LEAD" / "SP TRAIL" tooltip. The badges are
per-character so the user can count exactly how much padding the
cleaner will strip.
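A minimal sketch of the per-character badge idea, not the code in this
commit. The function name badge_outer_whitespace, the hc-ws class, the
"·" / "→" glyphs and the "TAB LEAD" / "TAB TRAIL" labels are assumptions
made for illustration; only the "SP LEAD" / "SP TRAIL" tooltips come from
the description above.

```python
import html

ASCII_PAD = " \t"  # only ASCII space and tab count as outer padding here


def badge_outer_whitespace(s: str) -> str:
    """Wrap each leading/trailing ASCII space or tab in its own badge (sketch)."""
    no_lead = s.lstrip(ASCII_PAD)
    n_lead = len(s) - len(no_lead)
    core = no_lead.rstrip(ASCII_PAD)
    n_trail = len(no_lead) - len(core)

    def badge(ch: str, side: str) -> str:
        # Hypothetical glyphs and class name; one <span> per character.
        kind, glyph = ("SP", "·") if ch == " " else ("TAB", "→")
        return f'<span class="hc-ws" title="{kind} {side}">{glyph}</span>'

    lead = "".join(badge(c, "LEAD") for c in s[:n_lead])
    trail = "".join(badge(c, "TRAIL") for c in s[len(s) - n_trail:])
    # An all-whitespace string is consumed by lstrip, so it is all "LEAD".
    return lead + html.escape(core) + trail
```

Because lstrip runs first, a string made only of spaces and tabs yields
leading badges and no trailing ones, which matches the all-whitespace
case listed in the tests.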
components.render_findings_panel now:
- injects hidden_char_css() once at the top of the panel
- replaces st.dataframe(samples) with a custom HTML table
- renders the value column with mark_outer_whitespace=True
- applies white-space: pre-wrap on value cells so any internal ASCII
  whitespace also stays visible (browsers collapse runs by default)
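A rough sketch of what such a table could look like, assuming samples
arrive as (column, value) pairs. samples_table_html is a hypothetical
name, and html.escape stands in for the real
visualize_hidden_html(..., mark_outer_whitespace=True) call so the
snippet runs on its own.

```python
import html


def samples_table_html(rows) -> str:
    """Build a small HTML table from (column, value) pairs (sketch)."""
    body = "".join(
        "<tr>"
        f"<td>{html.escape(str(col))}</td>"
        # pre-wrap keeps runs of internal spaces visible in the browser
        f'<td style="white-space: pre-wrap">{html.escape(str(val))}</td>'
        "</tr>"
        for col, val in rows
    )
    head = "<thead><tr><th>column</th><th>value</th></tr></thead>"
    return f"<table>{head}<tbody>{body}</tbody></table>"
```

In Streamlit, a string like this can be shown with
st.markdown(table_html, unsafe_allow_html=True); whether the panel uses
that call or another mechanism is not stated here.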
Four new tests cover: leading+trailing badge counts, default-off
behaviour, leading tab badge, all-whitespace string treated entirely
as leading.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The whole point of the cleaner is to remove characters the user can't
see, which makes the "before / after" preview nearly useless by default:
a cell with NBSP padding looks identical to a cell with regular spaces.
Two new helpers in src.core.text_clean:
visualize_hidden_text(s)
Plain-text rendering: each invisible/control/smart character is
replaced by a glyph + [LABEL] (e.g. "·[NBSP]", "→[TAB]", "∅[ZWSP]",
"“[L DQUOTE]"). Suitable for terminal output, CSV exports, or anywhere
HTML is not an option. Unmapped C0 controls render as [U+XXXX].
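The glyph + [LABEL] scheme can be sketched as follows. This is an
illustration, not the helper itself: show_hidden_text and HIDDEN are
hypothetical names, and the table holds only three of the mapped
characters, so LF/CR fall through to the [U+XXXX] branch here even
though the real mapping covers them.

```python
# Tiny illustrative subset of the mapping: char -> (glyph, label)
HIDDEN = {
    "\u00a0": ("·", "NBSP"),
    "\t": ("→", "TAB"),
    "\u200b": ("∅", "ZWSP"),
}


def show_hidden_text(s: str) -> str:
    """Replace flagged characters with glyph + [LABEL] (sketch)."""
    out = []
    for ch in s:
        if ch in HIDDEN:
            glyph, label = HIDDEN[ch]
            out.append(f"{glyph}[{label}]")
        elif ord(ch) < 0x20:
            # C0 control with no entry in the table: show the codepoint
            out.append(f"[U+{ord(ch):04X}]")
        else:
            out.append(ch)
    return "".join(out)


print(show_hidden_text("a\u00a0b"))  # a·[NBSP]b
```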
visualize_hidden_html(s) + hidden_char_css()
HTML rendering: every flagged character is wrapped in a <span> with
a CSS class and a tooltip showing the codepoint and label. Pair with
hidden_char_css() to inject the matching styles. Three colour bands
(whitespace, special, control) so the user can scan an audit table
and spot what's being changed at a glance.
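A sketch of the span-plus-stylesheet pairing, under stated assumptions:
the names show_hidden_html and hidden_css, the hc-* class names, the
colours, the generic "CTRL" label, and which character lands in which
band are all invented for illustration. Only the three-band split and
the codepoint-plus-label tooltip come from the description above.

```python
import html

# char -> (css class, label); hypothetical band assignment
FLAGGED = {
    "\u00a0": ("hc-ws", "NBSP"),
    "\u200b": ("hc-special", "ZWSP"),
}


def hidden_css() -> str:
    """One rule per colour band (whitespace, special, control)."""
    return (
        ".hc-ws{background:#fff3bf}"
        ".hc-special{background:#d0ebff}"
        ".hc-ctrl{background:#ffc9c9}"
    )


def show_hidden_html(s: str) -> str:
    """Wrap flagged characters in a <span> with class and tooltip (sketch)."""
    parts = []
    for ch in s:
        if ch in FLAGGED:
            css, label = FLAGGED[ch]
        elif ord(ch) < 0x20 or ord(ch) == 0x7F:
            css, label = "hc-ctrl", "CTRL"
        else:
            # Ordinary text: escape only (html.escape also escapes quotes)
            parts.append(html.escape(ch))
            continue
        title = f"U+{ord(ch):04X} {label}"
        parts.append(f'<span class="{css}" title="{title}">{label}</span>')
    return "".join(parts)
```

The stylesheet only needs to be emitted once per page, wrapped in a
style element, before any of the spans that rely on it.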
Mapping covers: ASCII tab/LF/CR, NBSP and related Unicode spaces (U+00A0,
U+202F, U+2009, …), zero-width family (ZWSP/ZWNJ/ZWJ/WJ/BOM/SHY), bidi marks
(LRM/RLM), all smart quotes, en/em dashes, ellipsis, prime/double-prime,
and guillemets. Printable ASCII text passes through unchanged; HTML output
also escapes &, <, and >.
GUI wiring (src/gui/pages/2_Text_Cleaner.py)
The "Examples" changes table now defaults to a hidden-char-rendered
HTML view: every NBSP/ZWSP/smart-quote/control char is shown with its
badge and codepoint tooltip. A "Show hidden characters" toggle lets
the user fall back to the raw st.dataframe view if they prefer.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>