feat(text_clean): visualize hidden characters in the cleaner GUI

The whole point of the cleaner is to remove characters the user can't
see — which makes the "before / after" preview nearly useless by default.
A cell with NBSP padding looks identical to a cell with regular spaces.

Two new helpers in src.core.text_clean:

  visualize_hidden_text(s)
    Plain-text rendering: each invisible/control/smart character is
    replaced by a glyph + [LABEL] (e.g. "·[NBSP]", "→[TAB]", "∅[ZWSP]",
    "“[L DQUOTE]"). Suitable for terminal output, CSV exports, anywhere
    HTML is wrong. Unmapped C0 controls render as [U+XXXX].

  visualize_hidden_html(s) + hidden_char_css()
    HTML rendering: every flagged character is wrapped in a <span> with
    a CSS class and a tooltip showing the codepoint and label. Pair with
    hidden_char_css() to inject the matching styles. Three colour bands
    (whitespace, special, control) so the user can scan an audit table
    and spot what's being changed at a glance.
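
A minimal sketch of the HTML side, to make the span/tooltip/band idea
concrete. This is not the code in src.core.text_clean: the table
contents, the class names (hc, hc-whitespace, ...), the colours, the
choice of band for each character, and the *_sketch function names are
all assumptions made for illustration.

```python
import html

# Hypothetical table: character -> (band, label). The three band names
# come from the commit message; which band each character falls into is
# an assumption.
HIDDEN_BANDS = {
    "\u00a0": ("whitespace", "NBSP"),
    "\u201c": ("special", "L DQUOTE"),
    "\x07": ("control", "BEL"),
}


def visualize_hidden_html_sketch(s: str) -> str:
    """Wrap flagged characters in a <span>; HTML-escape everything else."""
    parts = []
    for ch in s:
        entry = HIDDEN_BANDS.get(ch)
        if entry is None:
            # Ordinary text: escape &, < and > so it is safe inside HTML.
            parts.append(html.escape(ch, quote=False))
            continue
        band, label = entry
        title = html.escape(f"U+{ord(ch):04X} {label}")
        parts.append(
            f'<span class="hc hc-{band}" title="{title}">[{label}]</span>'
        )
    return "".join(parts)


def hidden_char_css_sketch() -> str:
    """Return a <style> block matching the classes used above."""
    return (
        "<style>"
        ".hc { border-radius: 3px; padding: 0 2px; font-size: 0.85em; }"
        ".hc-whitespace { background: #e3f2fd; }"
        ".hc-special { background: #fff3e0; }"
        ".hc-control { background: #ffebee; }"
        "</style>"
    )
```

Keeping the CSS in its own function means the caller decides when to
inject it, independently of how many strings are rendered.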

Mapping covers: ASCII tab/LF/CR, NBSP and related Unicode spaces
(U+00A0, U+202F, U+2009, …), zero-width family
(ZWSP/ZWNJ/ZWJ/WJ/BOM/SHY), bidi marks (LRM/RLM), all smart quotes,
en/em dashes, ellipsis, prime/double-prime, and guillemets. Printable
ASCII passes through unchanged; HTML output also escapes &, < and >.
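
The plain-text rendering can be pictured as a lookup table plus a
fallback for unmapped C0 controls. The sketch below is illustrative
only: it uses just the four glyph/label pairs quoted in this message,
the names HIDDEN_GLYPHS and visualize_hidden_text_sketch are
hypothetical, and the real table is larger (for example, LF and CR are
mapped there but fall through to the [U+XXXX] form here).

```python
# Hypothetical subset of the mapping: character -> (glyph, label).
HIDDEN_GLYPHS = {
    "\t": ("\u2192", "TAB"),            # rightwards arrow
    "\u00a0": ("\u00b7", "NBSP"),       # middle dot
    "\u200b": ("\u2205", "ZWSP"),       # empty-set sign
    "\u201c": ("\u201c", "L DQUOTE"),   # the quote itself
}


def visualize_hidden_text_sketch(s: str) -> str:
    """Replace mapped characters with glyph + [LABEL].

    Unmapped C0 controls (code points below U+0020) become [U+XXXX];
    everything else passes through unchanged.
    """
    out = []
    for ch in s:
        if ch in HIDDEN_GLYPHS:
            glyph, label = HIDDEN_GLYPHS[ch]
            out.append(f"{glyph}[{label}]")
        elif ord(ch) < 0x20:
            out.append(f"[U+{ord(ch):04X}]")
        else:
            out.append(ch)
    return "".join(out)
```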

GUI wiring (src/gui/pages/2_Text_Cleaner.py)
  The "Examples" changes table now defaults to a hidden-char-rendered
  HTML view: every NBSP/ZWSP/smart-quote/control char is shown with its
  badge and codepoint tooltip. A "Show hidden characters" toggle lets
  the user fall back to the raw st.dataframe view if they prefer.
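
  One way to keep the table assembly testable without Streamlit is a
  small pure helper that takes the renderer as a parameter. This is a
  sketch under assumptions, not what the page does: the name
  build_examples_table and its signature are hypothetical, and it
  escapes the row and ops_applied fields with html.escape as a
  precaution.

```python
import html
from typing import Callable, Iterable, Mapping


def build_examples_table(
    records: Iterable[Mapping[str, object]],
    render: Callable[[str], str] = html.escape,
) -> str:
    """Build the examples table as an HTML string.

    `render` is applied to the column/old/new cells; the remaining
    cells are plain-escaped so nothing reaches the page as raw markup.
    """
    rows = []
    for rec in records:
        rows.append(
            "<tr>"
            f"<td>{html.escape(str(rec['row']))}</td>"
            f"<td><code>{render(str(rec['column']))}</code></td>"
            f"<td>{render(str(rec['old']))}</td>"
            f"<td>{render(str(rec['new']))}</td>"
            f"<td><code>{html.escape(str(rec['ops_applied']))}</code></td>"
            "</tr>"
        )
    return (
        "<table class='hidden-char-table'>"
        "<thead><tr><th>Row</th><th>Column</th><th>Before</th>"
        "<th>After</th><th>Ops applied</th></tr></thead>"
        f"<tbody>{''.join(rows)}</tbody>"
        "</table>"
    )
```

  In the page this could be called with
  examples.to_dict("records") and render=visualize_hidden_html, with
  the result passed to st.markdown(..., unsafe_allow_html=True).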

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 16:14:14 +00:00
parent 794d4cda94
commit 90ceada2d1
4 changed files with 284 additions and 1 deletions

@@ -19,6 +19,8 @@ from src.core.text_clean import (
     PRESETS,
     CleanOptions,
     clean_dataframe,
+    hidden_char_css,
+    visualize_hidden_html,
 )
 
 hide_streamlit_chrome()
@@ -205,9 +207,54 @@ if result.cells_changed:
     )
 
     st.markdown("**Examples (first 25 changes)**")
+    show_hidden = st.toggle(
+        "Show hidden characters (NBSP, ZWSP, smart quotes, control chars…)",
+        value=True,
+        help=(
+            "Highlights characters the cleaner is removing or replacing. "
+            "Hover any badge to see the codepoint and label."
+        ),
+        key="textclean_show_hidden",
+    )
     examples = result.changes.head(25).copy()
     examples["row"] = examples["row"] + 1
-    st.dataframe(examples, use_container_width=True, hide_index=True)
+    if show_hidden:
+        # Inject the badge CSS once, then render an HTML table so the
+        # invisibles in old/new are actually visible to the user.
+        st.markdown(hidden_char_css(), unsafe_allow_html=True)
+        rows_html = []
+        for _, row in examples.iterrows():
+            rows_html.append(
+                "<tr>"
+                f"<td>{row['row']}</td>"
+                f"<td><code>{visualize_hidden_html(str(row['column']))}</code></td>"
+                f"<td>{visualize_hidden_html(str(row['old']))}</td>"
+                f"<td>{visualize_hidden_html(str(row['new']))}</td>"
+                f"<td><code>{row['ops_applied']}</code></td>"
+                "</tr>"
+            )
+        st.markdown(
+            "<table class='hidden-char-table'>"
+            "<thead><tr>"
+            "<th style='text-align:left'>Row</th>"
+            "<th style='text-align:left'>Column</th>"
+            "<th style='text-align:left'>Before</th>"
+            "<th style='text-align:left'>After</th>"
+            "<th style='text-align:left'>Ops applied</th>"
+            "</tr></thead>"
+            f"<tbody>{''.join(rows_html)}</tbody>"
+            "</table>"
+            "<style>"
+            ".hidden-char-table { width: 100%; border-collapse: collapse; }"
+            ".hidden-char-table th, .hidden-char-table td { "
+            "  padding: 4px 8px; border-bottom: 1px solid #eee; "
+            "  vertical-align: top; }"
+            ".hidden-char-table tbody tr:hover { background: #fafafa; }"
+            "</style>",
+            unsafe_allow_html=True,
+        )
+    else:
+        st.dataframe(examples, use_container_width=True, hide_index=True)
 
     st.markdown("**Cleaned preview (first 10 rows)**")
     st.dataframe(result.cleaned_df.head(10), use_container_width=True)