feat(text_clean): visualize hidden characters in the cleaner GUI

The whole point of the cleaner is to remove characters the user can't
see, which makes the "before / after" preview nearly useless by default.
A cell with NBSP padding looks identical to a cell with regular spaces.

Two new helpers in src.core.text_clean:

  visualize_hidden_text(s)
    Plain-text rendering: each invisible/control/smart character is
    replaced by a glyph + [LABEL] (e.g. "·[NBSP]", "→[TAB]", "∅[ZWSP]",
    "“[L DQUOTE]"). Suitable for terminal output, CSV exports, or anywhere
    HTML is inappropriate. Unmapped C0 controls render as [U+XXXX].

  visualize_hidden_html(s) + hidden_char_css()
    HTML rendering: every flagged character is wrapped in a <span> with
    a CSS class and a tooltip showing the codepoint and label. Pair with
    hidden_char_css() to inject the matching styles. Three colour bands
    (whitespace, special, control) let the user scan an audit table
    and spot what's being changed at a glance.
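
For orientation, a minimal sketch of the plain-text rendering described for visualize_hidden_text(s). This is not the code in src.core.text_clean, which this commit page does not show: the names `_HIDDEN_SKETCH` and `sketch_visualize_hidden_text` are hypothetical, and the table holds only three of the glyph/label pairs quoted above.

```python
# Hypothetical sketch, NOT the implementation in src.core.text_clean.
# The table is a tiny assumed subset; the real mapping is much larger.
_HIDDEN_SKETCH = {
    "\u00a0": ("·", "NBSP"),  # no-break space
    "\t": ("→", "TAB"),       # horizontal tab
    "\u200b": ("∅", "ZWSP"),  # zero-width space
}


def sketch_visualize_hidden_text(s: str) -> str:
    """Replace mapped characters with glyph + [LABEL]; show other C0 controls as [U+XXXX]."""
    out = []
    for ch in s:
        if ch in _HIDDEN_SKETCH:
            glyph, label = _HIDDEN_SKETCH[ch]
            out.append(f"{glyph}[{label}]")
        elif ord(ch) < 0x20:
            # C0 control that this sketch has no entry for
            out.append(f"[U+{ord(ch):04X}]")
        else:
            # everything else passes through unchanged
            out.append(ch)
    return "".join(out)


print(sketch_visualize_hidden_text("a\u00a0b\tc"))
```

Because the sketch only maps three characters, LF and CR fall into the [U+XXXX] branch here, whereas the commit message says the real mapping gives them their own entries.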

Mapping covers: ASCII tab/LF/CR, NBSP and related Unicode spaces
(U+00A0, U+202F, U+2009, …), zero-width family (ZWSP/ZWNJ/ZWJ/WJ/BOM/SHY),
bidi marks (LRM/RLM), all smart quotes, en/em dashes, ellipsis,
prime/double-prime, and guillemets. Printable ASCII text passes through
unchanged; HTML output also escapes &, <, and >.
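
A rough sketch of how the HTML side could combine span wrapping, tooltips, and escaping. Everything here is assumed for illustration: the function names, the `hc-*` class names, the colours, and which band each character falls into are not taken from the real module.

```python
import html

# Hypothetical table: character -> (label, colour band). Band assignments are assumptions.
_BANDS_SKETCH = {
    "\u00a0": ("NBSP", "whitespace"),
    "\u200b": ("ZWSP", "special"),
    "\t": ("TAB", "control"),
}


def sketch_visualize_hidden_html(s: str) -> str:
    """Wrap mapped characters in a <span> badge; escape &, <, > everywhere else."""
    parts = []
    for ch in s:
        if ch in _BANDS_SKETCH:
            label, band = _BANDS_SKETCH[ch]
            title = f"U+{ord(ch):04X} {label}"  # shown as the hover tooltip
            parts.append(
                f'<span class="hc hc-{band}" title="{title}">[{label}]</span>'
            )
        else:
            parts.append(html.escape(ch, quote=False))
    return "".join(parts)


def sketch_hidden_char_css() -> str:
    """Return a <style> block with one rule per band (colours are placeholders)."""
    return (
        "<style>"
        ".hc { border-radius: 3px; padding: 0 2px; font-size: 0.85em; }"
        ".hc-whitespace { background: #e3f2fd; }"
        ".hc-special { background: #fff3e0; }"
        ".hc-control { background: #ffebee; }"
        "</style>"
    )


print(sketch_visualize_hidden_html("a<b\u00a0c"))
```

Escaping the pass-through characters matters because the result is rendered with unsafe_allow_html=True, so a raw "<" in cell data would otherwise be parsed as markup.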

GUI wiring (src/gui/pages/2_Text_Cleaner.py)
  The "Examples" changes table now defaults to a hidden-char-rendered
  HTML view: every NBSP/ZWSP/smart-quote/control char is shown with its
  badge and codepoint tooltip. The "Show hidden characters" toggle (on by
  default) can be switched off to fall back to the raw st.dataframe view.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -19,6 +19,8 @@ from src.core.text_clean import (
     PRESETS,
     CleanOptions,
     clean_dataframe,
+    hidden_char_css,
+    visualize_hidden_html,
 )
 
 hide_streamlit_chrome()
@@ -205,9 +207,54 @@ if result.cells_changed:
     )
 
     st.markdown("**Examples (first 25 changes)**")
+    show_hidden = st.toggle(
+        "Show hidden characters (NBSP, ZWSP, smart quotes, control chars…)",
+        value=True,
+        help=(
+            "Highlights characters the cleaner is removing or replacing. "
+            "Hover any badge to see the codepoint and label."
+        ),
+        key="textclean_show_hidden",
+    )
     examples = result.changes.head(25).copy()
     examples["row"] = examples["row"] + 1
-    st.dataframe(examples, use_container_width=True, hide_index=True)
+    if show_hidden:
+        # Inject the badge CSS once, then render an HTML table so the
+        # invisibles in old/new are actually visible to the user.
+        st.markdown(hidden_char_css(), unsafe_allow_html=True)
+        rows_html = []
+        for _, row in examples.iterrows():
+            rows_html.append(
+                "<tr>"
+                f"<td>{row['row']}</td>"
+                f"<td><code>{visualize_hidden_html(str(row['column']))}</code></td>"
+                f"<td>{visualize_hidden_html(str(row['old']))}</td>"
+                f"<td>{visualize_hidden_html(str(row['new']))}</td>"
+                f"<td><code>{row['ops_applied']}</code></td>"
+                "</tr>"
+            )
+        st.markdown(
+            "<table class='hidden-char-table'>"
+            "<thead><tr>"
+            "<th style='text-align:left'>Row</th>"
+            "<th style='text-align:left'>Column</th>"
+            "<th style='text-align:left'>Before</th>"
+            "<th style='text-align:left'>After</th>"
+            "<th style='text-align:left'>Ops applied</th>"
+            "</tr></thead>"
+            f"<tbody>{''.join(rows_html)}</tbody>"
+            "</table>"
+            "<style>"
+            ".hidden-char-table { width: 100%; border-collapse: collapse; }"
+            ".hidden-char-table th, .hidden-char-table td { "
+            " padding: 4px 8px; border-bottom: 1px solid #eee; "
+            " vertical-align: top; }"
+            ".hidden-char-table tbody tr:hover { background: #fafafa; }"
+            "</style>",
+            unsafe_allow_html=True,
+        )
+    else:
+        st.dataframe(examples, use_container_width=True, hide_index=True)
 
     st.markdown("**Cleaned preview (first 10 rows)**")
     st.dataframe(result.cleaned_df.head(10), use_container_width=True)