Two unrelated UX issues addressed in one sweep across all nine tool
pages because they share the same edit surface.
(1) Sticky footer replaces the top + bottom back-link buttons.
Reported: a big white empty footer space at the bottom of every page;
the Back to Home button at the top scrolled out of view on long pages.
New ``render_sticky_footer()`` helper in ``components/_legacy.py``
injects a fixed-position bar at ``bottom: 0`` of the viewport with:
- A border-top so it visually reads as a non-movable bar.
- A semi-transparent background (rgba 0.96 + ``backdrop-filter: blur``)
so content underneath shows through faintly when the user scrolls.
- A styled ``<a href="home">`` anchor (not an ``st.button``) because
Streamlit widgets can't be CSS-positioned reliably — Streamlit owns
the widget's DOM container and re-mounts it on every rerun. A real
anchor sits exactly where the CSS puts it and triggers Streamlit's
URL routing to the home page.
- ``padding-bottom: 3.5rem`` on the main container so the last widget
isn't hidden behind the bar.
Called once per tool page, immediately after ``hide_streamlit_chrome()``
so it renders even on pages that ``st.stop()`` early before any other
content runs. The old top-and-bottom ``back_to_home_link()`` calls are
removed from every tool page; their entry/exit points were dropping
the button when the script short-circuited.
(2) Tool-page headers now localize.
Reported: switching the sidebar language picker to Spanish left the
tool page's title + caption in English. Root cause: every page had
hard-coded ``st.title("✂️ Clean Text")`` / ``st.caption("Trim
whitespace...")`` strings.
Added per-tool ``tools.<id>.page_title`` and
``tools.<id>.page_caption`` keys to ``en.json`` and ``es.json`` for
all nine tools. Routed each page's title/caption call through ``t()``.
Verified: with ``ui_lang=es`` set, the Clean Text page now renders
"✂️ Limpiar texto" + the Spanish caption.
Updated ``tests/gui/test_smoke.py::EXPECTED_SUBSTRINGS`` so the
``es`` column for each tool page asserts the actual Spanish string
(was a duplicate of the English string back when the page bodies
were English-only).
2220 tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reported: the "✕" buttons on the uploaded file list removed files
inconsistently — some clicks took, some didn't.
Two compounding causes:
1. ``key=f"_home_remove_{name}"`` embedded the raw filename in the
Streamlit widget key. Streamlit's widget-identity machinery
normalizes keys differently across reruns when they contain
spaces, dots, brackets, or non-ASCII characters, so a button's
identity could shift between the render where the user clicked
it and the rerun that should have processed the click. The click
was registered, but the post-rerun render produced a new widget
under a new effective key, and the original click was "lost".
2. The handler mutated ``home_uploads`` mid-loop while subsequent
iterations were still creating buttons. ``st.rerun()`` raises
synchronously, but if ANOTHER button's state changed in the same
pass (e.g. a stale click held over from a fast double-tap), the
ordering of state-mutation vs widget-key-update vs rerun could
race.
Fixes:
- Stable widget keys: ``f"_home_remove_{sha1(name)[:10]}"``. The
hash is identifier-safe regardless of spaces / dots / Unicode in
the filename. Verified across "sample with spaces.csv",
"sample.csv", and "日本語.csv" — three sequential Remove clicks
each remove exactly one file with no clicks lost.
- Two-phase capture: the loop collects the target ``to_remove``
filename, finishes rendering every other row at consistent widget
identity, THEN mutates state once and reruns. No more mid-loop
``del`` racing other widgets' click handlers.
- Wider click target: column ratio ``[8, 1]`` (was ``[12, 1]``) and
``use_container_width=True`` on the Remove button so the click
surface fills the entire column. Label changed to "Remove" for
the same reason — "✕" is a thin glyph that compressed the
hit-test region.
2220 tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reported: too much whitespace between widgets, dividers, and headings.
Compact-spacing CSS layer added to ``_HIDE_CHROME_CSS`` (so it applies
on every page that calls ``hide_streamlit_chrome``):
- ``[data-testid="stVerticalBlock"]`` and ``stHorizontalBlock`` gap
trimmed from Streamlit's default ~1rem to 0.5rem.
- Heading margins (h1-h4) tightened — h1/h2/h3 used to leave 1-1.5rem
above; now 0.25-0.5rem.
- ``hr`` (``st.divider()``) drops from 1rem above+below to 0.4rem.
- Markdown paragraphs and captions: 0.25rem bottom margin instead of
the default 1rem.
- Expander summary padding reduced (0.35rem top/bottom).
- File-uploader, button, and metric tiles: trimmed internal padding.
Also slimmed the main-container padding from 1rem top / Streamlit
default bottom (~6rem) to 0.5rem top / 0.75rem bottom.
The existing ``zoom: 0.85`` on ``.stApp`` is kept — the user wanted
*less white space*, not *smaller content*, and dropping zoom would
shrink type alongside everything else.
2220 tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reported: user asked whether we can send Alt+F4 / Ctrl+W to the
browser from JavaScript to force-close a tab.
Honest answer that's now baked into the hint message: NO. Synthesized
keyboard events from page JS only reach DOM event listeners, not the
browser chrome or the OS. There is no flag, API, or trick that lets
a page close a tab the user opened themselves. The page CAN close a
window it opened (window.opener trail) or one whose display-mode is
``standalone`` (Chrome/Edge ``--app=URL``) — that's what
``python -m src.gui`` arranges, and that's the path that actually
closes the window without a manual Ctrl+W.
Improvements landed:
1. ``isStandalone(win)`` detects Chrome --app windows up front
(``matchMedia('(display-mode: standalone)').matches``). In a
regular tab the manual hint surfaces immediately on the
"Close this window" click; in --app mode we only show it if the
close attempt actually fails.
2. ``fallbackToBlank(win)`` navigates the tab to ``about:blank``
via ``location.replace`` (no history pollution) so the user
sees a clean empty tab instead of the farewell overlay frozen
over Streamlit's connection-error banner. They still have to
Ctrl+W the blank tab, but the screen is no longer a misleading
"did it close or not?" mess. Fires 250 ms after a failed close
in --app mode (very rare path), or 1.5 s in a regular tab so
the user has time to read the hint.
3. Hint message rewritten in en + es to explain WHY the close is
blocked (browser security — not something we can override), to
acknowledge the Alt+F4 / Ctrl+W question directly (those don't
work either, for the same reason), and to point at
``python -m src.gui`` as the path that gives a clean auto-close.
2220 tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reported: clicking "Back to Home" from a tool page returned the user
to an empty home — their previously-uploaded files were gone.
Root cause: Streamlit's ``st.file_uploader`` widget state does not
reliably survive ``st.switch_page``. The widget gets unmounted on
navigation, and its ``UploadedFile`` objects don't always re-attach
on remount. The home page was treating the widget's return value as
the source of truth, so after navigation the list was empty.
Fix: introduce a session-state stash keyed by filename
(``home_uploads: dict[str, {"bytes": bytes, "size": int}]``) and
treat it as the source of truth for everything downstream — the
active-file pickup keys for tool pages, the per-file findings
cache, and the rendered file list. The widget is reduced to its
narrow role of capturing NEW uploads, which we merge into the stash
without ever removing.
Per-file remove: a "✕" button next to each filename drops just that
file (and its findings). The widget's own "✕" is bypassed by our
rendering, since trusting it would let the widget's state diverge
from the stash.
Clear-results button is unchanged: it wipes only the analysis cache,
leaving uploaded files intact (per the user's "persistent until
cleared" requirement — removal is per-file via "✕").
Tool-page compatibility: the singular ``home_uploaded_{name,size,
bytes}`` keys still get populated from the first entry in the stash
on every render, so ``pickup_or_upload`` on a tool page keeps
finding the active upload. When the user removes the active file,
those keys are cleared so the next render repopulates from whatever
file is now first.
``_StashedUpload`` is a small duck type ( ``.name``, ``.size``,
``.getvalue()`` ) so ``_run_analysis_on_upload`` accepts entries
restored from the stash without changes.
2220 tests pass. Smoke-verified via AppTest: pre-stashed
``home_uploads`` renders the file list with per-file remove buttons,
and the persistent state survives a simulated navigation round-trip.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two issues, same fix surface.
(1) Reported crash on Back-to-Home:
StreamlitAPIException: Could not find page: app.py.
``st.switch_page("app.py")`` doesn't work under ``st.navigation`` —
the entry script is the nav manager itself and is not a registered
page. The fix needs to pass an ``st.Page`` object whose script
identity matches one registered in the nav.
First-pass attempt (``from src.gui.app import _home_page``) hit a
worse failure: importing ``app.py`` from inside a tool-page render
re-executes the nav setup with the WRONG "main script" context, so
every ``st.Page("pages/N_foo.py", ...)`` call in ``_build_navigation``
fails with "file could not be found".
Extract the home renderer into its own module ``src/gui/_home.py``
which has no top-level Streamlit side effects. Both the nav manager
and the back-link helper import ``_home_page`` from there. The Page
object built at click time has the same callable identity as the one
registered, so ``st.switch_page`` resolves it.
(2) Reported UX: the back button scrolled out of view on long pages.
Add a second ``back_to_home_link(key="_back_to_home_link_bottom")``
call near the footer of every tool page (1-9). The unique key avoids
widget-id collision with the top instance. Coming-Soon stubs get it
unconditionally; Ready tools render it only after a result exists
because the page short-circuits with ``st.stop()`` before then —
when no result is on screen the page is short enough that the top
link is sufficient.
2220 tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reported: clicking "Open Downloads folder" was opening the Documents
folder instead of Downloads. Root cause is the classic Windows
gotcha: when the path contains a space (e.g.
``C:\Users\Michael Dombaugh\Downloads``), Python's
``subprocess.Popen`` packs the ``/select,...`` argument into a single
quoted token, and Explorer's ``/select`` argument parser does NOT
accept that form — it silently falls back to whatever the user's
default Explorer view is (typically Documents).
Resolution paths considered:
- ``shell=True`` with a hand-built command string — works but opens
the door to shell-injection if a file_name ever contained a quote
or special char.
- ``cmd /c start "" explorer /select,...`` — same parsing issue.
- ctypes ShellExecuteW — pulls in a Windows-only dependency.
- **Skip /select. Open the folder directly.** ✓
Going with the last. ``explorer <folder>`` reliably opens the folder
regardless of spaces in the path; the user finds the freshly-saved
file by its name. The previous "highlight the file" nicety wasn't
worth the path-parsing fragility — every user folder on Windows is
``C:\Users\<name>`` and every Windows username can contain a space.
macOS keeps the ``open -R <file>`` reveal-in-Finder path because
macOS argument parsing is sane and that's a strict UX win.
2220 tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reported: clicking "Open Downloads folder" did nothing visible. The
previous implementation called ``os.startfile(folder)`` on Windows,
which is known to silently no-op or open Explorer behind the active
window in some configurations (Streamlit running headless, no
foreground rights inherited by the click handler thread, etc.).
Switch to the more reliable ``explorer /select,<file>`` form:
- Opens Explorer with the just-saved file pre-highlighted instead of
just navigating to the folder — better UX than the old behavior.
- explorer.exe is a real GUI process that's spawned in the user's
session with foreground rights, so it shows up on top.
- Fallback chain on Windows: ``/select`` first, then plain
``explorer <folder>``, then ``os.startfile`` as a last resort.
macOS upgraded the same way: ``open -R <file>`` reveals in Finder
rather than opening the directory.
Linux: no reliable cross-distro reveal, so ``xdg-open <folder>``.
Plus user feedback at the call site:
- On successful dispatch: ``st.toast("Opening <folder>", icon="📂")``
— confirms we tried, in case the window comes up behind the
browser.
- On dispatch failure: ``st.warning`` with the full path the user
can copy/paste into their file manager manually.
2220 tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Switch the download mechanic from "browser <a download> with a data:
URL" to "write the bytes directly to the user's Downloads folder and
show them the exact path". DataTools runs as a local Streamlit app,
so the "server" IS the user's machine — there's no reason to go
through the browser save dialog at all.
Flow:
1. Click "Download <something>" button (rendered as a regular
``st.button``, so no widget-collision issues).
2. Bytes are written to ``Path.home() / "Downloads" / file_name``
(overwriting any same-named file).
3. The page reruns and renders a success caption with the absolute
path the file landed at.
4. An "📂 Open Downloads folder" button appears. Clicking it pops the
OS file manager via ``os.startfile`` (Windows), ``open`` (macOS),
or ``xdg-open`` (Linux).
Why this is better than the previous HTML-data-URL helper:
- Unambiguous about where the file went — user sees the full path,
not "wherever your browser was configured to save".
- The data: URL approach base64-inflated the page payload by 33% and
bloated for large outputs; server-side write is byte-for-byte.
- No more browser-side widget collision class of bug.
- The save action is a real Streamlit button, so the existing widget
semantics (disabled, help tooltip, key isolation) work without
workarounds.
API surface unchanged. New canonical name ``local_download_button``;
``html_download_button`` is kept as a back-compat alias that points
at the same implementation — every existing call site continues to
work without edits.
Tests are protected from polluting the developer's home dir via a
``DATATOOLS_DOWNLOADS_DIR`` env var override returned by the new
``_downloads_dir()`` helper. Smoke verified end-to-end via AppTest:
click → file appears in tmp dir → success banner shows path →
open-folder button renders.
2220 tests pass, 91 skipped, 35 s.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reported crash: clicking "Clean Text" with mojibake.csv (a junk corpus
file that the cleaner ran on but produced zero changes) blew up the
results render with
NameError: name 'show_hidden' is not defined
at the cleaned-preview block. ``show_hidden`` was defined inside
``if result.cells_changed:`` and referenced unconditionally below.
Fix on the page itself: hoist the ``show_hidden = st.toggle(...)``
declaration out of the conditional so it's always in scope for the
downstream cleaned-preview render. One toggle now drives both the
Examples table (which only renders when there are changes) AND the
cleaned preview (which always renders).
Generalized regression net: ``tests/test_junk_corpus_tool_pages.py``.
For nine representative junk files (empty, only_nul, mojibake,
invalid_utf8, utf16_le_no_bom, mismatched_columns, all_nulls,
corrupt_xlsx, single_column) and every Ready/Coming-Soon tool page,
the test:
1. Stashes the junk bytes as the home upload via session_state.
2. Runs the page through AppTest, asserts ``app.exception`` is empty.
3. If the page exposes a deterministic primary-action button label,
clicks it and asserts no exception on the post-click render.
Pages that catch a bad file at read time and short-circuit via
``st.error`` + ``st.stop`` are correctly skipped from the
primary-action half (the button isn't rendered). A genuine crash
shows up as ``app.exception`` carrying a Python traceback — exactly
what the user reported, exactly what we now catch.
162 tests collected, 102 passed, 60 skipped. 4 seconds.
Full suite: 2220 passed, 91 skipped, 35 s.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reported: uploading 13_non_latin_scripts.csv made the home page bubble
a ``pandas.errors.EmptyDataError`` traceback up through the page
chrome instead of surfacing as a per-file error. In a multi-file
analysis run that kills every other file's results too, which is
worse than the symptom itself.
Wrap ``_run_analysis_on_upload`` in proper error handling:
- Empty bytes ``getvalue() == b""`` short-circuits with a synthetic
error Finding telling the user the upload was zero-byte and to
re-upload.
- Empty ``repair.repaired_bytes`` (file was all NULs / BOM / stripped
to nothing) likewise surfaces as a synthetic Finding rather than
reaching pd.read_csv.
- ``pd.errors.EmptyDataError`` from pandas is caught and rendered as
a Finding that names the file, its byte size, and suggests opening
it in a text editor to verify the header row matches the data row
delimiter.
- Any other exception during read/analyze is caught and surfaces as
a Finding via ``format_for_user`` so the user gets a clean message,
not a Python traceback.
Each file in a multi-file run now stands alone: a bad file produces
one red banner in its own card, every other file analyzes normally.
The 13_non_latin_scripts.csv corpus file is 249 bytes of valid UTF-8
on disk and parses cleanly under the same code path locally — the
user's specific symptom is likely a zero-byte upload (browser /
network / Python 3.14 + Streamlit edge case). The new ``empty_upload``
finding will name the bytes count so they can confirm.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There is no JavaScript override for browser tab-close security:
``window.close()`` only succeeds on windows JS opened (Chrome --app
windows qualify; a regular browser tab does not). What we can do is
make the --app path easier to hit and the failure case more
actionable.
Three changes:
1. ``src/gui/__main__.py`` — extend browser detection. PATH lookup
now also looks for ``msedge`` / ``microsoft-edge``; Windows install
candidates include the Edge install path; macOS candidates include
Edge and Chromium. Edge is Chromium-based, supports ``--app``, and
ships on every Windows 10+ machine — so users without Chrome no
longer fall through to the regular browser tab. When the fallback
IS hit, print a warning to stderr explaining why Close-from-page
will require Ctrl+W. Renamed ``_find_chrome`` to
``_find_app_browser`` to reflect the broader scope.
2. ``_FAREWELL_SCRIPT_TEMPLATE`` in ``components/_legacy.py`` —
factor close attempts into a ``tryClose`` helper that runs three
escalating tries: standard ``win.close()``, the
``win.open('', '_self')`` history-rewrite trick (no-op in modern
Chrome but free), and ``win.top.close()``. Auto-close on paint AND
the manual button now both call this helper. Skip the manual hint
if the close eventually succeeded between the click and the 250 ms
timeout.
3. ``quit.close_hint`` in en/es i18n packs — rewrite the message to
tell the user honestly that this is a browser security restriction,
tell them the Ctrl+W keystroke that works, and point them at
``python -m src.gui`` for the auto-closing app-mode experience.
2008 tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reported symptom: only the FIRST download button in a multi-button
row pops the browser save dialog. The second and third do nothing on
click. Affects every tool page that exposes (cleaned + audit + config)
downloads.
Root cause is ``st.download_button`` itself — when several render in
the same script pass, the click-to-bytes wiring on the browser side
mis-routes and only one button's data is actually exposed. Explicit
``key`` arguments don't fix it; ``use_container_width=True`` doesn't
help either; we confirmed this in the Text Cleaner reverts.
Replace the widget with a real ``<a download="file" href="data:...">``
anchor rendered via ``st.markdown(..., unsafe_allow_html=True)``.
Bypasses Streamlit's widget machinery entirely; behaves identically to
a native browser download. Side benefit: clicking it does NOT trigger
a script rerun, so other in-flight UI state survives.
New helper ``html_download_button`` lives in
``src/gui/components/_legacy.py`` (exported from ``components``). API:
html_download_button(
label, data,
*, file_name, mime="application/octet-stream",
disabled=False, help=None, use_container_width=True,
)
Translation pattern applied across every tool page (and shared
``results_summary`` / ``config_panel`` widgets in ``_legacy.py``):
- ``st.download_button(`` -> ``html_download_button(``
- ``data=foo_bytes`` kwarg -> positional second arg
- ``key="..."`` -> dropped (helper has no widget identity)
- ``use_container_width=True`` -> dropped (default)
- ``disabled=`` and ``help=`` pass through unchanged
- Pre-computed byte buffers kept where they were
Total: 17 sites replaced (3 in Text Cleaner, 3 in Format
Standardizer, 3 in Fix Missing Values, 3 in Map Columns, 3 in
Automated Workflows, 2 in Find Duplicates page + 4 in shared
_legacy.py widgets used by Find Duplicates).
Caveat: data: URLs balloon by 33% (base64). Fine for tool output
sizes we ship; if a future result topped a few hundred MB we'd want a
Blob-URL fallback.
The marketing demo at src/gui/app_demo.py keeps its single
st.download_button — single button, no collision, no need to switch.
2008 tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Apply the Clean Text page's post-run UX pattern to every other Ready
tool page (Find Duplicates, Standardize Formats, Fix Missing Values,
Map Columns, Automated Workflows) for consistency and ease of use.
Per page:
1. Preview wrapped in ``st.expander(f"Preview: {filename}",
expanded=not _has_result)``. Open before a result exists, folded
afterwards.
2. Options / configuration controls wrapped in
``st.expander("Options", expanded=not _has_result)``. Inner
sub-expanders preserved (Streamlit 1.36+ supports nesting).
3. After the primary action stashes the result, set a one-shot
``_<tool>_scroll_to_results`` flag in session state and call
``st.rerun()`` so the preview + options expanders see the new
state on the next pass and collapse themselves.
4. ``<div id="<tool>-results-anchor" style="height:1px">`` placed
immediately before the Results subheader.
5. End-of-page: pop the scroll flag and inject a tiny
``streamlit.components.v1.html`` iframe whose ``<script>`` calls
``scrollIntoView`` on the parent document's anchor. One-shot, so
unrelated reruns (toggling Show-hidden, etc.) don't yank the
viewport.
6. Download buttons hardened against the multi-button Streamlit
footgun: byte buffers pre-computed outside the column scopes,
explicit unique ``key="<tool>_dl_<purpose>"`` per button,
``use_container_width=True``, and previously-conditional buttons
now render unconditionally with ``disabled=True`` + a help
tooltip when the underlying data is empty so layout stays steady.
Per-page judgment calls (already noted in agent reports):
- Find Duplicates: sheet picker and delimiter selector kept OUTSIDE
expanders (the user still needs to see them when a file fails to
parse).
- Fix Missing Values: missingness profile wrapped INSIDE the Options
expander together with Strategy — the Results section already
shows a before/after missingness comparison that supersedes the
static input profile.
- Map Columns: all three subsections (Target schema, Strategy,
Mapping) wrapped under one outer Options expander, matching the
Text Cleaner pattern.
- Automated Workflows: inner "Recommended tool order" expander stays
nested inside the outer Options wrap; Run button stays outside
Options so the user can re-run after tweaking the (collapsed)
editor.
2008 tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The farewell overlay already attempted ``window.top.close()`` after a
Close click — but browsers only honour that for tabs that JS opened
(Chrome --app windows qualify; a regular browser tab does not). For
users whose Chrome wasn't auto-detected and who fall back to
``webbrowser.open``, the overlay stays put and they had no in-page
way to close.
Add to the overlay HTML:
- A "Close this window" button (uses the user-gesture path, which has
slightly looser browser rules than auto-close).
- A hidden hint paragraph that reveals itself 250 ms after the
button is clicked IF the window is still here, telling the user to
press Ctrl+W (⌘W on Mac).
Wired through the existing _farewell_script template + ``_js_html_safe``
escaping so neither label can break out of the JS string literal.
New i18n keys (en + es): ``quit.close_window_button`` and
``quit.close_hint``.
The existing auto-close attempt remains — Chrome --app users still get
their window closed without touching the button.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Only "Download cleaned CSV" was working; "Download changes audit" and
"Download config JSON" did nothing on click.
The symptom is the classic Streamlit footgun for multiple
``st.download_button`` widgets in adjacent columns: without an explicit
``key`` argument the auto-derived widget IDs can collide, especially
when one button is conditionally rendered, and only the first button
in source order actually fires on click. Same goes for unstable
``data`` bytes recomputed inside the ``with col:`` block — the widget
identity can drift between renders.
Robustness pattern applied:
- Compute all three byte buffers up front, outside the columns, so the
``data`` parameter is the same object across reruns.
- Pass an explicit unique ``key`` ("textclean_dl_cleaned" /
"textclean_dl_changes" / "textclean_dl_config") to each button.
- Render the changes button unconditionally with ``disabled=True`` and
a help tooltip when ``result.changes.empty`` — instead of hiding it.
Layout stays steady and the empty case is self-explanatory.
- ``use_container_width=True`` so the three buttons size identically
inside their columns.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After clicking Clean Text the user was left at the bottom of the
script with the Options block still expanded and no viewport movement
— they had to scroll to find the Results.
- Wrap the whole Options block in an outer ``st.expander("Options",
expanded=not _has_result)``. After the Clean Text rerun, both
Preview AND Options collapse, leaving the primary action button +
Results as the only prominent elements above the fold. The inner
Advanced-options expander is preserved as a nested expander
(supported in Streamlit 1.36+; this repo pins 1.35+).
- Add a 1px anchor div ``#textclean-results-anchor`` immediately
before the Results subheader.
- On Clean Text click, set a one-shot ``_textclean_scroll_to_results``
flag in session state; on the next render, pop the flag and inject
a tiny ``st.components.v1.html`` iframe whose ``<script>`` calls
``scrollIntoView`` on the parent document's anchor. One-shot so
re-renders triggered by other widgets (Show-hidden toggle, etc.)
don't jerk the viewport back to the top of Results.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two small UX fixes on the Clean Text page:
1. The input preview is now wrapped in an ``st.expander`` whose
default-expanded state is ``not has_result``. Clicking the
"Clean Text" primary button stashes the result and calls
``st.rerun()`` so the next pass sees the result in session state
and the expander folds — the Results section becomes the primary
visual focus. User can re-expand manually to re-inspect the source.
2. The Examples (changes audit) table's Before/After columns were
calling ``visualize_hidden_html`` WITHOUT ``mark_outer_whitespace``,
so leading/trailing whitespace — which is exactly what the cleaner
most often removes — was invisible. Pass ``mark_outer_whitespace=True``
to match the input-preview rendering. Column-name cell now mirrors
that flag too.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Multi-file workflow: a user uploads several files on Home, clicks
"Open <Tool>" on one file's findings, lands on a tool page. The
sidebar lets them get back to Home, but a top-of-page back affordance
is more discoverable and keeps the hand in the same screen region as
the upload list they're working through.
- New ``back_to_home_link()`` helper in components/_legacy.py renders
a secondary button that calls ``st.switch_page("app.py")`` — under
``st.navigation`` that routes to the default (Home) page.
- Wired into every tool page (1-9) directly after
``hide_streamlit_chrome()`` and BEFORE the license gate so a Lite
user who lands on a locked tool can navigate away without paying.
- New i18n key ``nav.back_to_home`` ("← Back to Home" /
"← Volver al inicio") in en/es packs.
2008 tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Restoring ``render_findings_panel`` on the home page. Previous commit
(c575efd) inlined a flat renderer that dropped the per-tool grouping
and the "Open <Tool>" jump links — that was an over-correction. The
user only wanted the bottom tool-card grid gone (already removed in
ff2eaeb). The grouping inside the findings panel is what lets a user
land on a specific finding and one-click into the cleaner that fixes
it; without it they'd have to guess which sidebar entry to open.
Tool-card grid stays removed. Sidebar nav is unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The home page was calling ``render_findings_panel``, which groups
findings by tool into expanders and renders an "Open <Tool>" page
link under each. After uploading a file, the user still saw a tool
list (just under a different shape) — defeating the earlier cleanup
that removed the tool-cards grid.
Inline a flat renderer in ``_home_page``: per uploaded file, render
the filename header + severity summary + a flat list of findings via
``_render_one_finding`` directly. No expanders, no tool names as
section headers, no per-tool page-link buttons. Tool discovery
happens in the sidebar.
``render_findings_panel`` itself is unchanged — it still groups by
tool and remains tested via the findings-panel harness, but is no
longer used on the home page.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The sidebar nav was passing ``tool.name`` (the registry's English
field) to ``st.Page``, so the tool entries stayed in English even
after the user picked Spanish from the language selector. Section
headers were already i18n-driven; tool entries were not.
Switch to ``tool_name(tool_id)`` which routes through ``t(...)`` and
picks up the active language from session state. Verified: with
``ui_lang=es`` the sidebar renders Buscar duplicados / Limpiar texto /
Mapear columnas / etc. instead of the English fallbacks.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Close is now a direct shutdown trigger: visiting the Close page (the
sidebar entry) fires shutdown_app() immediately — no confirm step, no
intermediate body. The farewell overlay paints and os._exit(0) lands
~1s later from a daemon thread.
Layout: Close moved into its own bottom-of-sidebar section so the
destructive action is visually separated from Account/Activate.
- New shutdown_app() in components/_legacy.py replaces quit_button.
os._exit thread is skipped when "pytest" is in sys.modules so the
test suite doesn't suicide on rendering 99_Close.
- pages/99_Close.py shrinks to set_page_config + chrome + shutdown_app.
- app.py nav grows a new "Close" section header (new
nav.section_close key in en/es packs) pinned at the bottom of the
navigation dict.
Tests updated:
- TestQuitButtonRenders → TestClosePageShutsDownImmediately.
Assert the shutdown caption renders + no confirm button exists.
- test_smoke EXPECTED_SUBSTRINGS["99_Close"] now pins
"Shutting down" / "Cerrando" (the visible page body) instead of
the removed page title.
2008 tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Home is now upload + analysis only. The page accepts multiple files in
one go, analyzes each independently, and renders findings grouped by
filename in bordered containers. The 3-section tool-card grid is gone —
discovery happens via the sidebar now.
Mechanics:
- file_uploader uses accept_multiple_files=True. Each file's findings
cache in session_state["home_findings_by_file"] keyed by filename so
removing a file via Streamlit's "x" button drops its findings too,
and re-clicking Run only re-analyzes pending files.
- The first uploaded file is mirrored into the singular
home_uploaded_{name,bytes,size} keys so tool pages continue to pick
up an "active" upload through pickup_or_upload — no tool-page changes.
- New i18n keys: upload.intro_multi, upload.uploader_label_multi,
upload.clear_results, upload.empty_state. upload.heading text is
updated to "Upload one or more files to start" (EN + ES).
Dropped tests pinning the tool grid:
- TestHomeToolGridLocalization (test_chrome.py)
- test_home_tool_card_uses_es_name (test_smoke.py)
- TestLiteHomeGridBadges (test_lite_tier.py — locked-card lock-badge
assertions; locking is still enforced per-tool-page via
require_feature_or_render_upgrade)
2009 tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Home is now the only entry point: the "Run analysis" button on the
upload section IS the review step (findings render inline via
render_findings_panel). Tool pages no longer gate on a passed
normalization — running the analyzer is sufficient context.
Removed:
- src/gui/pages/0_Review.py
- src/gui/components/gate.py (re-export seam)
- require_normalization_gate() in src/gui/components/_legacy.py
- "review" section enum in tools_registry.py
- Data Review entry in app.py navigation
- require_normalization_gate() calls + imports in all nine tool pages
- tests/gui/test_gate.py (whole file)
- TestReviewWorkflow in tests/gui/test_workflows.py
- 0_Review entry in tests/gui/test_smoke.py PAGE_SLUGS
- stash_upload's normalization_result+normalization_for stashing
- stash_upload_without_gate (was the gate's negative-path helper)
2017 tests pass (16 retired with the gate flow).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When a user lands on Review without an upload, show a file uploader
on the page itself and auto-run the analyzer once a file is picked,
rather than bouncing them to the home page with a "Back to home"
button.
Auto-analyze is the right default here: the user is already on the
Review page, so they've implicitly committed to a scan. Stashing the
bytes in the same session-state keys the home page uses keeps the
rest of the flow (encoding picker, gate, tool pages) unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sidebar nav now groups tools under Data Review / Data Cleaners /
Transformations / Automations via st.navigation, replacing the flat
auto-discovered list. Tool display names switch to action-first
phrasing (Find Duplicates, Fix Missing Values, Find Unusual Values,
Standardize Formats, Clean Text, Quality Check, Map Columns, Combine
Files, Automated Workflows) in EN + ES packs and on each page's H1.
The Data Cleaners section follows the requested order: Missing
Values → Outliers → Text Cleaner → Format Standardizer → Deduplicator
→ Quality Check. (Text Cleaner kept inside cleaners since the request
didn't list it but the tool still ships.) Registry now carries a
section field; helpers added: tools_in_section(), section_label().
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New operator CLI at src/admin_cli.py: mint, list, revoke, ping —
talks to the server's /internal/* endpoints over a local SSH tunnel.
Stdlib-only on the desktop side (urllib + typer), no new top-level
deps. Auth via $DATATOOLS_ADMIN_TOKEN.
scripts/generate_license.py is now annotated as a break-glass tool
for when the server is unreachable — routine work goes through the
new CLI so the authoritative `licenses` row is created.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two coupled hardening upgrades.
1. Asymmetric signatures (HMAC → Ed25519)
The previous HMAC scheme used a symmetric secret that any motivated
reverse engineer could pull out of the shipped binary and use to
mint blobs for any tier / name / email. With Ed25519, the binary
ships only the public verification key; the signing key never
leaves the seller's environment, so binary compromise no longer
yields forgery.
- src/license/crypto.py rewritten around
cryptography.hazmat.primitives.asymmetric.ed25519. Same public
API surface (sign/verify/encode_blob/decode_blob), same canonical
JSON encoding — drop-in for the manager / cli / GUI layers.
- DATATOOLS_LICENSE_PRIVKEY (seller-side) and
DATATOOLS_LICENSE_PUBKEY (build-time) env vars supply the keys;
the in-source dev keypair (src/license/_dev_keypair.py)
deterministically derives from a seed phrase for repro builds and
tests.
- Blob prefix bumped DTLIC1: → DTLIC2:. Decoding a DTLIC1 blob
surfaces a clear "old format" error rather than a confusing
signature mismatch.
- scripts/generate_keypair.py mints fresh production keypairs for
the seller (run once, stash the private key offline). Adds
cryptography>=41,<46 to requirements.txt (was an undeclared
transitive dep).
2. Production-safe tripwire
assert_production_safe() refuses to boot a frozen / shipped build
when either:
- DATATOOLS_DEV_MODE=1 is set (would unconditionally bypass every
license check — fine in source/test but catastrophic in a buyer
install).
- The active verification key is still the embedded dev key (the
build pipeline forgot to set DATATOOLS_LICENSE_PUBKEY).
No-op in source / pytest runs (sys.frozen is unset) so test
fixtures and dev workflows keep working without ceremony. Called
from src/cli_license_guard.guard() and from hide_streamlit_chrome
— so it fires on every CLI invocation and every GUI page load.
Tests: 49 license-layer unit tests (was 40); added Ed25519
wrong-key rejection, dev-keypair seed pin, blob v2 prefix, v1
rejection with clear message, and four production-safe scenarios
(no-op in source, fires on DEV_MODE in frozen, fires on dev key in
frozen, passes in frozen with prod pubkey). Total: 2024 → 2033.
Docs (REQUIREMENTS §17a, DEVELOPER licensing recipe, DECISIONS
§9b + decision log) updated with the new threat-model write-up,
key-storage workflow, and tripwire behaviour.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two coupled changes:
1. Lite tier
- New Tier.LITE in src/license/schema.py.
- FEATURES_BY_TIER[Tier.LITE] = {Deduplicator, Text Cleaner,
Format Standardizer}. The three universally-useful tools that
cover the most common bookkeeping / RevOps / Klaviyo prep
workflows. Other six tools require Core.
- i18n: license.tier_lite, license.feature_locked_title,
license.feature_locked_body, license.upgrade_link,
license.status_locked (en + es).
- Per-tool feature gate at every GUI tool page
(require_feature_or_render_upgrade) and every tool CLI
(guard(feature=...)). A locked tool renders an upgrade
prompt + Manage-license button (GUI) or exits with code 2
(CLI).
- Home grid: tool cards the user's tier doesn't unlock get a
red 🔒 Locked badge in place of green Ready.
2. Trial removed
- Activation form's "Start 1-year trial" button removed.
- license_cli's `trial` subcommand removed.
- activation.trial_button / activation.trial_help i18n keys
dropped (pack parity test stays green).
- Tier.TRIAL stays in the enum (back-compat with any field-
tested trial licenses); LicenseManager._mint stays internal
for tests and the seller's key generator.
- Decision logged in DECISIONS §9b: a 1-year all-features
trial undercuts paid Lite; paid-only keeps tier economics
clean.
Tests (+29 net): +17 Lite-tier unit/guard tests + 13 Lite-tier
GUI tests + 1 trial-absent assertion - 2 trial CLI tests - 1
trial GUI button test. Total: 1995 → 2024.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A complete offline licensing layer (no internet at any step):
Core
- src/license/ — schema (License, Tier, FeatureFlag), HMAC crypto,
JSON storage, LicenseManager singleton with activate/renew/
deactivate/issue_trial. Tier-scaffolded so future SKUs can carve
per-tool feature sets without consumer-code edits.
- scripts/generate_license.py — creator-only key generator. Mints a
DTLIC1: blob the buyer pastes into the activation page.
GUI
- New activation form component (src/gui/components/activation.py).
- hide_streamlit_chrome() now inline-renders the activation form when
no valid license is present (every page short-circuits to the form
until activated).
- Sidebar shows tier + days remaining; renewal warning under 30 days.
- New pages/_Activate.py for revisiting the form after activation.
CLI
- src/license_cli.py — activate / renew / status / trial / deactivate
commands. Exempt from the guard.
- src/cli_license_guard.py — drop-in guard call added to every tool
CLI's main(). Lets --help through; respects DATATOOLS_DEV_MODE.
i18n
- New activation.* and license.* keys in en.json + es.json
(page title, form labels, status badges, renewal warnings, error
messages). Pack parity test stays green.
Test infrastructure
- tests/conftest.py autouse fixture sets DATATOOLS_DEV_MODE=1 so the
existing 1916 tests continue to pass.
- isolated_license_path / activated_license_manager /
unactivated_license_manager fixtures for tests that want to drive
the real check.
Tests (+79)
- tests/test_license.py (40): schema, crypto roundtrip, blob
encode/decode, tier→feature mapping, activation flow, name/email
mismatch rejection, tamper detection, expiration, renewal,
dev-mode bypass.
- tests/test_license_cli.py (26): every license_cli command +
subprocess tests confirming every tool CLI refuses to run without
a license, --help always works, DEV_MODE bypasses.
- tests/gui/test_activation.py (13): gate blocks without license,
passes with trial, activation form submission unlocks the gate,
sidebar status, renewal warning, i18n.
Total: 1916 → 1995 tests. All pass under the strict warning filter.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three real issues surfaced when running the suite with strict warnings:
1. src/core/format_standardize.py: ``datetime.utcfromtimestamp`` is
deprecated in CPython 3.12 and slated for removal. Replace with
``datetime.fromtimestamp(ts, tz=timezone.utc)``. Output for the
date-only format codes we use is byte-identical.
2. src/core/io.py: ``list_sheets`` leaked the openpyxl file handle by
returning ``xl.sheet_names`` from an unclosed ``pd.ExcelFile``.
Wrap in a ``with`` block so the FD closes deterministically — also
prevents the Windows-only "file is locked" repro path.
3. tests/test_corpus.py: ``TestXlsxPollution.workbook`` fixture
returned the bare ``pd.ExcelFile`` instead of yielding + closing.
Convert to a yield-and-finally pattern so the class-scoped handle
isn't leaked across the whole test file.
Also harden pytest.ini's warning policy: escalate
``ResourceWarning`` from ``src`` to an error, alongside the existing
``DeprecationWarning`` rule. Third-party warnings stay filtered — we
can't fix pandas/openpyxl/streamlit churn from here.
All 1916 tests pass under the strict filter; full and split runs
(``pytest``, ``pytest -m 'not gui'``, ``pytest -m gui``) all clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three follow-on wins from the audit, each with shape-pinning tests.
1. Dedup blocking
- Exact-only strategies (every column EXACT @ 100 — covers strong-
key dedup like email/phone, the drop-duplicates fallback, and
explicit "match on this exact column" calls) now route through
an O(n) groupby fast path. Lossless; no API change required.
Measured: 10k-row email-exact dedup → 73 ms (was ~30 minutes
via the O(n²) pair compare).
- Fuzzy strategies still pair-compare, with opt-in prefix blocking
via deduplicate(..., blocking_columns=[...], blocking_prefix_len=1).
Measured: 5k-row fuzzy-name → 25.6s with blocking vs 179s
without (7x). Trade-off: cross-block matches missed.
2. Column-parallel standardize
- StandardizeOptions.parallel_columns (default 1) lands a
ThreadPoolExecutor over the column loop. Output order and
audit-record order are preserved deterministically via a merge
step keyed off column_types order. Honest doc: under CPython
3.12's GIL the win is roughly neutral (phonenumbers/dateutil
hold the GIL); the API is ready for free-threaded Python 3.13+.
3. Lazy-copy in missing / column_mapper
- _standardize_sentinels now builds per-column changes in a dict
and only materialises the output frame when at least one column
actually changed. On a clean 1 GB file this skips a 1 GB
allocation.
- handle_missing carries an out_is_owned flag, copying on demand
before any mutating step. No-op runs return the input frame.
- map_columns drops the unconditional upfront df.copy(); rename
and drop both return fresh frames already, and schema-add /
coerce trigger _ensure_owned() lazily.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
REQUIREMENTS §10 reflects the post-optimisation numbers and the
known O(n²) dedup match step (flagged for a future blocking pass).
en/es upload-limit copy and uploader help now say 1.5 GB.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five targeted wins driven by an end-to-end audit, with shape-pinning
regression tests so reverts are loud:
- format_standardize: fuse the dispatcher loop into one pass — was
calling Series.tolist() three times per typed column and materialising
an intermediate triples list; now one tolist, one walk. On a
synthetic 1M-row phone+email frame this measures ~2.7M rows/sec
(vs. the previous 150k/sec doc target).
- dedup: wrap normalizers in a per-call lru_cache so repeat phones /
emails / addresses skip re-parsing. phonenumbers.parse is the
expensive call; ~2–5x faster on the normalisation step for realistic
workloads.
- analyze: _detect_near_duplicates no longer copies the full input
frame; builds only the normalised string columns via a dict and
references non-string columns by view. Skips the redundant
astype(str) when a column is already pandas string dtype.
- text_clean: hoist _build_pipeline out of the per-cell loop and add a
per-call string cache so 100k repeats of "Active" only run the
pipeline once. ~1M rows/sec on repetition-heavy columns.
- io.repair_bytes: the non-UTF-8 smart-quote fold path used a
Python-level zip walk over the entire decoded string to count
replacements — replaced with sum(text.count(c) ...) which runs in
C at ~GB/s. Was a latent ~100s on a 1 GB cp1252 file; now <1s.
Updates REQUIREMENTS §10 with measured numbers and bumps the buyer-
facing upload limit from 1 GB to 1.5 GB across the i18n packs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Introduces ``src/i18n`` with a tiny JSON-backed t() lookup, an in-session
language preference, and a sidebar selector wired through
``hide_streamlit_chrome`` so every page picks up the same picker. Covers
home, tool cards, findings panel, gate, shutdown, and pickup banner
strings. Tests pin pack parity and the farewell-overlay JS escape so
future packs can't silently regress.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stand up the seamless-download path for non-technical buyers:
* .github/workflows/build.yml — matrix CI (mac/win/linux) that builds
PyInstaller bundles and packages them per platform on tag push,
attaching the resulting installers to a GitHub Release.
* build/installer.iss — Inno Setup script for the Windows installer
(per-user install, optional desktop shortcut, runs on finish).
* build/macos/build_dmg.sh — wraps DataTools.app into a .dmg with a
drag-to-/Applications layout.
* build/appimage/{AppRun,datatools.desktop,build.sh} — AppImage recipe.
* src/__init__.py — single source of truth for __version__; the spec
reads it (was hardcoded), CI passes it through to all packagers.
Buyer download path now lives in the top-level README. Per-build
README documents the Phase 2 step (signing/notarization) that needs
the owner's Apple Developer + Windows code-signing credentials —
those are intentionally not in CI yet because they require setup
outside this repo.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Update the Close page intro, the shutdown overlay, and the toast so
they all read "you can close this window" — clearer for users running
the app in a dedicated browser window rather than a tab.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the data:-URL navigation (blocked by Chrome since v60 for
top-frame navigation) with a direct DOM-append of a full-screen
overlay onto the parent document. Uses z-index 2147483647 so it sits
above Streamlit's connection-error banner when the websocket drops.
Note: still doesn't fully suppress the connection-error banner in
testing — the next iteration will render the overlay through
Streamlit's own page rather than via a component iframe.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move the shutdown control out of the inline sidebar widget and into
its own page (pages/99_Close.py), so it appears in the sidebar nav
alongside the tool pages. An explicit confirm button on the page
prevents accidental nav clicks from killing a live session.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signalling the process with SIGTERM/SIGINT didn't reliably shut Streamlit
down — its tornado/asyncio loop swallowed or deferred the signal, so the
browser saw the websocket drop ("Connection error") while the python
process kept running. Replace the signal with a daemon-thread
``os._exit(0)`` after a short delay so the current rerun can paint the
"shutting down" message before the process is hard-killed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The footer placement was easy to miss (below all tool cards) and only
rendered on the home page. Hook the button into hide_streamlit_chrome()
so every page that hides default chrome — home + all 9 tool pages — gets
the Quit button at the bottom of the sidebar without per-page edits.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The chrome-hiding CSS was removing the Streamlit header wholesale,
which also took the sidebar's expand chevron with it — a collapsed
sidebar became unreopenable. Make the header transparent instead and
explicitly preserve the sidebar collapsed-control.
Also add a Quit button in the app footer that signals the Streamlit
server (SIGTERM, falling back to SIGINT) so closing the GUI returns
the shell prompt cleanly instead of leaving Python hung.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes ~17 high-value international gaps surfaced by parallel review.
Adds 93 regression tests; full project suite now 1323 / 0 / 17 (passed
/ failed / xfailed).
DATES
- Adds Portuguese, Italian, Dutch, Russian month dictionaries to the
opt-in ``month_locales`` set (now: en, fr, de, es, pt, it, nl, ru).
- Adds localized weekday recognition for those locales — "Lundi",
"Montag", "lunedì", "понедельник", etc. all strip cleanly before
format matching.
- New CJK separator normalization: Japanese ``2024年01月15日`` and
fullwidth digits ``2024/01/15`` fold to ASCII before parsing.
- New named-timezone resolution: EST/PST/JST/CET/IST/GMT/etc. map to
fixed UTC offsets via ``_NAMED_TZ_OFFSETS`` so the trailing TZ
doesn't block format matching.
- New ISO 8601 extended formats: week date (``2024-W03-1``) and
ordinal date (``2024-015``), plus RFC 2822 mail-header form
(``Mon, 15 Jan 2024 10:30:00``).
- New ``two_digit_year_cutoff`` parameter on ``standardize_date()`` —
defaults to Python's stdlib 69; lower it for birth-year columns
where most subjects were born ≤ 1999.
NAMES
- Particles set extended with Arabic patronymic markers (bin, ibn,
bint, abu, abd, al, al-, el-) and Hebrew (ben, bat, ha, ha-).
- Title set extended with German (Herr, Frau), French (M., Mme,
Mlle), Spanish (Sr., Sra., Srta., Don, Doña), Italian (Sig., Sig.ra,
Dott.), Portuguese.
- Acronym map extended with international academic credentials
(Dipl, Ing, Mag, Habil, MSc, BSc, LLB, LLM).
- New East Asian honorific suffix handler: ``Tanaka-san``,
``Lee-ssi``, ``Park-nim`` keep the suffix lowercase after the
hyphen instead of being title-cased into ``Tanaka-San``.
- Hyphenated-segment handler now keeps Arabic prefixes ``al-`` /
``el-`` lowercase per Arabic transliteration convention.
- New ``family_first`` parameter on ``standardize_name()`` and matching
``name_family_first`` field on ``StandardizeOptions`` — set
per-column for East Asian data to skip Western comma-format reversal
(``Kim, Min-jae`` stays ``Kim, …`` instead of becoming ``Min-jae Kim``).
CURRENCY
- Symbol map extended: ฿(THB), ₫(VND), ₮(MNT), ₴(UAH), ₦(NGN),
₱(PHP), ₲(PYG), ﷼(SAR), ₨(PKR), ₵(GHS) — covers SE Asia, Africa,
Eastern Europe, Latin America gaps.
- ISO 4217 code list extended from 23 to ~50: SAR, AED, QAR, KWD,
BHD, OMR, ARS, CLP, COP, EGP, IDR, MYR, PHP, THB, VND, NGN, GHS,
KES, HUF, CZK, RON, UAH, KZT, etc.
EMAIL
- New BIDI / RTL override stripping (``standardize_email``):
U+202A-U+202E and U+2066-U+2069 stripped from every email. These
are a known phishing vector — ``alice@example.com`` displays as
``alice@elpmaxe.com`` to RTL-aware renderers.
ADDRESS
- Canadian provinces: 13 codes + names → 2-letter (Ontario → ON).
- UK postcode pattern recognition (``SW1A 2AA`` shape).
- Australian states: 8 codes + names (NSW, VIC, QLD, … + full names).
- German Bundesland: 16 codes + names (Bayern → BY, etc.).
- International PO Box variants: ``Postfach`` (DE), ``Boîte postale``
(FR), ``Apartado`` (ES), ``Casella postale`` (IT), ``Caixa postal``
(PT) — all fold to canonical ``PO Box``.
- ``_INTL_STATE_CODES`` now combines US/CA/AU/DE codes; the position
check that preserves state codes regardless of input case applies
to all four jurisdictions.
- ``_is_state_code_position`` postal pattern broadened to recognize
US ZIP, AU 4-digit, CA first half, and UK outward code.
CONSTANTS
- ``src/core/_constants.py`` gains: ``CA_PROVINCE_CODES`` /
``CA_PROVINCE_NAMES``, ``AU_STATE_CODES`` / ``AU_STATE_NAMES``,
``DE_STATE_CODES`` / ``DE_STATE_NAMES``, ``POSTAL_PATTERNS``
(us/ca/uk/de/au/fr), ``INTL_PO_BOX_PATTERNS`` (per-language regex),
``INTL_STREET_SUFFIXES`` (de/fr/es/it/uk dictionaries — ready for
use when address takes a `country_hint` parameter in a future pass).
DOCS
- TECHNICAL.md §11.3 domain table updated with the new handling per
domain plus a new "International coverage" sub-section listing the
supported locales / symbols / jurisdictions.
DEFERRED (out of scope or rare)
- Alternative calendars (Japanese era, Hijri, Hebrew, Buddhist) —
corpus § 3.5 marks out of scope.
- Persian/Arabic-Indic digit conversion — rare in tabular data.
- Trailing-minus RTL currency convention.
- Punycode ↔ Unicode IDN normalization.
- Mixed-country phone column auto-detection (user can override
``default_region`` per column).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Introduces src/core/errors.py with a small structured error hierarchy
that every public entry point now uses. Each error carries the
context a user needs to fix it and the context a maintainer needs to
trace it.
The hierarchy:
DataToolsError (base — formats path, column, operation, suggestion)
InputValidationError (extends ValueError — bad arg / wrong type)
ConfigError (extends ValueError — bad config / options)
FileFormatError (extends ValueError — file is not what we expected)
FileAccessError (extends OSError — file I/O failure)
Subclassing the stdlib bases means existing `except OSError` /
`except ValueError` handlers still catch them — no breaking change.
Helpers:
- ensure_dataframe(value, function=...) — uniform DataFrame guard
- ensure_choice(value, name=, choices=) — uniform enum/literal guard
- wrap_file_read(path, op, exc) — tag OSError with hint + path
- wrap_file_write(path, op, exc) — same, with Windows-aware tip
- format_for_user(exc, context=) — user-facing string for st.error / stderr
Library hardening:
- io.read_file: missing files surface FileAccessError listing whether
the parent directory exists, and the suggestion to check the path.
- io.read_file: chunk_size <= 0 now raises InputValidationError with
a positive-integer suggestion.
- io._read_excel: openpyxl BadZipFile / InvalidFileException / pandas
ValueError ("sheet not found") wrapped as FileFormatError listing
the path and a "list sheets with list_sheets()" hint.
- io._detect_excel_header_row: bare except narrowed to specific
openpyxl exceptions; falls back gracefully and logs at debug so
the real error surfaces from pd.read_excel.
- io.write_file: OSError / PermissionError on to_csv/to_excel wrapped
with file path and Windows-aware "file may be open in another
program" hint.
- dedup._parse_date: bare `except Exception` narrowed to
(TypeError, ValueError, OutOfBoundsDatetime); failed values
logged at debug for survivor-selection forensics.
- dedup._select_survivor: KEEP_MOST_RECENT now raises
InputValidationError instead of silently falling back to keep_first.
- dedup.deduplicate: input validation errors are InputValidationError
with operation/column/suggestion fields.
- format_standardize.from_dict: invalid FieldType for a column raises
ConfigError naming the column AND the bad value AND listing valid
values; same for date_order / phone_format / etc.
- format_standardize.from_file: OSError / JSON decode wrapped with
path AND line/column where parsing failed.
- format_standardize.to_file: TypeError on json.dumps wrapped as
ConfigError with the suspected source (extra_abbreviations).
- format_standardize._apply_field_type: dispatcher's "unknown field
type" branch now raises AssertionError (it's an internal invariant,
not user error — a new enum value was added without a branch).
- format_standardize._resolve_column_types: missing-column error now
InputValidationError with a "check for typos / unparsed header"
suggestion.
- format_standardize.standardize_dataframe: ensure_dataframe at entry.
- text_clean.clean_dataframe: ensure_dataframe at entry.
- config.to_strategies: invalid Algorithm/NormalizerType wrapped as
ConfigError naming the strategy index AND the column.
- config.to_survivor_rule: invalid SurvivorRule wrapped as ConfigError
listing valid values.
- config.from_file: OSError / JSON decode wrapped (mirror of
StandardizeOptions.from_file).
- fixes.repair_mojibake: ImportError on ftfy now logged at info level
with the underlying ImportError so a corrupt-package vs not-installed
distinction is visible in the logs.
- normalizers.normalize_phone: phonenumbers.NumberParseException now
logged at debug when the digits-only fallback drops extension /
country-code information — gives a trail when matching results
look wrong.
GUI / CLI surfaces:
- All 9 page handlers (`except Exception as e: st.error(...)`) now
use format_for_user(), which renders DataToolsError fields nicely
and falls back to "ClassName: message" for unrecognized errors.
- 2_Text_Cleaner and 3_Format_Standardizer additionally distinguish
UnicodeDecodeError with an "re-save as UTF-8" suggestion before
the generic handler.
- cli.py's "Error reading file" handler now uses format_for_user()
and includes the input path in the prefix.
Tests:
- tests/test_errors.py — 22 new tests covering: base class formatting,
stdlib inheritance, ensure_dataframe / ensure_choice helpers,
wrap_file_read / wrap_file_write, format_for_user behavior, and
end-to-end integration (missing file, missing dir, bad JSON, bad
algorithm, bad enum, missing column).
- tests/test_audit_fixes.py + tests/test_io.py — updated 4 tests for
the new exception types (InputValidationError replaces TypeError,
FileAccessError extends OSError).
Full project suite: 1230 passed, 4 skipped, 17 xfailed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes 16 high-value findings from a parallel cross-module review.
Refactors:
- New src/core/_constants.py centralizes USPS street-suffix
abbreviations, US state names, and 2-letter postal codes — one source
of truth for both normalize_address (matching keys) and
standardize_address (display formatting). Eliminates ~80 lines of
duplicated dicts across normalizers.py and format_standardize.py.
- format_standardize.py: collapse 4 identical nested _err() helpers
into one shared _err_or_passthrough() module function; drop a dead
duplicate `return _err("not a phone number")` branch in
standardize_phone.
- format_standardize.py: precompile per-locale month-name regexes
(_MONTH_LOCALE_PATTERNS) and per-state-name regexes
(_STATE_NAME_PATTERNS) at import time — they were rebuilt on every
cell, a measurable hot path on million-row inputs.
- dedup.py: extract _is_missing(value) helper; one definition of
"this cell is None / NaN / pd.NA" instead of two.
- fixes.py: extract _is_string_column(ser) helper; one dtype check
instead of three duplicates across _apply_to_strings,
_vectorized_translate, _vectorized_regex_sub.
Production-readiness:
- format_standardize.standardize_dataframe now logs a warning when
more than 10% of typed cells are unparseable — surfaces the
silently-broken-pipeline failure mode.
- StandardizeOptions.from_dict validates date_order / phone_format /
currency_decimal / name_case / boolean_style / *_error_policy
enum values up front, with a clear error message instead of a deep
crash inside the per-cell function.
- StandardizeOptions.from_file and DeduplicationConfig.from_file wrap
read + json.loads with descriptive OSError / ValueError messages
including the file path.
- standardize_date(month_locales=...) validates locale codes against
the available set instead of silently passing through unknown ones.
- io.read_file rejects chunk_size <= 0 (was silently failing inside
pandas) and logs the resolved suffix + chunk_size at info level so
data-pipeline runs are debuggable.
- io.read_file's FileNotFoundError gains explanatory context.
- io.write_file, text_clean.clean_dataframe, and dedup.deduplicate
now reject non-DataFrame inputs with clear TypeError instead of
cryptic pandas tracebacks downstream.
- dedup.deduplicate validates that survivor_rule=KEEP_MOST_RECENT has
a usable date_column up front; the helper _select_survivor now
raises (instead of silently falling back to keep_first) when called
directly with bad arguments.
- dedup.deduplicate gains a structured no-op return when strategies
is empty after auto-detection — preserves schema instead of crashing.
- analyze._detect_inconsistent_date_format narrows its bare except to
(TypeError, ValueError) and logs a debug line so genuine bugs don't
hide behind silent skip.
Tests:
- tests/test_audit_fixes.py grows by 11 cases covering the new
validation paths (chunk_size, DataFrame guards, KEEP_MOST_RECENT
date_column, enum validation, locale validation, JSON error wrapping).
Full project suite: 1208 passed, 4 skipped, 17 xfailed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes 12 bugs and 8 gaps surfaced by parallel audits across all core
modules, plus aligns the dedup-side normalizers with the new
format_standardize behavior where they had silently diverged.
Bugs (data integrity / correctness):
- dedup: NaN/None values matched as duplicates because str(None)='None'.
Two rows with missing email silently merged.
- dedup: removed_df had 0 columns when nothing was removed; downstream
code expecting matching schema broke. Now preserves column shape.
- dedup: ColumnMatchStrategy threshold accepted any value; out-of-range
silently broke matching. Validated to [0, 100] in __post_init__.
- dedup: strategy referencing a missing column was silently skipped.
Now raises ValueError listing available columns.
- fixes: replace_null_sentinels crashed on non-string sentinels (int/None
from JSON payload). Coerced to str.
- fixes: _vectorized_regex_sub raised raw re.error on bad patterns. Now
wraps as ValueError with clear message.
- io: detect_header_row mis-identified all-empty and metadata-only rows
as headers (all([]) is True). Now requires ≥2 non-empty cells.
- config: from_dict crashed when JSON had unknown fields, breaking
forward compat. Now filters to known fields.
- analyze: mixed-case email detector flagged all-None columns because
str(None)='None' contains both N and one. Now drops NaN before stringify.
New features and gap closures:
- io: _detect_excel_header_row mirrors detect_header_row for Excel via
openpyxl read-only; _read_excel uses it when header_row=None.
- io: write_file gains delimiter + encoding params; .tsv extension
defaults to tab.
- normalizers: normalize_phone preserves extensions as ;ext=N suffix.
- normalizers: normalize_address folds spelled-out US state names to
2-letter codes (California ≡ CA).
- normalizers: normalize_name drops surname particles (van, de, von)
so "Charles de Gaulle" ≡ "Charles Gaulle" for matching.
- analyze: new _detect_inconsistent_date_format detector flags columns
with mixed ISO/US/EU date shapes; routes to format standardizer.
- analyze: _NULL_LIKE recognizes "<na>" (pd.NA repr).
- analyze: duplicate-row finding renamed count → n_extra (rows that
would actually be removed) with clarified description.
- dedup: group_confidence no longer falsely 100.0 when transitive group
members lack a recorded direct pair; falls back to 100.0 only when
truly no pairs were observed.
- dedup: MatchResult / DeduplicationResult docstrings clarify that
row_indices refer to the input frame's positional index (output index
is reset).
- text_clean: visualize_hidden_html(None) now returns None (matches
visualize_hidden_text); strip_bom strips at most one BOM per call;
sentence_case dead elif branch removed.
Tests:
- tests/test_audit_fixes.py — 28 regression tests, one or more per
numbered finding, named after BUG/GAP/NIT tags so future readers
can trace each test back to its audit.
- tests/test_fixes_unit.py — 26 isolated unit tests for previously
integration-only fix functions (trim_whitespace, strip_nbsp,
strip_zero_width, normalize_line_endings, clean_headers,
repair_mojibake — last skipped if ftfy unavailable).
- tests/test_io.py — adds CSV / TSV / semicolon / UTF-8-BOM round-trip
tests + Excel auto-header-detection tests.
- tests/test_normalizers.py — adds 8 tests for the alignment work
above (phone extension, state names, particles).
Adds .claude/ to .gitignore (agent worktrees + local settings).
Full project suite: 1197 passed, 4 skipped, 17 xfailed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds src/core/format_standardize.py — a per-cell standardizer for dates,
phones, emails, addresses, names, currencies, booleans — wired through
StandardizeOptions / standardize_dataframe with FieldType registry.
Includes:
- Date parser handles ISO/US/EU/longform/excel-serial/unix-timestamp/
partial-precision/quarter notation; opt-in French/German/Spanish month
dictionaries via month_locales.
- Phone via libphonenumber with extension preservation (;ext=N), 001
international prefix handling, error sentinels for placeholders /
multi-number cells.
- Email lowercase/trim/mailto/angle-bracket strip with optional
--gmail-canonical mode.
- Address USPS abbreviation expansion or compression (expand=False per
corpus § 6.3), state-name → 2-letter conversion, multi-line collapse,
PO Box normalization, state-code preservation regardless of input case.
- Name handler: Mc/Mac/O'/D' inner caps, hyphen segments, particle
lowercasing (von/van/de/da), comma-format reversal, period stripping
for titles/suffixes/initials, PhD/MD acronym preservation, conservative
mode for mixed-case input.
- Currency: auto-detect EU vs US separators, space-thousands, Swiss
apostrophe, accounting parens, optional ISO code preservation, error
sentinels for percentages/ranges/word-values/ambiguous separators.
- Per-domain error_policy ("passthrough" | "sentinel") for surfacing
malformed values as <error: reason> per corpus § 0.3.
Test corpus from Business/DataTools/test-cases-format-cleaner copied to
test-cases/format-cleaner-corpus/ — 7 fixtures plus FORMATS-CASES.md.
tests/test_format_standardize_corpus.py drives all 199 rows through the
per-cell standardizers; 0 xfailed.
Wires the GUI page (3_Format_Standardizer.py) to "Ready" status.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>