From 38011872e1214d18fdd7a2ba79cd557ef34a0937 Mon Sep 17 00:00:00 2001
From: Michael
Date: Wed, 13 May 2026 15:16:24 +0000
Subject: [PATCH] docs(i18n): document language packs across user, dev, and marketing docs
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

README + USER-GUIDE describe the sidebar picker and current coverage
(home + shared chrome, per-tool bodies pending). DEVELOPER gains a
how-to for adding packs and keys with the parity-test guarantee.
TECHNICAL §10b records the in-house-JSON architecture and locks in the
no-gettext decision (also logged in DECISIONS). REQUIREMENTS reflects
the new interface surface and updated test count. COPY.md adds a
"Language claim" slot so landing/email work can pick it up.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 README.md            |  6 +++++-
 docs/DECISIONS.md    |  1 +
 docs/DEVELOPER.md    | 31 +++++++++++++++++++++++++++++++
 docs/REQUIREMENTS.md |  7 ++++---
 docs/TECHNICAL.md    | 19 +++++++++++++++++++
 docs/USER-GUIDE.md   | 11 +++++++++++
 marketing/COPY.md    |  2 ++
 7 files changed, 73 insertions(+), 4 deletions(-)

diff --git a/README.md b/README.md
index b88be10..683a64e 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,6 @@
 # DataTools
 
-Local CSV / Excel cleaning. CLI + browser GUI, no cloud, no install ceremony.
+Local CSV / Excel cleaning. CLI + browser GUI, no cloud, no install ceremony. GUI ships with English and Spanish language packs.
 
 ## Tools
 
@@ -56,6 +56,10 @@ python -m src.cli_analyze any_file.csv [--json] # scan only
 
 Every CLI runs preview-only by default; add `--apply` to write output.
 
+## Language
+
+The GUI sidebar has a language picker. Packs ship for **English** and **Español** (`src/i18n/packs/`); the choice persists for the session. Adding a language: drop a `.json` next to `en.json` mirroring its key tree, then list it in `LANGUAGES`. See [Developer Guide §i18n](docs/DEVELOPER.md#i18n--language-packs).
+
 ## Review & Normalize gate
 
 Every uploaded file passes through a CSV-normalization gate before any tool sees it. The analyzer flags ~15 issue types (whitespace, NBSP / zero-width chars, BOM, encoding, smart punct, dirty headers, null sentinels, mojibake, …) tagged by **confidence** (high / medium / low) and **fix action**. The GUI shows each finding with Auto-fix / Skip / Customize, a live before/after preview, and an encoding-override picker. Tool pages refuse to load until the gate passes.
diff --git a/docs/DECISIONS.md b/docs/DECISIONS.md
index 58142be..0111c9f 100644
--- a/docs/DECISIONS.md
+++ b/docs/DECISIONS.md
@@ -174,6 +174,7 @@ $49-79/bundle · $149 full suite (when 3+ exist).
 | Apr 28 (v1.6) | Fold conversation-history content into docs (deduplicator spec, lead bundle use cases, full GUI matrix, 04/06 examples, Streamlit-to-SaaS reasoning) | No new decisions; promote at-risk analysis from chat history per no-silent-drift rule. |
 | May 1 (v1.6) | Mark Format Standardizer **Ready** | 199-row buyer corpus passing; Tier 1 + most Tier 2 built. |
 | May 1 (v1.6) | Add `src/core/errors.py` structured hierarchy | Uniform helpful messages across CLI + GUI. See TECHNICAL §7. |
+| May 13 (v1.6) | Ship in-house JSON i18n + EN/ES packs | Expand addressable market (Spanish-first buyers, LatAm bookkeepers) without a `gettext` build step. JSON packs editable by non-devs; parity test prevents drift. See TECHNICAL §10b. |
 
 ## 8. Re-lock triggers
 
diff --git a/docs/DEVELOPER.md b/docs/DEVELOPER.md
index dcf0727..a9b0554 100644
--- a/docs/DEVELOPER.md
+++ b/docs/DEVELOPER.md
@@ -20,6 +20,7 @@ CLI (src/cli*.py) GUI (src/gui/app.py + pages/)
 
 | Module | Public surface |
 |--------|----------------|
+| `i18n` | `t(key, lang=None, **fmt)`, `current_language()`, `set_language()`, `render_language_selector()`, `LANGUAGES` |
 | `core.dedup` | `deduplicate()`, `MatchStrategy`, `ColumnMatchStrategy`, `Algorithm`, `SurvivorRule`, `DeduplicationResult`, `MatchResult`, `build_default_strategies()` |
 | `core.normalizers` | `normalize_email/phone/name/address/string`, `NormalizerType`, `get_normalizer()` |
 | `core.io` | `read_file()`, `write_file()`, `list_sheets()`, `detect_encoding/delimiter/header_row`, `repair_bytes()` |
@@ -95,6 +96,36 @@ DeduplicationResult # deduplicated_df, removed_df, match_groups, l
 
 No other call sites change. Gate auto-discovers it via the registry.
 
+### i18n — language packs
+
+The GUI's user-facing strings live in `src/i18n/packs/.json`, keyed by ISO-639-1 code. English (`en.json`) is canonical; missing keys in other packs fall back to English, and missing keys in English fall back to the literal dotted key so a typo is visible rather than silent.
+
+**Look up a string in code:**
+```python
+from src.i18n import t
+st.button(t("upload.run_button"))
+st.warning(t("gate.warning", name=filename))  # {name} interpolated via str.format
+```
+
+`t()` reads the active language from `st.session_state["ui_lang"]`. Outside a Streamlit run (tests, scripts) it falls back to English.
+
+**Add a new language:**
+1. Copy `src/i18n/packs/en.json` to `src/i18n/packs/.json` and translate values in place. Keep the key tree identical.
+2. Add a one-line entry to `LANGUAGES` in `src/i18n/__init__.py`: `{"code": "fr", "label": "Français"}`. The sidebar picker auto-renders.
+3. Run `pytest tests/test_lang_packs.py` — the parity test fails until every key from `en.json` exists in the new pack (and orphan keys not in English are also flagged).
+
+**Add a new key:**
+1. Add it to `en.json` first (canonical pack).
+2. Add it to every other registered pack in the same commit. The parity test enforces this.
+3. Use the dotted key at the call site: `t("section.subsection.key")` or `t("section.key", name=value)` for placeholder interpolation.
+
+**Authoring rules:**
+- Keys live under semantic sections (`home.*`, `upload.*`, `findings.*`, `tools..name`). Don't nest by language or by tool unless the string is genuinely tool-specific.
+- Use `{named}` placeholders (not positional `{0}`) so translators see what's being interpolated.
+- Strings can contain Streamlit markdown (`**bold**`) — pass through `st.markdown` / `st.caption` as usual.
+- Do **not** put strings inside the farewell-overlay JS payload without going through `_js_html_safe()` in `src/gui/components/_legacy.py`; the helper escapes both the JS string terminator and HTML special chars. The test `TestFarewellEscape` pins that contract.
+- The sidebar picker is mounted by `hide_streamlit_chrome()`, so every page that calls that helper automatically gets the picker. Pages that don't call it (rare) can call `render_language_selector()` directly.
+
 ### Add a format-standardizer field type
 
 1. Add value to `FieldType` enum in `core/format_standardize.py`.
diff --git a/docs/REQUIREMENTS.md b/docs/REQUIREMENTS.md
index 2591b0f..193c81d 100644
--- a/docs/REQUIREMENTS.md
+++ b/docs/REQUIREMENTS.md
@@ -114,10 +114,11 @@ and proceeds.
 - Result keyed by upload SHA-256; survives reload, invalidated on re-upload.
 
 ## 13. Interfaces
-- **GUI**: Streamlit, browser-based, local, no internet.
-- **CLI**: `python -m src.cli` (dedup) · `src.cli_text_clean` · `src.cli_format` · `src.cli_missing` · `src.cli_column_map` · `src.cli_pipeline` · `src.cli_analyze`.
+- **GUI**: Streamlit, browser-based, local, no internet. Sidebar language picker (English, Español).
+- **CLI**: `python -m src.cli` (dedup) · `src.cli_text_clean` · `src.cli_format` · `src.cli_missing` · `src.cli_column_map` · `src.cli_pipeline` · `src.cli_analyze`. (CLI output is English-only.)
 - **Python API**: `from src.core import …` (analyze, repair_bytes, clean_dataframe, deduplicate, standardize_dataframe, …).
 - **JSON output**: `--json` on `cli_analyze`.
+- **Language packs**: `from src.i18n import t, LANGUAGES`. Add `.json` to `src/i18n/packs/` + entry in `LANGUAGES` to add a language.
 
 ## 14. Platforms
 - Python ≥ 3.10.
@@ -133,7 +134,7 @@ and proceeds.
 - **Dev**: pytest, tox.
 
 ## 16. Test coverage
-- 1,729 tests passing, 0 skipped, 0 xfailed.
+- 1,762 tests passing, 0 skipped, 0 xfailed.
 - Fixture corpora: text-cleaner (21), encodings (31), reference UTF-8 (9), format-cleaner (199 buyer cases + 20-row international stress fixture), missing-handler (3 use cases + 16 edge cases), column-mapper (3 use cases + 5 edge cases).
 - Run: `python run_tests.py [--tool …] [--fixtures] [--coverage]`.
 
diff --git a/docs/TECHNICAL.md b/docs/TECHNICAL.md
index 87e07db..136579b 100644
--- a/docs/TECHNICAL.md
+++ b/docs/TECHNICAL.md
@@ -38,6 +38,9 @@ src/
     app.py            # Streamlit entry point
     pages/            # One page per tool
     components/       # shared, dedup_review, findings, gate, _legacy
+  i18n/               # GUI language packs (JSON-backed, in-house lookup)
+    __init__.py       # t() · current_language() · render_language_selector()
+    packs/            # en.json, es.json, … (one file per language)
 build/                # PyInstaller spec, launcher, OS-specific configs
 demo/                 # Constrained Streamlit Community Cloud version
 tests/                # pytest; targets core/, not UI
@@ -220,6 +223,22 @@ Deliberately separate. Confluent original spec was wrong.
 - `-999` sentinel — 04 converts to `NaN` first; 06 then computes stats.
 - Suspicious-but-plausible (age 110) — 06 territory.
 
+## 10b. GUI internationalization (i18n)
+
+The GUI uses an in-house, JSON-backed translation layer at `src/i18n/`. **No** `gettext` / `babel` / `.po` pipeline — the surface is small enough that a 100-line module + per-language JSON file is a better fit than a build-time toolchain.
+
+**Resolution model**: `t(key, lang=None, **fmt)` walks a dotted key (`home.title`, `tools.01_deduplicator.name`) through a nested dict. Fallback chain: requested lang → English (canonical) → the literal key. Missing format placeholders return the raw template rather than raising so a translation file cannot crash the UI.
+
+**Active language** is stored in `st.session_state["ui_lang"]`. Reading it outside a Streamlit run (tests, scripts) silently falls back to English, keeping the module importable without Streamlit context.
+
+**Picker placement**: `hide_streamlit_chrome()` calls `render_language_selector()` on every page that hides Streamlit's default chrome — i.e., the entire app. One mount point, every page picks it up.
+
+**Pack parity** is a tested invariant: `tests/test_lang_packs.py::TestPackParity` fails CI when `en.json` and another pack diverge in either direction. This catches translation drift at PR time rather than from buyer reports.
+
+**Farewell overlay**: the shutdown screen's JS payload interpolates pack strings into an `innerHTML` inside a JS single-quoted string. `_js_html_safe()` in `components/_legacy.py` escapes both the JS string terminator (`'`) and HTML special chars (`< > &`). The test `TestFarewellEscape` pins this; never bypass it.
+
+**Why not gettext**: zero compiled artifacts in the PyInstaller bundle, no build step before tests run, no `.po`/`.mo` round-trip for translators (anyone can edit JSON), and the same lookup works in unit tests without process state. Locked in because the surface won't grow large enough to need the alternative, and the alternative breaks the "drop a file, run pytest, ship" loop.
+
 ## 11. Per-script functional specs
 
 Specs live in this section as scripts enter active build. Each follows the Tier 1/2/3 structure with explicit strategic framing (what's the market gap given some of this is free elsewhere).
diff --git a/docs/USER-GUIDE.md b/docs/USER-GUIDE.md
index 54f6e8b..a36d042 100644
--- a/docs/USER-GUIDE.md
+++ b/docs/USER-GUIDE.md
@@ -80,6 +80,17 @@ If you skip the Pipeline Runner, follow this order:
 
 The Pipeline Runner enforces this automatically.
 
+### 3.4 Language
+
+The sidebar has a **Language / Idioma** picker. Two packs ship today:
+
+- **English** (default)
+- **Español**
+
+Pick a language once — the choice persists for the session and the picker is visible from every page. Switch any time; the page re-renders in place with no data loss.
+
+**Coverage** (v1.6): home page, tool cards, the upload + analysis panel, the findings list, the Review & Normalize gate prompt, the sidebar picker, and the shutdown screen. Per-tool page bodies (advanced-option labels, column-mapper prompts, dedup review labels) are tracked for future packs — they currently render in English in both modes. If a string you'd expect to switch doesn't, that's a missing pack key, not a bug in the picker; email support with a screenshot.
+
 ## 4. Review & Normalize gate
 
 Every uploaded file is scanned before any tool sees it.
diff --git a/marketing/COPY.md b/marketing/COPY.md
index e23dbb5..03580c3 100644
--- a/marketing/COPY.md
+++ b/marketing/COPY.md
@@ -26,6 +26,7 @@ short`) rather than editing in place.
 | Privacy claim | Your data never leaves your computer. |
 | Audit claim | Every change logged to a CSV-format audit trail. |
 | Format claim | $ £ € ¥ R$ kr zł and 50+ phone-country codes — handled. |
+| Language claim | GUI available in English and Español. |
 | Support email | support@datatools.app |
 | Distribution URL | https://datatools.gumroad.com/l/datatools |
 
@@ -188,3 +189,4 @@ ships from a known state.
 | Date | Slot | Old → New | Why |
 |------|------|-----------|-----|
 | 2026-05-01 | (initial) | — | First SoT extracted from landing pages 1.0 |
+| 2026-05-13 | Language claim (new) | — → "GUI available in English and Español." | Ships v1.6 i18n: EN + ES packs in GUI sidebar. Expands addressable market without a CLI/copy rebuild. |