docs(i18n): document language packs across user, dev, and marketing docs

README + USER-GUIDE describe the sidebar picker and current coverage (home + shared chrome, per-tool bodies pending). DEVELOPER gains a how-to for adding packs and keys with the parity-test guarantee. TECHNICAL §10b records the in-house-JSON architecture and locks in the no-gettext decision (also logged in DECISIONS). REQUIREMENTS reflects the new interface surface and updated test count. COPY.md adds a "Language claim" slot so landing/email work can pick it up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -1,6 +1,6 @@
 # DataTools
 
-Local CSV / Excel cleaning. CLI + browser GUI, no cloud, no install ceremony.
+Local CSV / Excel cleaning. CLI + browser GUI, no cloud, no install ceremony. GUI ships with English and Spanish language packs.
 
 ## Tools
 
@@ -56,6 +56,10 @@ python -m src.cli_analyze any_file.csv [--json] # scan only
 
 Every CLI runs preview-only by default; add `--apply` to write output.
 
+## Language
+
+The GUI sidebar has a language picker. Packs ship for **English** and **Español** (`src/i18n/packs/`); the choice persists for the session. Adding a language: drop a `<code>.json` next to `en.json` mirroring its key tree, then list it in `LANGUAGES`. See [Developer Guide §i18n](docs/DEVELOPER.md#i18n--language-packs).
+
 ## Review & Normalize gate
 
 Every uploaded file passes through a CSV-normalization gate before any tool sees it. The analyzer flags ~15 issue types (whitespace, NBSP / zero-width chars, BOM, encoding, smart punct, dirty headers, null sentinels, mojibake, …) tagged by **confidence** (high / medium / low) and **fix action**. The GUI shows each finding with Auto-fix / Skip / Customize, a live before/after preview, and an encoding-override picker. Tool pages refuse to load until the gate passes.
@@ -174,6 +174,7 @@ $49-79/bundle · $149 full suite (when 3+ exist).
 | Apr 28 (v1.6) | Fold conversation-history content into docs (deduplicator spec, lead bundle use cases, full GUI matrix, 04/06 examples, Streamlit-to-SaaS reasoning) | No new decisions; promote at-risk analysis from chat history per no-silent-drift rule. |
 | May 1 (v1.6) | Mark Format Standardizer **Ready** | 199-row buyer corpus passing; Tier 1 + most Tier 2 built. |
 | May 1 (v1.6) | Add `src/core/errors.py` structured hierarchy | Uniform helpful messages across CLI + GUI. See TECHNICAL §7. |
+| May 13 (v1.6) | Ship in-house JSON i18n + EN/ES packs | Expand addressable market (Spanish-first buyers, LatAm bookkeepers) without a `gettext` build step. JSON packs editable by non-devs; parity test prevents drift. See TECHNICAL §10b. |
 
 ## 8. Re-lock triggers
 
@@ -20,6 +20,7 @@ CLI (src/cli*.py) GUI (src/gui/app.py + pages/)
 
 | Module | Public surface |
 |--------|----------------|
+| `i18n` | `t(key, lang=None, **fmt)`, `current_language()`, `set_language()`, `render_language_selector()`, `LANGUAGES` |
 | `core.dedup` | `deduplicate()`, `MatchStrategy`, `ColumnMatchStrategy`, `Algorithm`, `SurvivorRule`, `DeduplicationResult`, `MatchResult`, `build_default_strategies()` |
 | `core.normalizers` | `normalize_email/phone/name/address/string`, `NormalizerType`, `get_normalizer()` |
 | `core.io` | `read_file()`, `write_file()`, `list_sheets()`, `detect_encoding/delimiter/header_row`, `repair_bytes()` |
@@ -95,6 +96,36 @@ DeduplicationResult # deduplicated_df, removed_df, match_groups, l
 
 No other call sites change. Gate auto-discovers it via the registry.
 
+### i18n — language packs
+
+The GUI's user-facing strings live in `src/i18n/packs/<code>.json`, keyed by ISO-639-1 code. English (`en.json`) is canonical; missing keys in other packs fall back to English, and missing keys in English fall back to the literal dotted key so a typo is visible rather than silent.
+
+**Look up a string in code:**
+```python
+from src.i18n import t
+st.button(t("upload.run_button"))
+st.warning(t("gate.warning", name=filename)) # {name} interpolated via str.format
+```
+
+`t()` reads the active language from `st.session_state["ui_lang"]`. Outside a Streamlit run (tests, scripts) it falls back to English.
+
+**Add a new language:**
+1. Copy `src/i18n/packs/en.json` to `src/i18n/packs/<code>.json` and translate values in place. Keep the key tree identical.
+2. Add a one-line entry to `LANGUAGES` in `src/i18n/__init__.py`: `{"code": "fr", "label": "Français"}`. The sidebar picker auto-renders.
+3. Run `pytest tests/test_lang_packs.py` — the parity test fails until every key from `en.json` exists in the new pack (and orphan keys not in English are also flagged).
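
Step 1 of the list above could be scripted along these lines. This is a minimal sketch, not part of the commit: `scaffold_pack` is a hypothetical helper name, and it only assumes that packs are plain JSON files living side by side in one directory, as the text describes.

```python
import json
from pathlib import Path


def scaffold_pack(packs_dir: Path, code: str) -> Path:
    """Copy en.json to <code>.json so the new pack starts with an identical key tree.

    Hypothetical helper for illustration; the values still need translating by hand.
    """
    source = packs_dir / "en.json"
    target = packs_dir / f"{code}.json"
    if target.exists():
        # Refuse to overwrite a pack that someone may already be translating.
        raise FileExistsError(f"{target} already exists")
    english = json.loads(source.read_text(encoding="utf-8"))
    # ensure_ascii=False keeps accented characters readable for translators.
    target.write_text(
        json.dumps(english, ensure_ascii=False, indent=2) + "\n",
        encoding="utf-8",
    )
    return target
```

Because the copy starts out key-for-key identical to English, a parity check passes immediately and only begins to fail if a key is later renamed or dropped while translating.
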
+
+**Add a new key:**
+1. Add it to `en.json` first (canonical pack).
+2. Add it to every other registered pack in the same commit. The parity test enforces this.
+3. Use the dotted key at the call site: `t("section.subsection.key")` or `t("section.key", name=value)` for placeholder interpolation.
+
+**Authoring rules:**
+- Keys live under semantic sections (`home.*`, `upload.*`, `findings.*`, `tools.<id>.name`). Don't nest by language or by tool unless the string is genuinely tool-specific.
+- Use `{named}` placeholders (not positional `{0}`) so translators see what's being interpolated.
+- Strings can contain Streamlit markdown (`**bold**`) — pass through `st.markdown` / `st.caption` as usual.
+- Do **not** put strings inside the farewell-overlay JS payload without going through `_js_html_safe()` in `src/gui/components/_legacy.py`; the helper escapes both the JS string terminator and HTML special chars. The test `TestFarewellEscape` pins that contract.
+- The sidebar picker is mounted by `hide_streamlit_chrome()`, so every page that calls that helper automatically gets the picker. Pages that don't call it (rare) can call `render_language_selector()` directly.
+
 ### Add a format-standardizer field type
 
 1. Add value to `FieldType` enum in `core/format_standardize.py`.
@@ -114,10 +114,11 @@ and proceeds.
 - Result keyed by upload SHA-256; survives reload, invalidated on re-upload.
 
 ## 13. Interfaces
-- **GUI**: Streamlit, browser-based, local, no internet.
-- **CLI**: `python -m src.cli` (dedup) · `src.cli_text_clean` · `src.cli_format` · `src.cli_missing` · `src.cli_column_map` · `src.cli_pipeline` · `src.cli_analyze`.
+- **GUI**: Streamlit, browser-based, local, no internet. Sidebar language picker (English, Español).
+- **CLI**: `python -m src.cli` (dedup) · `src.cli_text_clean` · `src.cli_format` · `src.cli_missing` · `src.cli_column_map` · `src.cli_pipeline` · `src.cli_analyze`. (CLI output is English-only.)
 - **Python API**: `from src.core import …` (analyze, repair_bytes, clean_dataframe, deduplicate, standardize_dataframe, …).
 - **JSON output**: `--json` on `cli_analyze`.
+- **Language packs**: `from src.i18n import t, LANGUAGES`. Add `<code>.json` to `src/i18n/packs/` + entry in `LANGUAGES` to add a language.
 
 ## 14. Platforms
 - Python ≥ 3.10.
@@ -133,7 +134,7 @@ and proceeds.
 - **Dev**: pytest, tox.
 
 ## 16. Test coverage
-- 1,729 tests passing, 0 skipped, 0 xfailed.
+- 1,762 tests passing, 0 skipped, 0 xfailed.
 - Fixture corpora: text-cleaner (21), encodings (31), reference UTF-8 (9), format-cleaner (199 buyer cases + 20-row international stress fixture), missing-handler (3 use cases + 16 edge cases), column-mapper (3 use cases + 5 edge cases).
 - Run: `python run_tests.py [--tool …] [--fixtures] [--coverage]`.
 
@@ -38,6 +38,9 @@ src/
 app.py # Streamlit entry point
 pages/ # One page per tool
 components/ # shared, dedup_review, findings, gate, _legacy
+i18n/ # GUI language packs (JSON-backed, in-house lookup)
+__init__.py # t() · current_language() · render_language_selector()
+packs/ # en.json, es.json, … (one file per language)
 build/ # PyInstaller spec, launcher, OS-specific configs
 demo/ # Constrained Streamlit Community Cloud version
 tests/ # pytest; targets core/, not UI
@@ -220,6 +223,22 @@ Deliberately separate. Confluent original spec was wrong.
 - `-999` sentinel — 04 converts to `NaN` first; 06 then computes stats.
 - Suspicious-but-plausible (age 110) — 06 territory.
 
+## 10b. GUI internationalization (i18n)
+
+The GUI uses an in-house, JSON-backed translation layer at `src/i18n/`. **No** `gettext` / `babel` / `.po` pipeline — the surface is small enough that a 100-line module + per-language JSON file is a better fit than a build-time toolchain.
+
+**Resolution model**: `t(key, lang=None, **fmt)` walks a dotted key (`home.title`, `tools.01_deduplicator.name`) through a nested dict. Fallback chain: requested lang → English (canonical) → the literal key. Missing format placeholders return the raw template rather than raising so a translation file cannot crash the UI.
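
The resolution model described above might look roughly like the following. This is a sketch under stated assumptions rather than the module's actual code: `PACKS` is a hypothetical in-memory stand-in for the JSON files, `_walk` is an invented helper name, and `lang=None` is simply treated as English here because the session lookup is left out.

```python
from typing import Any, Dict, Optional

# Hypothetical in-memory packs; the real ones live in src/i18n/packs/*.json.
PACKS: Dict[str, Dict[str, Any]] = {
    "en": {
        "home": {"title": "Data tools"},
        "gate": {"warning": "Review {name} first"},
    },
    "es": {"home": {"title": "Herramientas de datos"}},
}


def _walk(pack: Dict[str, Any], key: str) -> Optional[str]:
    """Follow a dotted key through nested dicts; None if any segment is missing."""
    node: Any = pack
    for part in key.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node if isinstance(node, str) else None


def t(key: str, lang: Optional[str] = None, **fmt: Any) -> str:
    """Resolve key: requested language, then English, then the literal key."""
    template = _walk(PACKS.get(lang or "en", {}), key)
    if template is None:
        template = _walk(PACKS["en"], key)
    if template is None:
        return key  # a typo shows up on screen as the dotted key itself
    try:
        return template.format(**fmt)
    except (KeyError, IndexError):
        # Missing placeholder: hand back the raw template instead of raising.
        return template
```

With these sample packs, `t("gate.warning", lang="es", name="a.csv")` falls through to the English template because the Spanish pack lacks that key, and `t("gate.warning")` returns the template with `{name}` left unfilled.
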
+
+**Active language** is stored in `st.session_state["ui_lang"]`. Reading it outside a Streamlit run (tests, scripts) silently falls back to English, keeping the module importable without Streamlit context.
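
One way to picture that fallback is shown below. It is an illustrative sketch only: passing the session state in as a parameter is an assumption made so the example runs without Streamlit installed, and `DEFAULT_LANG` and `SUPPORTED` are hypothetical names.

```python
from typing import Any, Mapping, Optional

DEFAULT_LANG = "en"
SUPPORTED = ("en", "es")  # hypothetical; mirrors the two packs that ship


def current_language(session_state: Optional[Mapping[str, Any]] = None) -> str:
    """Return the active UI language, falling back to English.

    session_state stands in for st.session_state; None models a test or
    script where no Streamlit session exists.
    """
    if session_state is None:
        return DEFAULT_LANG
    lang = session_state.get("ui_lang", DEFAULT_LANG)
    # An unknown or stale code also degrades to English rather than failing.
    return lang if lang in SUPPORTED else DEFAULT_LANG
```

Treating an unrecognized code the same as a missing session is a design choice of this sketch; the text only specifies the behaviour outside a Streamlit run.
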
+
+**Picker placement**: `hide_streamlit_chrome()` calls `render_language_selector()` on every page that hides Streamlit's default chrome — i.e., the entire app. One mount point, every page picks it up.
+
+**Pack parity** is a tested invariant: `tests/test_lang_packs.py::TestPackParity` fails CI when `en.json` and another pack diverge in either direction. This catches translation drift at PR time rather than from buyer reports.
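
A two-directional parity check of this kind can be sketched by flattening each pack to its set of dotted keys and comparing the sets. The function names below are hypothetical and the contents of `TestPackParity` are not shown in this commit, so treat this as one plausible shape rather than the actual test.

```python
from typing import Any, Dict, List, Set


def flatten_keys(pack: Dict[str, Any], prefix: str = "") -> Set[str]:
    """Collect every leaf as a dotted key, e.g. {'home': {'title': ...}} -> {'home.title'}."""
    keys: Set[str] = set()
    for name, value in pack.items():
        dotted = f"{prefix}.{name}" if prefix else name
        if isinstance(value, dict):
            keys |= flatten_keys(value, dotted)
        else:
            keys.add(dotted)
    return keys


def parity_report(canonical: Dict[str, Any], other: Dict[str, Any]) -> Dict[str, List[str]]:
    """List keys missing from the other pack and orphan keys absent from the canonical one."""
    canon_keys = flatten_keys(canonical)
    other_keys = flatten_keys(other)
    return {
        "missing": sorted(canon_keys - other_keys),
        "orphaned": sorted(other_keys - canon_keys),
    }
```

A test built on this would assert that both lists are empty for every registered pack, which covers divergence in either direction.
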
+
+**Farewell overlay**: the shutdown screen's JS payload interpolates pack strings into an `innerHTML` inside a JS single-quoted string. `_js_html_safe()` in `components/_legacy.py` escapes both the JS string terminator (`'`) and HTML special chars (`< > &`). The test `TestFarewellEscape` pins this; never bypass it.
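
To illustrate why both layers of escaping matter, here is a rough sketch of a helper with that contract. It is not the body of `_js_html_safe()`, which this commit does not show; `js_html_safe_sketch` is a hypothetical name, and the extra handling of backslashes and line breaks is an assumption beyond what the text lists.

```python
import html


def js_html_safe_sketch(text: str) -> str:
    """Escape text for use as innerHTML inside a single-quoted JS string literal."""
    # HTML layer first: & < > become entities so the string cannot inject markup.
    escaped = html.escape(text, quote=False)
    # JS layer second: backslashes before quotes, so added escapes are not doubled.
    escaped = escaped.replace("\\", "\\\\")
    escaped = escaped.replace("'", "\\'")
    # Raw line breaks would end a JS string literal early (assumed extra step).
    escaped = escaped.replace("\r", "\\r").replace("\n", "\\n")
    return escaped
```

The order matters: escaping HTML first and the JS string second means the entities produced by the first pass contain no characters that the second pass needs to touch.
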
+
+**Why not gettext**: zero compiled artifacts in the PyInstaller bundle, no build step before tests run, no `.po`/`.mo` round-trip for translators (anyone can edit JSON), and the same lookup works in unit tests without process state. Locked in because the surface won't grow large enough to need the alternative, and the alternative breaks the "drop a file, run pytest, ship" loop.
+
 ## 11. Per-script functional specs
 
 Specs live in this section as scripts enter active build. Each follows the Tier 1/2/3 structure with explicit strategic framing (what's the market gap given some of this is free elsewhere).
@@ -80,6 +80,17 @@ If you skip the Pipeline Runner, follow this order:
 
 The Pipeline Runner enforces this automatically.
 
+### 3.4 Language
+
+The sidebar has a **Language / Idioma** picker. Two packs ship today:
+
+- **English** (default)
+- **Español**
+
+Pick a language once — the choice persists for the session and the picker is visible from every page. Switch any time; the page re-renders in place with no data loss.
+
+**Coverage** (v1.6): home page, tool cards, the upload + analysis panel, the findings list, the Review & Normalize gate prompt, the sidebar picker, and the shutdown screen. Per-tool page bodies (advanced-option labels, column-mapper prompts, dedup review labels) are tracked for future packs — they currently render in English in both modes. If a string you'd expect to switch doesn't, that's a missing pack key, not a bug in the picker; email support with a screenshot.
+
 ## 4. Review & Normalize gate
 
 Every uploaded file is scanned before any tool sees it.
@@ -26,6 +26,7 @@ short`) rather than editing in place.
 | Privacy claim | Your data never leaves your computer. |
 | Audit claim | Every change logged to a CSV-format audit trail. |
 | Format claim | $ £ € ¥ R$ kr zł and 50+ phone-country codes — handled. |
+| Language claim | GUI available in English and Español. |
 | Support email | support@datatools.app |
 | Distribution URL | https://datatools.gumroad.com/l/datatools |
 
@@ -188,3 +189,4 @@ ships from a known state.
 | Date | Slot | Old → New | Why |
 |------|------|-----------|-----|
 | 2026-05-01 | (initial) | — | First SoT extracted from landing pages 1.0 |
+| 2026-05-13 | Language claim (new) | — → "GUI available in English and Español." | Ships v1.6 i18n: EN + ES packs in GUI sidebar. Expands addressable market without a CLI/copy rebuild. |