datatools-dev/docs/DEVELOPER.md
Michael 38011872e1 docs(i18n): document language packs across user, dev, and marketing docs
README + USER-GUIDE describe the sidebar picker and current coverage
(home + shared chrome, per-tool bodies pending). DEVELOPER gains a
how-to for adding packs and keys with the parity-test guarantee.
TECHNICAL §10b records the in-house-JSON architecture and locks in the
no-gettext decision (also logged in DECISIONS). REQUIREMENTS reflects
the new interface surface and updated test count. COPY.md adds a
"Language claim" slot so landing/email work can pick it up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 15:16:24 +00:00


# Developer Guide
Architecture, data flow, extension points.
## Architecture
```
CLI (src/cli*.py)        GUI (src/gui/app.py + pages/)
        │                        │
        └──────────┐  ┌──────────┘
                   ▼  ▼
            ┌────────────────┐
            │   src/core/    │
            └────────────────┘
```
**Core/UI rule**: business logic in `core/` only. CLI + GUI translate user input → core call → display result.
## Module map
| Module | Public surface |
|--------|----------------|
| `i18n` | `t(key, lang=None, **fmt)`, `current_language()`, `set_language()`, `render_language_selector()`, `LANGUAGES` |
| `core.dedup` | `deduplicate()`, `MatchStrategy`, `ColumnMatchStrategy`, `Algorithm`, `SurvivorRule`, `DeduplicationResult`, `MatchResult`, `build_default_strategies()` |
| `core.normalizers` | `normalize_email/phone/name/address/string`, `NormalizerType`, `get_normalizer()` |
| `core.io` | `read_file()`, `write_file()`, `list_sheets()`, `detect_encoding/delimiter/header_row`, `repair_bytes()` |
| `core.config` | `DeduplicationConfig.from_file/to_file/to_strategies/to_survivor_rule` |
| `core.analyze` | `analyze()`, `Finding`, `findings_by_tool()`, `_NULL_LIKE` |
| `core.fixes` | `@register("fix_id")` decorator, `get_fix()`, `available_actions()` |
| `core.normalize` | `auto_fix()`, `apply_decisions()`, `NormalizationResult`, `is_normalized()` |
| `core.text_clean` | `clean_dataframe()`, `CleanOptions`, `CleanResult`, `smart_title_case()` |
| `core.format_standardize` | `standardize_dataframe()`, `StandardizeOptions`, `StandardizeResult`, `FieldType`, per-cell `standardize_*()` |
| `core.errors` | `DataToolsError` hierarchy, `ensure_dataframe()`, `ensure_choice()`, `wrap_file_read/write()`, `format_for_user()` |
| `core._constants` | `US_STATE_NAMES`, `US_STATE_CODES`, `USPS_EXPANSIONS`, `USPS_COMPRESSIONS` |
## Data flow — Deduplicator
```
read_file()                  # auto-detect encoding, delimiter, header
    ▼ DataFrame
build_default_strategies()   # if no explicit strategies
    ▼                        # strong keys (email, phone) → standalone OR
                             # weak keys (name, address) → AND with strong
_apply_normalizations()      # add _norm_* shadow columns
_find_match_groups()         # O(n²) pair compare, OR strategies, union-find
[review_callback()]          # optional interactive review
_select_survivor()           # per group: first/last/most-complete/most-recent
[_merge_group()]             # optional: fill blanks from losers
DeduplicationResult          # deduplicated_df, removed_df, match_groups, log
```
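The grouping step can be pictured with a small, dependency-free sketch. It is not the code behind `_find_match_groups()`; it only assumes what the flow above states (every pair compared once, strategies OR'd, union-find to merge transitive matches). The function name `find_match_groups`, the dict-based rows, and the "all listed columns equal and non-empty" matching rule are illustrative assumptions.

```python
from itertools import combinations


def find_match_groups(rows, strategies):
    """Return groups (lists of row indices) that contain at least two rows.

    rows: list of dicts holding already-normalized values (assumption).
    strategies: list of column-name tuples. A pair satisfies a strategy when
    every listed column is non-empty and equal (AND); strategies are OR'd.
    """
    parent = list(range(len(rows)))

    def find(i):
        # Walk up to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)

    def pair_matches(a, b):
        return any(
            all(a.get(col) and a.get(col) == b.get(col) for col in cols)
            for cols in strategies
        )

    # O(n^2): every pair of rows is compared exactly once.
    for i, j in combinations(range(len(rows)), 2):
        if pair_matches(rows[i], rows[j]):
            union(i, j)

    groups = {}
    for i in range(len(rows)):
        groups.setdefault(find(i), []).append(i)
    return [members for members in groups.values() if len(members) > 1]


rows = [
    {"email": "a@x.com", "name": "ann lee", "address": "1 main st"},
    {"email": "a@x.com", "name": "a. lee", "address": ""},
    {"email": "", "name": "a. lee", "address": ""},
    {"email": "b@y.com", "name": "bob ray", "address": "2 oak ave"},
    {"email": "", "name": "bob ray", "address": "2 oak ave"},
]
groups = find_match_groups(rows, [("email",), ("name", "address")])
```

Rows 0 and 1 share an email, rows 3 and 4 share name and address, and row 2 stays alone because a blank address never counts as a match in this sketch.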
## Extension recipes
### Add a normalizer
1. Add function to `core/normalizers.py`:
```python
import re
from typing import Optional

def normalize_company(value: Optional[str]) -> str:
    if not value or not isinstance(value, str):
        return ""
    name = value.strip().casefold()
    for sfx in ("inc", "llc", "corp", "ltd", "co"):
        # Drop a trailing legal suffix, with or without a final period.
        name = re.sub(rf"\b{sfx}\.?\s*$", "", name).strip()
    return name
```
2. Register: add `COMPANY = "company"` to `NormalizerType` + entry in `_NORMALIZER_MAP`.
3. Auto-detect (optional): add a `_COLUMN_TYPE_PATTERNS` row in `core/dedup.py`.
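As a rough illustration of step 2, the sketch below registers a normalizer in an enum-keyed map. The `NormalizerType`, `_NORMALIZER_MAP`, and `get_normalizer()` defined here are stand-ins written for this example; the real definitions in `core/normalizers.py` may have different members and signatures.

```python
import re
from enum import Enum
from typing import Callable, Dict, Optional


def normalize_company(value: Optional[str]) -> str:
    if not value or not isinstance(value, str):
        return ""
    name = value.strip().casefold()
    for sfx in ("inc", "llc", "corp", "ltd", "co"):
        name = re.sub(rf"\b{sfx}\.?\s*$", "", name).strip()
    return name


class NormalizerType(Enum):  # stand-in for the real enum
    COMPANY = "company"  # the new member


# Stand-in for the real map: enum member -> normalizer function.
_NORMALIZER_MAP: Dict[NormalizerType, Callable[[Optional[str]], str]] = {
    NormalizerType.COMPANY: normalize_company,
}


def get_normalizer(kind: NormalizerType) -> Callable[[Optional[str]], str]:
    # Assumed lookup shape; raises KeyError for an unregistered type.
    return _NORMALIZER_MAP[kind]


company = get_normalizer(NormalizerType.COMPANY)
```

With this in place, `company("Acme Widgets LLC")` and `company("  ACME WIDGETS  ")` both reduce to the same comparison key.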
### Add a fuzzy algorithm
1. Add value to `Algorithm` enum in `core/dedup.py`.
2. Add case in `_compute_similarity()`.
3. Document the value in CLI help text.
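A minimal sketch of steps 1 and 2, assuming the similarity function dispatches on the enum and returns a score between 0 and 1. The members shown, the `compute_similarity` name, and the token-overlap measure are illustrative; the real `Algorithm` enum and `_compute_similarity()` are not reproduced here.

```python
from difflib import SequenceMatcher
from enum import Enum


class Algorithm(Enum):  # stand-in; the real members may differ
    EXACT = "exact"
    RATIO = "ratio"          # hypothetical existing member
    TOKEN_SET = "token_set"  # step 1: the new value


def compute_similarity(a: str, b: str, algorithm: Algorithm) -> float:
    """Return a similarity score in the range 0.0 to 1.0."""
    if algorithm is Algorithm.EXACT:
        return 1.0 if a == b else 0.0
    if algorithm is Algorithm.RATIO:
        return SequenceMatcher(None, a, b).ratio()
    if algorithm is Algorithm.TOKEN_SET:  # step 2: the new case
        # Jaccard overlap of whitespace-separated tokens; word order is ignored.
        tokens_a, tokens_b = set(a.split()), set(b.split())
        if not tokens_a and not tokens_b:
            return 1.0
        return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)
    raise AssertionError(f"unhandled algorithm: {algorithm!r}")
```

Ending the dispatch with an `AssertionError` makes a forgotten branch fail loudly instead of silently scoring zero.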
### Add a survivor rule
1. Add value to `SurvivorRule` enum.
2. Add branch in `_select_survivor()`.
3. Add CLI mapping.
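The three steps might look roughly like this. Everything below is a stand-in: the enum values, the `LONGEST_NAME` rule, the dict-based rows, and `parse_survivor_flag()` are hypothetical and only show the shape of the change.

```python
from enum import Enum


class SurvivorRule(Enum):  # stand-in; real member names and values may differ
    FIRST = "first"
    LAST = "last"
    MOST_COMPLETE = "most-complete"
    LONGEST_NAME = "longest-name"  # step 1: hypothetical new rule


def select_survivor(group, rule):
    """Return the position in `group` (a list of dict rows) of the row to keep."""
    if rule is SurvivorRule.FIRST:
        return 0
    if rule is SurvivorRule.LAST:
        return len(group) - 1
    if rule is SurvivorRule.MOST_COMPLETE:
        filled = [sum(1 for v in row.values() if v not in (None, "")) for row in group]
        return filled.index(max(filled))  # ties keep the earliest row
    if rule is SurvivorRule.LONGEST_NAME:  # step 2: the new branch
        lengths = [len(row.get("name") or "") for row in group]
        return lengths.index(max(lengths))
    raise AssertionError(f"unhandled survivor rule: {rule!r}")


def parse_survivor_flag(text):
    # Step 3: a CLI layer can map the flag string onto the enum by value.
    return SurvivorRule(text)


group = [
    {"name": "Ann Lee", "phone": ""},
    {"name": "Ann", "phone": "555-0100"},
]
```

Here `LONGEST_NAME` keeps the first row while `MOST_COMPLETE` keeps the second, which is a quick way to check that a new rule really changes the outcome.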
### Add a fix + detector (analyzer/gate)
1. **Detector** in `core/analyze.py`: add `_detect_<thing>(df) -> list[Finding]`, hook into the main `analyze()` pipeline. Emit Finding with a unique `fix_action` id.
2. **Fix** in `core/fixes.py`:
```python
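@register("fix_id")
def my_fix(df, payload=None) -> tuple[pd.DataFrame, int]:
    out_df = df.copy()  # work on a copy of the input frame
    cells_changed = 0
    # ... apply the fix to out_df, counting each modified cell ...
    return out_df, cells_changed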
```
3. **Constant** in `core/analyze.py`: add `FIX_<NAME> = "fix_id"` so the detector and fix can reference it.
No other call sites change. Gate auto-discovers it via the registry.
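The auto-discovery works because registration is a side effect of the decorator. A minimal sketch of that pattern follows; it is an assumption about the general shape, not a copy of `core/fixes.py`. To stay dependency-free it operates on a list of dict rows rather than a DataFrame, and `FIX_STRIP_WHITESPACE` is a made-up fix id.

```python
from typing import Callable, Dict, List, Optional, Tuple

Rows = List[dict]
FixFn = Callable[[Rows, Optional[dict]], Tuple[Rows, int]]

_REGISTRY: Dict[str, FixFn] = {}


def register(fix_id: str):
    """Decorator that stores the fix under `fix_id` at import time."""
    def decorator(fn: FixFn) -> FixFn:
        if fix_id in _REGISTRY:
            raise ValueError(f"duplicate fix id: {fix_id!r}")
        _REGISTRY[fix_id] = fn
        return fn
    return decorator


def get_fix(fix_id: str) -> FixFn:
    return _REGISTRY[fix_id]


def available_actions() -> List[str]:
    return sorted(_REGISTRY)


FIX_STRIP_WHITESPACE = "strip_whitespace"  # hypothetical fix id


@register(FIX_STRIP_WHITESPACE)
def strip_whitespace(rows, payload=None):
    out, changed = [], 0
    for row in rows:
        new_row = {}
        for key, value in row.items():
            cleaned = value.strip() if isinstance(value, str) else value
            if isinstance(value, str) and cleaned != value:
                changed += 1
            new_row[key] = cleaned
        out.append(new_row)
    return out, changed


fixed, n_changed = get_fix(FIX_STRIP_WHITESPACE)([{"a": " x ", "b": "y"}])
```

A caller that only knows the `fix_action` id from a `Finding` can look up and run the fix without importing it by name.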
### i18n — language packs
The GUI's user-facing strings live in `src/i18n/packs/<code>.json`, keyed by ISO-639-1 code. English (`en.json`) is canonical; missing keys in other packs fall back to English, and missing keys in English fall back to the literal dotted key so a typo is visible rather than silent.
**Look up a string in code:**
```python
import streamlit as st

from src.i18n import t

st.button(t("upload.run_button"))
st.warning(t("gate.warning", name=filename))  # {name} interpolated via str.format
```
`t()` reads the active language from `st.session_state["ui_lang"]`. Outside a Streamlit run (tests, scripts) it falls back to English.
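The fallback chain described above (active pack, then English, then the literal key) can be sketched without Streamlit. The in-memory `PACKS` dict and the helper `_lookup()` are assumptions for illustration; the real `t()` loads JSON packs and reads the language from session state.

```python
from typing import Optional

# Stand-in for the parsed JSON packs; the French pack is deliberately incomplete.
PACKS = {
    "en": {"upload": {"run_button": "Run", "greeting": "Hello, {name}"}},
    "fr": {"upload": {"run_button": "Lancer"}},
}


def _lookup(pack: dict, dotted_key: str) -> Optional[str]:
    """Walk a nested dict along a dotted key; return None if any part is missing."""
    node = pack
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node if isinstance(node, str) else None


def t(key: str, lang: Optional[str] = None, **fmt) -> str:
    text = _lookup(PACKS.get(lang or "en", {}), key)
    if text is None:
        text = _lookup(PACKS["en"], key)  # fall back to the canonical pack
    if text is None:
        return key  # a typo shows up as the literal dotted key
    return text.format(**fmt) if fmt else text
```

`t("upload.greeting", lang="fr", name="Ana")` comes back in English because the French pack lacks that key, while `t("upload.typo")` returns the key itself.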
**Add a new language:**
1. Copy `src/i18n/packs/en.json` to `src/i18n/packs/<code>.json` and translate values in place. Keep the key tree identical.
2. Add a one-line entry to `LANGUAGES` in `src/i18n/__init__.py`: `{"code": "fr", "label": "Français"}`. The sidebar picker auto-renders.
3. Run `pytest tests/test_lang_packs.py` — the parity test fails until every key from `en.json` exists in the new pack (and orphan keys not in English are also flagged).
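A parity check of this kind usually comes down to flattening both key trees and comparing the sets. The sketch below shows the idea only; `flatten_keys()` and `parity_report()` are hypothetical names, and the actual assertions in `tests/test_lang_packs.py` may be organized differently.

```python
def flatten_keys(tree, prefix=""):
    """Return the set of dotted keys for every leaf in a nested dict."""
    keys = set()
    for name, value in tree.items():
        dotted = f"{prefix}.{name}" if prefix else name
        if isinstance(value, dict):
            keys |= flatten_keys(value, dotted)
        else:
            keys.add(dotted)
    return keys


def parity_report(canonical, other):
    """List keys missing from `other` and orphan keys absent from `canonical`."""
    canonical_keys = flatten_keys(canonical)
    other_keys = flatten_keys(other)
    return {
        "missing": sorted(canonical_keys - other_keys),
        "orphan": sorted(other_keys - canonical_keys),
    }


en = {"home": {"title": "Home"}, "upload": {"run_button": "Run"}}
fr = {"home": {"title": "Accueil", "subtitle": "Bienvenue"}}
report = parity_report(en, fr)
```

In this example the French pack is missing `upload.run_button` and carries the orphan `home.subtitle`, so both directions of the check would fail.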
**Add a new key:**
1. Add it to `en.json` first (canonical pack).
2. Add it to every other registered pack in the same commit. The parity test enforces this.
3. Use the dotted key at the call site: `t("section.subsection.key")` or `t("section.key", name=value)` for placeholder interpolation.
**Authoring rules:**
- Keys live under semantic sections (`home.*`, `upload.*`, `findings.*`, `tools.<id>.name`). Don't nest by language or by tool unless the string is genuinely tool-specific.
- Use `{named}` placeholders (not positional `{0}`) so translators see what's being interpolated.
- Strings can contain Streamlit markdown (`**bold**`) — pass through `st.markdown` / `st.caption` as usual.
- Do **not** put strings inside the farewell-overlay JS payload without going through `_js_html_safe()` in `src/gui/components/_legacy.py`; the helper escapes both the JS string terminator and HTML special chars. The test `TestFarewellEscape` pins that contract.
- The sidebar picker is mounted by `hide_streamlit_chrome()`, so every page that calls that helper automatically gets the picker. Pages that don't call it (rare) can call `render_language_selector()` directly.
### Add a format-standardizer field type
1. Add value to `FieldType` enum in `core/format_standardize.py`.
2. Add per-cell `standardize_<x>(value, *, …)` returning `(new_value, changed)`.
3. Add option fields to `StandardizeOptions` (with defaults that preserve existing behavior).
4. Wire into `_apply_field_type()` dispatcher (the `else` branch raises `AssertionError` — every enum value needs a branch).
5. Add validation entry in `StandardizeOptions.from_dict()` for any new enum-shaped option.
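Steps 1 to 4 could take roughly the following shape. The `ZIP` field type, the `zip_keep_plus4` option, and the simplified `apply_field_type()` are invented for this example and are not taken from `core/format_standardize.py`; only the `(new_value, changed)` return contract and the `AssertionError` in the final branch come from the recipe above.

```python
from dataclasses import dataclass
from enum import Enum


class FieldType(Enum):  # stand-in; the real members may differ
    PHONE = "phone"  # hypothetical existing member, no branch in this sketch
    ZIP = "zip"      # step 1: hypothetical new value


@dataclass
class StandardizeOptions:  # stand-in with a single option
    zip_keep_plus4: bool = True  # step 3: the default keeps the full value


def standardize_zip(value, *, keep_plus4=True):
    """Step 2: per-cell function returning (new_value, changed)."""
    digits = "".join(ch for ch in value if ch.isdigit())
    if len(digits) == 9:
        new = f"{digits[:5]}-{digits[5:]}" if keep_plus4 else digits[:5]
    elif len(digits) == 5:
        new = digits
    else:
        return value, False  # unrecognized input is left untouched
    return new, new != value


def apply_field_type(value, field_type, options):
    """Step 4: dispatcher; every enum value needs its own branch."""
    if field_type is FieldType.ZIP:
        return standardize_zip(value, keep_plus4=options.zip_keep_plus4)
    else:
        raise AssertionError(f"no branch for {field_type!r}")
```

Returning the `changed` flag from the per-cell function lets the caller count modified cells without comparing values a second time.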
## Errors
Use `core/errors.py` instead of raw `ValueError` / `OSError`:
| Pattern | Use |
|---------|-----|
| Bad arg, wrong type, missing column | `InputValidationError` |
| Bad config / options file | `ConfigError` |
| File parses but isn't what we expected | `FileFormatError` |
| File I/O failure (perms, missing, disk full) | `FileAccessError` |
| Internal invariant broken (unreachable branch) | `AssertionError` |
Helpers:
- `ensure_dataframe(value, function="my_func")` at every public entry that takes a df.
- `ensure_choice(value, name="mode", choices=[...])` at every entry that takes a literal.
- `wrap_file_read(path, "operation", exc)` / `wrap_file_write(...)` when wrapping `OSError`.
GUI / CLI handlers: use `format_for_user(exc, context="...")` to render.
All `DataToolsError` subclasses extend stdlib `ValueError` or `OSError` so existing handlers still catch them.
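One way to get that compatibility is multiple inheritance from the project root and a stdlib exception. The sketch below assumes that approach; the class bodies, the message format, and the `run()` helper are illustrative and may not match `core/errors.py`.

```python
class DataToolsError(Exception):
    """Root of the hierarchy (stand-in for the class in core/errors.py)."""


class InputValidationError(DataToolsError, ValueError):
    """Bad argument, wrong type, missing column."""


class FileAccessError(DataToolsError, OSError):
    """File I/O failure such as permissions or a missing file."""


def ensure_choice(value, *, name, choices):
    if value not in choices:
        raise InputValidationError(
            f"{name} must be one of {sorted(choices)}, got {value!r}"
        )
    return value


def run(mode):
    try:
        ensure_choice(mode, name="mode", choices=["exact", "fuzzy"])
    except ValueError as exc:  # a pre-existing, stdlib-style handler
        return f"rejected: {exc}"
    return "ok"
```

The `except ValueError` handler in `run()` never mentions the new hierarchy, yet it still catches `InputValidationError`, which is the behavior the sentence above relies on.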
## Tests
```bash
# All
pytest -q
# By module
pytest tests/test_dedup.py
# Slow / integration only (tests marked `slow`)
pytest -m slow
# Single test
pytest tests/test_dedup.py::TestExactMatch::test_basic
```
Test layout:
```
tests/
├── conftest.py # fixtures
├── test_dedup.py · test_normalizers.py · test_io.py · test_config.py
├── test_analyze.py · test_normalize.py · test_text_clean.py
├── test_format_standardize.py
├── test_format_standardize_corpus.py # 199-row buyer corpus
├── test_audit_fixes.py · test_errors.py · test_fixes_unit.py
├── test_corpus.py · test_encodings_corpus.py · test_fixtures_sweep.py
└── test_cli.py · test_cli_*.py · test_e2e.py · test_install.py
```
Fixture corpora: `test-cases/text-cleaner-corpus/` (21 files) · `test-cases/encodings-corpus/` (31 files) · `test-cases/format-cleaner-corpus/` (7 files + spec).
## Known limitations
- **Dedup is O(n²)** — no blocking. Works to ~50k rows. Future: partition by first letter / ZIP prefix.
- **Single-threaded** — could benefit from `multiprocessing`.
- **Memory-bound** — entire file loaded into pandas. Streaming reads exist but not integrated with dedup engine.
- **No multi-sheet dedup** — each Excel sheet processed independently.
- **Phonenumbers minimum-length** — international numbers without country codes fall back to digits-only.
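The blocking idea mentioned under the O(n²) limitation can be sketched as follows. This is a possible direction, not existing code: `candidate_pairs()`, `zip_prefix()`, and the `zip` column name are hypothetical, and rows are plain dicts to keep the example dependency-free.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(rows, block_key):
    """Yield index pairs, comparing only rows that share a block key."""
    blocks = defaultdict(list)
    for idx, row in enumerate(rows):
        blocks[block_key(row)].append(idx)
    for members in blocks.values():
        yield from combinations(members, 2)


def zip_prefix(row, width=3):
    # Rows with a blank ZIP all land in the "" block.
    return (row.get("zip") or "")[:width]


rows = [
    {"zip": "90210"},
    {"zip": "90211"},
    {"zip": "10001"},
    {"zip": "10002"},
    {"zip": "60601"},
]
pairs = list(candidate_pairs(rows, zip_prefix))
all_pairs = list(combinations(range(len(rows)), 2))
```

Five rows produce ten pairs under full comparison but only two with ZIP-prefix blocks. The trade-off is recall: duplicates whose block keys differ, for example because of a mistyped ZIP, are never compared.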