Two coupled changes:
1. Lite tier
- New Tier.LITE in src/license/schema.py.
- FEATURES_BY_TIER[Tier.LITE] = {Deduplicator, Text Cleaner,
Format Standardizer}. The three universally-useful tools that
cover the most common bookkeeping / RevOps / Klaviyo prep
workflows. Other six tools require Core.
- i18n: license.tier_lite, license.feature_locked_title,
license.feature_locked_body, license.upgrade_link,
license.status_locked (en + es).
- Per-tool feature gate at every GUI tool page
(require_feature_or_render_upgrade) and every tool CLI
(guard(feature=...)). A locked tool renders an upgrade
prompt + Manage-license button (GUI) or exits with code 2
(CLI).
- Home grid: tool cards the user's tier doesn't unlock get a
red 🔒 Locked badge in place of green Ready.
2. Trial removed
- Activation form's "Start 1-year trial" button removed.
- license_cli's `trial` subcommand removed.
- activation.trial_button / activation.trial_help i18n keys
dropped (pack parity test stays green).
- Tier.TRIAL stays in the enum (back-compat with any field-
tested trial licenses); LicenseManager._mint stays internal
for tests and the seller's key generator.
- Decision logged in DECISIONS §9b: a 1-year all-features
trial undercuts paid Lite; paid-only keeps tier economics
clean.
Tests (+29 net): +17 Lite-tier unit/guard tests + 13 Lite-tier
GUI tests + 1 trial-absent assertion - 2 trial CLI tests - 1
trial GUI button test. Total: 1995 → 2024.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
317 lines
14 KiB
Markdown
# Developer Guide

Architecture, data flow, extension points.

## Architecture

```
CLI (src/cli*.py)     GUI (src/gui/app.py + pages/)
        │                       │
        └──────────┐ ┌──────────┘
                   ▼ ▼
           ┌────────────────┐
           │   src/core/    │
           └────────────────┘
```

**Core/UI rule**: business logic in `core/` only. CLI + GUI translate user input → core call → display result.

## Module map

| Module | Public surface |
|--------|----------------|
| `i18n` | `t(key, lang=None, **fmt)`, `current_language()`, `set_language()`, `render_language_selector()`, `LANGUAGES` |
| `core.dedup` | `deduplicate()`, `MatchStrategy`, `ColumnMatchStrategy`, `Algorithm`, `SurvivorRule`, `DeduplicationResult`, `MatchResult`, `build_default_strategies()` |
| `core.normalizers` | `normalize_email/phone/name/address/string`, `NormalizerType`, `get_normalizer()` |
| `core.io` | `read_file()`, `write_file()`, `list_sheets()`, `detect_encoding/delimiter/header_row`, `repair_bytes()` |
| `core.config` | `DeduplicationConfig.from_file/to_file/to_strategies/to_survivor_rule` |
| `core.analyze` | `analyze()`, `Finding`, `findings_by_tool()`, `_NULL_LIKE` |
| `core.fixes` | `@register("fix_id")` decorator, `get_fix()`, `available_actions()` |
| `core.normalize` | `auto_fix()`, `apply_decisions()`, `NormalizationResult`, `is_normalized()` |
| `core.text_clean` | `clean_dataframe()`, `CleanOptions`, `CleanResult`, `smart_title_case()` |
| `core.format_standardize` | `standardize_dataframe()`, `StandardizeOptions`, `StandardizeResult`, `FieldType`, per-cell `standardize_*()` |
| `core.errors` | `DataToolsError` hierarchy, `ensure_dataframe()`, `ensure_choice()`, `wrap_file_read/write()`, `format_for_user()` |
| `core._constants` | `US_STATE_NAMES`, `US_STATE_CODES`, `USPS_EXPANSIONS`, `USPS_COMPRESSIONS` |

## Data flow — Deduplicator

```
read_file()                   # auto-detect encoding, delimiter, header
    ▼ DataFrame
build_default_strategies()    # if no explicit strategies
    ▼                         # strong keys (email, phone) → standalone OR
                              # weak keys (name, address) → AND with strong
_apply_normalizations()       # add _norm_* shadow columns
    ▼
_find_match_groups()          # O(n²) pair compare, OR strategies, union-find
    ▼
[review_callback()]           # optional interactive review
    ▼
_select_survivor()            # per group: first/last/most-complete/most-recent
    ▼
[_merge_group()]              # optional: fill blanks from losers
    ▼
DeduplicationResult           # deduplicated_df, removed_df, match_groups, log
```

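
The pair-compare plus union-find step in the flow above can be pictured with a small standalone sketch. This is not the body of `_find_match_groups()`; the function name, the row shape (plain dicts), and the `is_match` predicate are all assumptions made for illustration.

```python
from itertools import combinations
from typing import Callable

Row = dict


def find_match_groups(rows: list[Row], is_match: Callable[[Row, Row], bool]) -> list[list[int]]:
    """Group row indices transitively: if A~B and B~C, all three share a group."""
    parent = list(range(len(rows)))

    def find(i: int) -> int:
        # Path halving keeps the trees shallow.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(a: int, b: int) -> None:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[max(root_a, root_b)] = min(root_a, root_b)

    # O(n^2): every pair is compared once.
    for i, j in combinations(range(len(rows)), 2):
        if is_match(rows[i], rows[j]):
            union(i, j)

    groups: dict[int, list[int]] = {}
    for i in range(len(rows)):
        groups.setdefault(find(i), []).append(i)
    # Singletons are not duplicates, so they are dropped.
    return [members for members in groups.values() if len(members) > 1]


def email_or_phone(a: Row, b: Row) -> bool:
    # Two "strategies" combined with OR, as in the flow diagram.
    return a["email"] == b["email"] or a["phone"] == b["phone"]


rows = [
    {"email": "a@x.com", "phone": "111"},
    {"email": "a@x.com", "phone": "222"},
    {"email": "b@x.com", "phone": "222"},
    {"email": "c@x.com", "phone": "333"},
]
groups = find_match_groups(rows, email_or_phone)
```

Rows 0 and 2 never match each other directly, but both match row 1, so union-find places all three in one group.
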
## Extension recipes

### Add a normalizer

1. Add function to `core/normalizers.py`:
   ```python
   import re
   from typing import Optional

   def normalize_company(value: Optional[str]) -> str:
       if not value or not isinstance(value, str):
           return ""
       name = value.strip().casefold()
       # Strip trailing legal suffixes such as "inc." or "llc".
       for sfx in ("inc", "llc", "corp", "ltd", "co"):
           name = re.sub(rf"\b{sfx}\.?\s*$", "", name).strip()
       return name
   ```
2. Register: add `COMPANY = "company"` to `NormalizerType` + entry in `_NORMALIZER_MAP`.
3. Auto-detect (optional): add a `_COLUMN_TYPE_PATTERNS` row in `core/dedup.py`.

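
Steps 2 and 3 might look roughly like the sketch below. The names `NormalizerType`, `_NORMALIZER_MAP`, `get_normalizer()` and `_COLUMN_TYPE_PATTERNS` come from this guide, but their shapes here (an enum-to-callable dict, a list of regex and type pairs) are assumptions, and `detect_type()` is a hypothetical helper added only to exercise the pattern table.

```python
import re
from enum import Enum
from typing import Callable, Optional


def normalize_company(value: Optional[str]) -> str:
    if not value or not isinstance(value, str):
        return ""
    name = value.strip().casefold()
    for sfx in ("inc", "llc", "corp", "ltd", "co"):
        name = re.sub(rf"\b{sfx}\.?\s*$", "", name).strip()
    return name


class NormalizerType(str, Enum):
    EMAIL = "email"      # stand-in for the existing members
    COMPANY = "company"  # step 2: new member


# Assumed shape: enum member -> normalizer callable.
_NORMALIZER_MAP: dict[NormalizerType, Callable[[Optional[str]], str]] = {
    NormalizerType.EMAIL: lambda v: (v or "").strip().casefold(),
    NormalizerType.COMPANY: normalize_company,  # step 2: new entry
}


def get_normalizer(kind: NormalizerType) -> Callable[[Optional[str]], str]:
    return _NORMALIZER_MAP[kind]


# Assumed shape: (column-name regex, normalizer type) rows, first match wins.
_COLUMN_TYPE_PATTERNS = [
    (re.compile(r"e[-_ ]?mail", re.I), NormalizerType.EMAIL),
    (re.compile(r"company|organi[sz]ation|employer", re.I), NormalizerType.COMPANY),  # step 3
]


def detect_type(column: str) -> Optional[NormalizerType]:
    """Hypothetical helper: guess a normalizer from a column header."""
    for pattern, kind in _COLUMN_TYPE_PATTERNS:
        if pattern.search(column):
            return kind
    return None
```
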
### Add a fuzzy algorithm

1. Add value to `Algorithm` enum in `core/dedup.py`.
2. Add case in `_compute_similarity()`.
3. Document the value in CLI help text.

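
As a rough illustration of steps 1 and 2, the sketch below adds a token-set algorithm to a stand-in enum and dispatcher. Only `Algorithm.EXACT` and the 0 to 100 score scale are taken from this guide; the `RATIO` and `TOKEN_SET` members and the function signature are assumptions, and `difflib` is used here only because it ships with Python.

```python
from difflib import SequenceMatcher
from enum import Enum


class Algorithm(str, Enum):
    EXACT = "exact"
    RATIO = "ratio"          # hypothetical existing member
    TOKEN_SET = "token_set"  # step 1: hypothetical new member


def compute_similarity(a: str, b: str, algorithm: Algorithm) -> float:
    """Return a similarity score on a 0-100 scale."""
    if algorithm is Algorithm.EXACT:
        return 100.0 if a == b else 0.0
    if algorithm is Algorithm.RATIO:
        return 100.0 * SequenceMatcher(None, a, b).ratio()
    if algorithm is Algorithm.TOKEN_SET:  # step 2: new case
        tokens_a, tokens_b = set(a.split()), set(b.split())
        if not tokens_a and not tokens_b:
            return 100.0
        # Jaccard overlap of whitespace-separated tokens, ignoring order.
        return 100.0 * len(tokens_a & tokens_b) / len(tokens_a | tokens_b)
    raise AssertionError(f"unhandled algorithm: {algorithm!r}")
```
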
### Add a survivor rule

1. Add value to `SurvivorRule` enum.
2. Add branch in `_select_survivor()`.
3. Add CLI mapping.

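
A survivor rule is essentially "pick one record per match group". The sketch below shows one way such a branch could look, using plain dicts instead of DataFrame rows; the enum values and the function signature are assumptions, not the actual `_select_survivor()` interface.

```python
from enum import Enum


class SurvivorRule(str, Enum):
    FIRST = "first"
    LAST = "last"
    MOST_COMPLETE = "most-complete"


def select_survivor(group: list[dict], rule: SurvivorRule) -> int:
    """Return the position within `group` of the record to keep."""
    if rule is SurvivorRule.FIRST:
        return 0
    if rule is SurvivorRule.LAST:
        return len(group) - 1
    if rule is SurvivorRule.MOST_COMPLETE:
        def filled(record: dict) -> int:
            # Count cells that are neither None nor empty.
            return sum(1 for value in record.values() if value not in (None, ""))
        # max() keeps the earliest record on ties.
        return max(range(len(group)), key=lambda i: filled(group[i]))
    raise AssertionError(f"unhandled survivor rule: {rule!r}")


group = [
    {"name": "Ann", "phone": ""},
    {"name": "Ann", "phone": "555-0100"},
]
keep = select_survivor(group, SurvivorRule.MOST_COMPLETE)
```
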
### Add a fix + detector (analyzer/gate)

1. **Detector** in `core/analyze.py`: add `_detect_<thing>(df) -> list[Finding]`, hook into the main `analyze()` pipeline. Emit Finding with a unique `fix_action` id.
2. **Fix** in `core/fixes.py`:
   ```python
   @register("fix_id")
   def my_fix(df, payload=None) -> tuple[pd.DataFrame, int]:
       # ...
       return out_df, cells_changed
   ```
3. **Constant** in `core/analyze.py`: add `FIX_<NAME> = "fix_id"` so the detector and fix can reference it.

No other call sites change. Gate auto-discovers it via the registry.

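
The registry behind `@register` can be pictured with the standalone sketch below. It assumes a module-level dict keyed by fix id and uses a list of dicts as a stand-in for `pd.DataFrame` so it runs without pandas; the fix id `strip_whitespace` is hypothetical.

```python
from typing import Callable, Optional

Table = list[dict]  # stand-in for pd.DataFrame
FixFn = Callable[..., tuple[Table, int]]

_FIXES: dict[str, FixFn] = {}


def register(fix_id: str) -> Callable[[FixFn], FixFn]:
    """Decorator that records a fix function under its id."""
    def decorator(fn: FixFn) -> FixFn:
        if fix_id in _FIXES:
            raise ValueError(f"duplicate fix id: {fix_id}")
        _FIXES[fix_id] = fn
        return fn
    return decorator


def get_fix(fix_id: str) -> FixFn:
    return _FIXES[fix_id]


def available_actions() -> list[str]:
    return sorted(_FIXES)


FIX_STRIP_WHITESPACE = "strip_whitespace"  # hypothetical id shared by detector and fix


@register(FIX_STRIP_WHITESPACE)
def strip_whitespace(rows: Table, payload: Optional[dict] = None) -> tuple[Table, int]:
    out: Table = []
    cells_changed = 0
    for row in rows:
        new_row = {}
        for key, value in row.items():
            cleaned = value.strip() if isinstance(value, str) else value
            if cleaned != value:
                cells_changed += 1
            new_row[key] = cleaned
        out.append(new_row)
    return out, cells_changed


# A caller only needs the id carried by a finding's fix_action.
fixed, changed = get_fix(FIX_STRIP_WHITESPACE)([{"name": " Ann "}, {"name": "Bob"}])
```

Because lookup goes through the id alone, a new fix becomes reachable as soon as its module is imported, which is consistent with the "no other call sites change" note above.
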
### i18n — language packs

The GUI's user-facing strings live in `src/i18n/packs/<code>.json`, keyed by ISO-639-1 code. English (`en.json`) is canonical; missing keys in other packs fall back to English, and missing keys in English fall back to the literal dotted key so a typo is visible rather than silent.

**Look up a string in code:**
```python
import streamlit as st

from src.i18n import t

st.button(t("upload.run_button"))
st.warning(t("gate.warning", name=filename))  # {name} interpolated via str.format
```

`t()` reads the active language from `st.session_state["ui_lang"]`. Outside a Streamlit run (tests, scripts) it falls back to English.

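
The fallback chain described above (active pack, then English, then the literal key) can be sketched as follows. The in-memory `PACKS` dict and its strings are made up for the example, and this `t()` takes the language as a plain argument instead of reading Streamlit session state.

```python
from typing import Any, Optional

# Hypothetical pack contents; the real packs are JSON files on disk.
PACKS: dict[str, dict[str, Any]] = {
    "en": {
        "upload": {"run_button": "Run"},
        "gate": {"warning": "{name} needs cleanup first"},
    },
    "es": {
        "upload": {"run_button": "Ejecutar"},
    },
}


def _lookup(pack: dict[str, Any], dotted_key: str) -> Optional[str]:
    """Walk a nested pack along a dotted key; None if any segment is missing."""
    node: Any = pack
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node if isinstance(node, str) else None


def t(key: str, lang: str = "en", **fmt: Any) -> str:
    for code in (lang, "en"):  # active language first, then canonical English
        template = _lookup(PACKS.get(code, {}), key)
        if template is not None:
            return template.format(**fmt) if fmt else template
    return key  # a typo shows up as the raw dotted key
```
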
**Add a new language:**
1. Copy `src/i18n/packs/en.json` to `src/i18n/packs/<code>.json` and translate values in place. Keep the key tree identical.
2. Add a one-line entry to `LANGUAGES` in `src/i18n/__init__.py`: `{"code": "fr", "label": "Français"}`. The sidebar picker auto-renders.
3. Run `pytest tests/test_lang_packs.py` — the parity test fails until every key from `en.json` exists in the new pack (and orphan keys not in English are also flagged).

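
A parity check of the kind step 3 describes can be approximated by flattening both key trees and diffing the sets. The sketch below is an illustration under that assumption, not the contents of `tests/test_lang_packs.py`; the function names and sample packs are hypothetical.

```python
def flatten_keys(pack: dict, prefix: str = "") -> set[str]:
    """Collect dotted keys for every leaf in a nested pack."""
    keys: set[str] = set()
    for name, value in pack.items():
        dotted = f"{prefix}{name}"
        if isinstance(value, dict):
            keys |= flatten_keys(value, dotted + ".")
        else:
            keys.add(dotted)
    return keys


def parity_report(canonical: dict, other: dict) -> tuple[list[str], list[str]]:
    """Return (keys missing from `other`, orphan keys absent from `canonical`)."""
    base, candidate = flatten_keys(canonical), flatten_keys(other)
    return sorted(base - candidate), sorted(candidate - base)


en = {"home": {"title": "Home"}, "upload": {"run_button": "Run"}}
fr = {"home": {"title": "Accueil"}, "upload": {"run": "Lancer"}}
missing, orphans = parity_report(en, fr)
```
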
**Add a new key:**
1. Add it to `en.json` first (canonical pack).
2. Add it to every other registered pack in the same commit. The parity test enforces this.
3. Use the dotted key at the call site: `t("section.subsection.key")` or `t("section.key", name=value)` for placeholder interpolation.

**Authoring rules:**
- Keys live under semantic sections (`home.*`, `upload.*`, `findings.*`, `tools.<id>.name`). Don't nest by language or by tool unless the string is genuinely tool-specific.
- Use `{named}` placeholders (not positional `{0}`) so translators see what's being interpolated.
- Strings can contain Streamlit markdown (`**bold**`) — pass through `st.markdown` / `st.caption` as usual.
- Do **not** put strings inside the farewell-overlay JS payload without going through `_js_html_safe()` in `src/gui/components/_legacy.py`; the helper escapes both the JS string terminator and HTML special chars. The test `TestFarewellEscape` pins that contract.
- The sidebar picker is mounted by `hide_streamlit_chrome()`, so every page that calls that helper automatically gets the picker. Pages that don't call it (rare) can call `render_language_selector()` directly.

### Licensing

The license layer lives at ``src/license/``. The public API:

```python
from src.license import (
    get_manager, require_feature, current_state,
    FeatureFlag, Tier, License,
)

mgr = get_manager()
if not mgr.is_valid():
    raise RuntimeError("Not licensed")
require_feature(FeatureFlag.DEDUPLICATOR)
```

**Storage**: ``~/.datatools/license.json`` (override via
``DATATOOLS_LICENSE_PATH``). Signed locally with HMAC-SHA256 using a
secret read from ``DATATOOLS_LICENSE_SECRET`` (replaced at build time;
the in-repo default is a development placeholder).

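
For readers unfamiliar with HMAC signing, the sketch below shows the general technique with the standard library. The payload fields, the canonical-JSON encoding, and the hex signature format are assumptions for illustration; the guide does not specify how `license.json` is actually serialized.

```python
import hashlib
import hmac
import json


def sign_payload(payload: dict, secret: bytes) -> str:
    # Canonical JSON so the same payload always produces the same bytes.
    message = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_payload(payload: dict, signature: str, secret: bytes) -> bool:
    expected = sign_payload(payload, secret)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected, signature)


secret = b"dev-placeholder"  # stand-in for DATATOOLS_LICENSE_SECRET
license_data = {"name": "Jane Doe", "email": "jane@example.com", "tier": "lite"}
signature = sign_payload(license_data, secret)

# Editing the stored tier by hand invalidates the signature.
tampered = {**license_data, "tier": "core"}
```
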
**Dev bypass**: ``DATATOOLS_DEV_MODE=1`` short-circuits every check.
The test suite's autouse fixture sets this so existing tests don't
need their own license fixtures. Tests that need the real check
explicitly use ``isolated_license_path`` /
``activated_license_manager`` / ``unactivated_license_manager``.

**Adding a feature flag**:

1. Add the enum value to ``FeatureFlag`` in ``src/license/schema.py``.
2. Add it to the relevant tier's set in
   ``FEATURES_BY_TIER`` in ``src/license/features.py``.
3. Gate at the call site: ``require_feature(FeatureFlag.YOUR_FLAG)``.

**Adding a new tier**:

1. Add the enum value to ``Tier``.
2. Add a row to ``FEATURES_BY_TIER`` listing the unlocked flags.
3. Add ``license.tier_<name>`` translation keys to every i18n pack.
4. The activation flow, sidebar status badge, feature gate, and home
   grid lock badge all pick up the new tier automatically.

**Worked example — the Lite tier**:

```python
# src/license/schema.py
class Tier(str, Enum):
    LITE = "lite"  # new
    CORE = "core"
    ...

# src/license/features.py
FEATURES_BY_TIER = {
    # ...
    Tier.LITE: frozenset({
        FeatureFlag.DEDUPLICATOR,
        FeatureFlag.TEXT_CLEANER,
        FeatureFlag.FORMAT_STANDARDIZER,
    }),
    Tier.CORE: _all(),
    # ...
}
```

Then add ``license.tier_lite`` to ``en.json`` and ``es.json``. Nothing
else is needed: the existing ``require_feature_or_render_upgrade`` (GUI)
and ``guard(feature=...)`` (CLI) calls in every tool page and tool CLI
route a Lite user to the upgrade prompt for any tool the tier doesn't
unlock. The home grid's lock badge is driven by the same feature
lookup.

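
The lookup that both the gate and the lock badge rely on can be sketched in a few lines. Everything below is a simplified stand-in: the real `require_feature()` takes only the flag and reads the tier from the license manager, while `SOME_CORE_TOOL`, `FeatureLockedError`, `is_unlocked()` and `badge()` are hypothetical names used for the example.

```python
from enum import Enum


class Tier(str, Enum):
    LITE = "lite"
    CORE = "core"


class FeatureFlag(str, Enum):
    DEDUPLICATOR = "deduplicator"
    TEXT_CLEANER = "text_cleaner"
    FORMAT_STANDARDIZER = "format_standardizer"
    SOME_CORE_TOOL = "some_core_tool"  # hypothetical Core-only flag


FEATURES_BY_TIER = {
    Tier.LITE: frozenset({
        FeatureFlag.DEDUPLICATOR,
        FeatureFlag.TEXT_CLEANER,
        FeatureFlag.FORMAT_STANDARDIZER,
    }),
    Tier.CORE: frozenset(FeatureFlag),  # every flag
}


class FeatureLockedError(PermissionError):
    """Hypothetical error raised when the tier does not unlock a feature."""


def is_unlocked(flag: FeatureFlag, tier: Tier) -> bool:
    return flag in FEATURES_BY_TIER[tier]


def require_feature(flag: FeatureFlag, tier: Tier) -> None:
    if not is_unlocked(flag, tier):
        raise FeatureLockedError(f"{flag.value} is not included in the {tier.value} tier")


def badge(flag: FeatureFlag, tier: Tier) -> str:
    # Same lookup as the gate, so the badge and the gate cannot disagree.
    return "Ready" if is_unlocked(flag, tier) else "Locked"
```
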
**Minting a license** (creator-only):

```bash
DATATOOLS_LICENSE_SECRET=<shipping-secret> \
  python scripts/generate_license.py \
    --name "Jane Doe" --email jane@example.com \
    --tier core --years 1
```

The script prints a ``DTLIC1:`` blob to stdout — deliver this in the
Gumroad / purchase email. The buyer pastes it into the activation
page or runs ``python -m src.license_cli activate <blob> --name ...``.

### Add a format-standardizer field type

1. Add value to `FieldType` enum in `core/format_standardize.py`.
2. Add per-cell `standardize_<x>(value, *, …)` returning `(new_value, changed)`.
3. Add option fields to `StandardizeOptions` (with defaults that preserve existing behavior).
4. Wire into `_apply_field_type()` dispatcher (the `else` branch raises `AssertionError` — every enum value needs a branch).
5. Add validation entry in `StandardizeOptions.from_dict()` for any new enum-shaped option.

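
The five steps can be illustrated with a cut-down sketch. `FieldType`, `StandardizeOptions` and the `(new_value, changed)` return contract come from this guide; the `PERCENT` member, the `percent_decimals` option, and the simplified dispatcher signature are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class FieldType(str, Enum):
    PHONE = "phone"      # stand-in for an existing member
    PERCENT = "percent"  # step 1: hypothetical new member


@dataclass
class StandardizeOptions:
    percent_decimals: int = 1  # step 3: hypothetical option with a default


def standardize_phone(value: str) -> tuple[str, bool]:
    digits = "".join(ch for ch in value if ch.isdigit())
    return digits, digits != value


def standardize_percent(value: str, *, decimals: int = 1) -> tuple[str, bool]:
    """Step 2: per-cell function returning (new_value, changed)."""
    text = value.strip().rstrip("%").strip()
    try:
        number = float(text)
    except ValueError:
        return value, False  # leave unparseable cells untouched
    new_value = f"{number:.{decimals}f}%"
    return new_value, new_value != value


def apply_field_type(value: str, field_type: FieldType, options: StandardizeOptions) -> tuple[str, bool]:
    """Step 4: dispatcher with an exhaustive branch per enum value."""
    if field_type is FieldType.PHONE:
        return standardize_phone(value)
    elif field_type is FieldType.PERCENT:
        return standardize_percent(value, decimals=options.percent_decimals)
    else:
        raise AssertionError(f"unhandled FieldType: {field_type!r}")
```

Raising `AssertionError` in the final branch means a forgotten enum value fails loudly the first time it is dispatched instead of passing cells through unchanged.
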
## Errors

Use `core/errors.py` instead of raw `ValueError` / `OSError`:

| Pattern | Use |
|---------|-----|
| Bad arg, wrong type, missing column | `InputValidationError` |
| Bad config / options file | `ConfigError` |
| File parses but isn't what we expected | `FileFormatError` |
| File I/O failure (perms, missing, disk full) | `FileAccessError` |
| Internal invariant broken (unreachable branch) | `AssertionError` |

Helpers:
- `ensure_dataframe(value, function="my_func")` at every public entry that takes a df.
- `ensure_choice(value, name="mode", choices=[...])` at every entry that takes a literal.
- `wrap_file_read(path, "operation", exc)` / `wrap_file_write(...)` when wrapping `OSError`.

GUI / CLI handlers: use `format_for_user(exc, context="...")` to render.

All `DataToolsError` subclasses extend stdlib `ValueError` or `OSError` so existing handlers still catch them.

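
The backward-compatibility point can be seen in a minimal sketch of such a hierarchy. The class names come from the table above, but which stdlib base each one extends, and the bodies of `ensure_choice()` and `format_for_user()`, are assumptions made for the example.

```python
class DataToolsError(Exception):
    """Base class for the tool's own errors."""


class InputValidationError(DataToolsError, ValueError):
    pass


class ConfigError(DataToolsError, ValueError):
    pass


class FileFormatError(DataToolsError, ValueError):
    pass


class FileAccessError(DataToolsError, OSError):
    pass


def ensure_choice(value, *, name, choices):
    """Assumed behavior: validate a literal argument and return it unchanged."""
    if value not in choices:
        raise InputValidationError(f"{name} must be one of {sorted(choices)}, got {value!r}")
    return value


def format_for_user(exc, *, context):
    """Assumed behavior: prefix the message with where it happened."""
    return f"{context}: {exc}"


# A handler written against ValueError still catches the new error type.
try:
    ensure_choice("newest", name="survivor", choices=["first", "last"])
except ValueError as exc:
    message = format_for_user(exc, context="Deduplicator")
```
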
## Tests

```bash
# All (core + CLI + GUI)
pytest -q
# Quick loop — skip the GUI layer
pytest -q -m 'not gui'
# Only the GUI tests
pytest -q -m gui
# By module
pytest tests/test_dedup.py
# Only the slow / integration tests
pytest -m slow
# Single test
pytest tests/test_dedup.py::TestExactMatch::test_basic
```

Test layout:
```
tests/
├── conftest.py                         # core/CLI fixtures
├── test_dedup.py · test_normalizers.py · test_io.py · test_config.py
├── test_analyze.py · test_normalize.py · test_text_clean.py
├── test_format_standardize.py
├── test_format_standardize_corpus.py   # 199-row buyer corpus
├── test_audit_fixes.py · test_errors.py · test_fixes_unit.py
├── test_corpus.py · test_encodings_corpus.py · test_fixtures_sweep.py
├── test_cli.py · test_cli_*.py · test_e2e.py · test_install.py
├── test_perf_regressions.py            # shape pins for the perf wins
└── gui/                                # Streamlit AppTest-driven tests
    ├── conftest.py                     # AppTest fixtures + helpers
    ├── _findings_panel_harness.py      # isolated component test page
    ├── test_smoke.py                   # every page renders in EN + ES
    ├── test_chrome.py                  # language selector, hide_chrome
    ├── test_gate.py                    # require_normalization_gate
    ├── test_workflows.py               # happy path per Ready tool
    ├── test_dedup_review.py            # match-group card interactions
    ├── test_advanced_panels.py         # config_panel widgets
    ├── test_errors.py                  # malformed-upload error paths
    └── test_findings_panel.py          # analyzer findings rendering
```

### GUI test layer

GUI tests drive pages with `streamlit.testing.v1.AppTest`:
in-process, no browser, no display. They pre-populate
`st.session_state` with stashed-upload bytes (via the
`stash_upload()` helper in `tests/gui/conftest.py`) and then either
click buttons via `app.button[i].click().run()` or assert on
`session_state` after the run.

The `gui` marker is registered in `pytest.ini`. A plain `pytest` run
includes the GUI tests; `pytest -m 'not gui'` skips them for a faster
core-only loop. Coming-Soon stubs are pinned by the smoke tests so a
regression (import error, missing widget) shows up immediately.

Fixture corpora: `test-cases/text-cleaner-corpus/` (21 files) · `test-cases/encodings-corpus/` (31 files) · `test-cases/format-cleaner-corpus/` (7 files + spec).

## Known limitations

- **Dedup pair-compare is O(n²)** for fuzzy strategies. Exact-only
  strategies (every column uses `Algorithm.EXACT` at threshold 100)
  route through an O(n) groupby fast path automatically, with no API
  change. Fuzzy strategies can opt into prefix blocking via
  `deduplicate(..., blocking_columns=[...], blocking_prefix_len=1)`,
  which partitions pairs by a cheap key (trades recall for speed).
- **Threading is opt-in for format_standardize**:
  `StandardizeOptions.parallel_columns > 1` uses a thread pool.
  On CPython 3.12 the GIL keeps the net effect roughly neutral; the
  scaffolding is in place for free-threaded Python 3.13+.
- **Memory-bound**: the entire file is loaded into pandas. Streaming
  reads exist but are not integrated with the dedup engine.
- **No multi-sheet dedup**: each Excel sheet is processed independently.
- **Phonenumbers minimum-length**: international numbers without
  country codes fall back to digits-only.

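
Prefix blocking, as mentioned in the first limitation, can be sketched independently of the dedup engine. The function below is a hypothetical illustration of the idea behind `blocking_columns` and `blocking_prefix_len`, using plain dicts; it is not the engine's code.

```python
from collections import defaultdict
from itertools import combinations
from typing import Iterator


def blocked_pairs(
    rows: list[dict], blocking_columns: list[str], prefix_len: int = 1
) -> Iterator[tuple[int, int]]:
    """Yield only the index pairs whose rows share a blocking key."""
    blocks: dict[tuple, list[int]] = defaultdict(list)
    for index, row in enumerate(rows):
        # Cheap key: case-folded prefix of each blocking column.
        key = tuple(str(row.get(col, ""))[:prefix_len].casefold() for col in blocking_columns)
        blocks[key].append(index)
    for members in blocks.values():
        yield from combinations(members, 2)


rows = [
    {"name": "Smith, John"},
    {"name": "Smyth, Jon"},
    {"name": "Jones, Ann"},
    {"name": "smith j"},
]
pairs = sorted(blocked_pairs(rows, ["name"]))
```

Here only 3 of the 6 possible pairs are compared. The recall cost is that records whose keys differ in the first character (for example `Kate` and `Cate`) are never compared at all.
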