Michael 60969c0770 feat(pdf): UI rework — Auto-detect is the default build flow
Pulls the user's primary mental model away from "draw column
boundaries" toward "tell me what shape your amounts have, see
detected rows, save." The visual picker that wasn't working for
multi-statement workflows is reachable but no longer the
default.

**Build mode header** now has a mode radio:

- "Auto-detect (recommended)" — row_heuristic. Tabs: Amount
  layout · Filters & date · Save. Three small forms; no
  coordinate UI anywhere. The Amount-layout tab's dropdown picks
  one of single / txn+balance / debit+credit / debit+credit+balance
  and auto-derives the min/max amount-count range (overridable
  under an expander).
- "Visual columns (advanced)" — column_visual. Five tabs (the
  original Visual picker / Pages & table / Columns / Parsing /
  Save). A yellow warning panel up top reminds the user that
  column-x templates only work when statement layout is stable.

Switching modes triggers a rerun so the right tab set renders
immediately. The template object preserves both mode's config
trees side-by-side so a user can flip between them without
losing work.

**Live preview** below the form runs ``apply_template`` against
the cached sample pages (already cached in session_state so this
re-renders cheaply on every form edit). The "no rows yet"
message is mode-aware — points users at the right tuning knobs
for whichever mode they're in. The preview caption notes which
mode produced the rows so the user can correlate decisions to
output.

The visual picker bug the user reported — "a single box stays in
the same location regardless of page" — is sidestepped rather
than fixed: in row_heuristic mode there's no canvas to confuse,
and for the rare column_visual user the canvas is still
imperfect but no longer their first interaction with the tool.
Cleaning up the column_visual canvas state bugs is a separate
follow-up if real users still hit the Advanced mode.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 23:46:27 +00:00

🌐 Language: English · Español

DataTools

Local CSV / Excel cleaning. CLI + browser GUI, no cloud, no install ceremony. GUI ships with English and Spanish language packs.

Tools

# Tool Status
01 Find Duplicates — exact + fuzzy match, 5 normalizers, survivor rules, audit Ready
02 Clean Text — whitespace, smart chars, BOM, line endings, case ops Ready
03 Standardize Formats — dates, phones, emails, addresses, names, currencies, booleans Ready
04 Fix Missing Values — disguised-null detection, profile, mean/median/mode/ffill/bfill/interpolate, drop strategies Ready
05 Map Columns — fuzzy auto-rename, target schema with type coercion, required fields with defaults, drop/reorder Ready
06 Find Unusual Values Coming Soon
07 Combine Files Coming Soon
08 Quality Check Coming Soon
09 Automated Workflows — chain tools with recommended (not forced) order, save/load JSON, automate weekly cleanups Ready

Download (non-technical users)

Pre-built installers — no Python required:

Platform Download First-launch note
macOS DataTools-X.Y.Z-mac.dmg Drag DataTools.app into /Applications, then double-click.
Windows DataTools-X.Y.Z-win-setup.exe Run the installer; launches from Start Menu.
Linux DataTools-X.Y.Z-linux-x86_64.AppImage chmod +x the file, then double-click.

Latest release: see GitHub Releases (or the Gumroad listing). The installers are ~150200 MB; the launcher boots a local server at http://127.0.0.1:8501 and opens your browser. Nothing is sent to the cloud.

Install from source (developers)

pip install -r requirements.txt

Python 3.10+ required.

Run

GUI (recommended):

streamlit run src/gui/app.py

CLI — seven entry points:

python -m src.cli            customers.csv [--apply]   # dedup
python -m src.cli_text_clean messy.csv     [--apply]   # text clean
python -m src.cli_format     intl.csv      [--apply]   # format standardize (auto-streams >100 MB)
python -m src.cli_missing    holes.csv     [--apply]   # missing values
python -m src.cli_column_map vendor.csv    [--apply]   # column mapper
python -m src.cli_pipeline   any_file.csv  [--apply]   # chain tools end-to-end
python -m src.cli_analyze    any_file.csv  [--json]    # scan only

Every CLI runs preview-only by default; add --apply to write output.

Language

The GUI sidebar has a language picker. Packs ship for English and Español (src/i18n/packs/); the choice persists for the session. Adding a language: drop a <code>.json next to en.json mirroring its key tree, then list it in LANGUAGES. See Developer Guide §i18n.

Review & Normalize gate

Every uploaded file passes through a CSV-normalization gate before any tool sees it. The analyzer flags ~15 issue types (whitespace, NBSP / zero-width chars, BOM, encoding, smart punct, dirty headers, null sentinels, mojibake, …) tagged by confidence (high / medium / low) and fix action. The GUI shows each finding with Auto-fix / Skip / Customize, a live before/after preview, and an encoding-override picker. Tool pages refuse to load until the gate passes.

Output

Every run writes:

  • {input}_<tool>.csv — the cleaned data
  • {input}_changes.csv (text cleaner) or {input}_match_groups.csv (dedup) — audit trail
  • logs/<tool>_YYYYMMDD_HHMMSS.log — debug-level run log

Original input file is never modified.

Docs

Dependencies

pandas, openpyxl, rapidfuzz, phonenumbers, typer, loguru, charset-normalizer, streamlit. Optional: ftfy for mojibake repair.

License

Proprietary.

Description
Data tools development
Readme 7.7 MiB
Languages
Python 87.3%
HTML 10%
CSS 1.8%
Shell 0.4%
JavaScript 0.2%
Other 0.2%