Michael 60969c0770 feat(pdf): UI rework — Auto-detect is the default build flow

Pulls the user's primary mental model away from "draw column
boundaries" toward "tell me what shape your amounts have, see
detected rows, save." The visual picker that wasn't working for
multi-statement workflows is reachable but no longer the
default.

**Build mode header** now has a mode radio:

- "Auto-detect (recommended)" — row_heuristic. Tabs: Amount
  layout · Filters & date · Save. Three small forms; no
  coordinate UI anywhere. The Amount-layout tab's dropdown picks
  one of single / txn+balance / debit+credit / debit+credit+balance
  and auto-derives the min/max amount-count range (overridable
  under an expander).
- "Visual columns (advanced)" — column_visual. Five tabs (the
  original Visual picker / Pages & table / Columns / Parsing /
  Save). A yellow warning panel up top reminds the user that
  column-x templates only work when statement layout is stable.
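
The auto-derived range on the Amount-layout tab can be pictured as a lookup from the chosen layout to a (min, max) count of amounts per row. A minimal sketch, assuming the layout keys and the specific ranges below; both are illustrative and not taken from the actual code:

```python
from typing import Optional, Tuple

# Hypothetical layout keys and ranges; the real values live in the
# row_heuristic implementation and may differ.
AMOUNT_LAYOUT_RANGES = {
    "single": (1, 1),
    "txn+balance": (2, 2),
    "debit+credit": (1, 2),
    "debit+credit+balance": (2, 3),
}


def derive_amount_range(
    layout: str, override: Optional[Tuple[int, int]] = None
) -> Tuple[int, int]:
    """Return (min, max) amounts per row, honoring a user override."""
    if override is not None:
        lo, hi = override
        if lo < 1 or hi < lo:
            raise ValueError(f"invalid amount-count override: {override}")
        return lo, hi
    try:
        return AMOUNT_LAYOUT_RANGES[layout]
    except KeyError:
        raise ValueError(f"unknown amount layout: {layout!r}") from None
```

The override argument stands in for the values entered under the expander; when it is absent, the dropdown choice alone decides the range.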

Switching modes triggers a rerun so the right tab set renders
immediately. The template object preserves both modes' config
trees side by side, so a user can flip between them without
losing work.
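
One way to keep both config trees alive across mode switches is to store them as sibling fields and let the mode field select the active one. A sketch only: the mode names row_heuristic and column_visual come from this commit, while the class and method names are hypothetical:

```python
from dataclasses import dataclass, field
from typing import Any, Dict

MODES = ("row_heuristic", "column_visual")


@dataclass
class Template:
    """Hypothetical template shape: one config tree per mode."""

    mode: str = "row_heuristic"
    row_heuristic: Dict[str, Any] = field(default_factory=dict)
    column_visual: Dict[str, Any] = field(default_factory=dict)

    def active_config(self) -> Dict[str, Any]:
        # The mode name doubles as the attribute name of its config tree.
        return getattr(self, self.mode)

    def switch_mode(self, mode: str) -> None:
        # Only the selector changes; neither config tree is touched.
        if mode not in MODES:
            raise ValueError(f"unknown mode: {mode!r}")
        self.mode = mode
```

Because switching only changes the selector, edits made in one mode are still there after flipping to the other mode and back.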

**Live preview** below the form runs ``apply_template`` against
the cached sample pages (already cached in session_state so this
re-renders cheaply on every form edit). The "no rows yet"
message is mode-aware — points users at the right tuning knobs
for whichever mode they're in. The preview caption notes which
mode produced the rows so the user can correlate decisions to
output.
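
The mode-aware preview text could be produced by a small helper like the one below. This is a sketch: the tab names in the hints come from this commit message, but the hint wording and the function name are assumptions:

```python
from typing import Sequence

# Hypothetical hint wording; only the tab names are from the commit message.
EMPTY_HINTS = {
    "row_heuristic": "No rows yet. Try another Amount layout or loosen Filters & date.",
    "column_visual": "No rows yet. Check Pages & table and the Columns boundaries.",
}


def preview_status(mode: str, rows: Sequence[dict]) -> str:
    """Return the hint for an empty preview, or a caption naming the mode."""
    if mode not in EMPTY_HINTS:
        raise ValueError(f"unknown mode: {mode!r}")
    if not rows:
        return EMPTY_HINTS[mode]
    return f"{len(rows)} rows detected ({mode} mode)"
```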

The visual picker bug the user reported — "a single box stays in
the same location regardless of page" — is sidestepped rather
than fixed: in row_heuristic mode there's no canvas to confuse,
and for the rare column_visual user the canvas is still
imperfect but no longer their first interaction with the tool.
Cleaning up the column_visual canvas state bugs is a separate
follow-up if real users still hit the Advanced mode.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-19 23:46:27 +00:00

Path	Last commit	Date
.github/workflows	build: wire desktop-bundle pipeline (CI matrix + per-platform installers)	2026-05-05 13:58:43 +00:00
.streamlit	feat(brand): rebrand to UNALOGIX DataTools + Clean. Normalize. Transform.	2026-05-19 01:45:38 +00:00
build	build(pdf): bundle PDF deps in installers + pin versions + smoke tests	2026-05-19 23:10:43 +00:00
docs	docs: design notes for future PDF→CSV tool	2026-05-17 01:52:42 +00:00
landing	docs+code: rename tool labels everywhere	2026-05-16 19:50:09 +00:00
marketing	docs(i18n): document language packs across user, dev, and marketing docs	2026-05-13 15:16:24 +00:00
samples	feat: 3 new tools, format streaming, distribution-ready demo + landing pages	2026-05-01 22:31:26 +00:00
scripts	feat(license): datatools-admin CLI for the mint API	2026-05-14 00:47:01 +00:00
server	feat(server): Gumroad webhook receiver + Postmark email (PR 2)	2026-05-14 01:33:43 +00:00
src	feat(pdf): UI rework — Auto-detect is the default build flow	2026-05-19 23:46:27 +00:00
test-cases	test(junk-corpus): pathological-input stress suite for the analyzer	2026-05-16 21:35:22 +00:00
tests	feat(pdf): schema v2 + mode field + v1 in-memory migration	2026-05-19 23:46:10 +00:00
.gitignore	feat: Tier B operator scaffolding — bundle, copy SoT, posts, emails	2026-05-02 14:04:37 +00:00
pytest.ini	fix: clear all latent deprecation + resource warnings	2026-05-13 16:28:48 +00:00
README.es.md	docs+code: rename tool labels everywhere	2026-05-16 19:50:09 +00:00
README.md	docs+code: rename tool labels everywhere	2026-05-16 19:50:09 +00:00
requirements-dev.txt	build(pdf): bundle PDF deps in installers + pin versions + smoke tests	2026-05-19 23:10:43 +00:00
requirements.txt	build(pdf): bundle PDF deps in installers + pin versions + smoke tests	2026-05-19 23:10:43 +00:00
run_tests.py	feat(gate): CSV-normalization gate with confidence-tiered findings	2026-04-29 20:35:27 +00:00
streamlit_app.py	feat: 3 new tools, format streaming, distribution-ready demo + landing pages	2026-05-01 22:31:26 +00:00
tox.ini	test: single-command runner, cross-platform automation, fixture auto-discovery	2026-04-29 16:01:06 +00:00

README.md

🌐 Language: English · Español

DataTools

Local CSV / Excel cleaning. CLI + browser GUI, no cloud, no install ceremony. GUI ships with English and Spanish language packs.

Tools

#	Tool	Status
01	Find Duplicates — exact + fuzzy match, 5 normalizers, survivor rules, audit	Ready
02	Clean Text — whitespace, smart chars, BOM, line endings, case ops	Ready
03	Standardize Formats — dates, phones, emails, addresses, names, currencies, booleans	Ready
04	Fix Missing Values — disguised-null detection, profile, mean/median/mode/ffill/bfill/interpolate, drop strategies	Ready
05	Map Columns — fuzzy auto-rename, target schema with type coercion, required fields with defaults, drop/reorder	Ready
06	Find Unusual Values	Coming Soon
07	Combine Files	Coming Soon
08	Quality Check	Coming Soon
09	Automated Workflows — chain tools with recommended (not forced) order, save/load JSON, automate weekly cleanups	Ready

Download (non-technical users)

Pre-built installers — no Python required:

Platform	Download	First-launch note
macOS	`DataTools-X.Y.Z-mac.dmg`	Drag DataTools.app into /Applications, then double-click.
Windows	`DataTools-X.Y.Z-win-setup.exe`	Run the installer; launches from Start Menu.
Linux	`DataTools-X.Y.Z-linux-x86_64.AppImage`	`chmod +x` the file, then double-click.

Latest release: see GitHub Releases (or the Gumroad listing). The installers are ~150–200 MB; the launcher boots a local server at http://127.0.0.1:8501 and opens your browser. Nothing is sent to the cloud.

Install from source (developers)

pip install -r requirements.txt

Python 3.10+ required.

Run

GUI (recommended):

streamlit run src/gui/app.py

CLI — seven entry points:

python -m src.cli            customers.csv [--apply]   # dedup
python -m src.cli_text_clean messy.csv     [--apply]   # text clean
python -m src.cli_format     intl.csv      [--apply]   # format standardize (auto-streams >100 MB)
python -m src.cli_missing    holes.csv     [--apply]   # missing values
python -m src.cli_column_map vendor.csv    [--apply]   # column mapper
python -m src.cli_pipeline   any_file.csv  [--apply]   # chain tools end-to-end
python -m src.cli_analyze    any_file.csv  [--json]    # scan only

Every cleaning CLI runs preview-only by default; add --apply to write output. cli_analyze is scan-only and takes --json instead.
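
The preview-by-default convention can be sketched as an opt-in flag. The project lists typer as a dependency; this example swaps in the standard library's argparse to stay self-contained, and the function names and program name are hypothetical:

```python
import argparse
from typing import List, Tuple


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="cli_example")
    parser.add_argument("input_file", help="CSV file to process")
    # Writing output is opt-in: without --apply the run only previews.
    parser.add_argument(
        "--apply", action="store_true", help="write output instead of previewing"
    )
    return parser


def run(argv: List[str]) -> Tuple[str, str]:
    """Return the run mode and the input path for the given arguments."""
    args = build_parser().parse_args(argv)
    mode = "apply" if args.apply else "preview"
    return mode, args.input_file
```

Calling run(["customers.csv"]) selects preview mode; adding "--apply" to the list switches to apply mode.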

Language

The GUI sidebar has a language picker. Packs ship for English and Español (src/i18n/packs/); the choice persists for the session. To add a language, drop a <language-code>.json file next to en.json that mirrors its key tree, then list it in LANGUAGES. See Developer Guide §i18n.
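
Whether a new pack mirrors the en.json key tree can be checked by flattening both packs into dotted key paths and comparing the sets. A minimal sketch, assuming packs are nested JSON objects with string leaves; the helper names are hypothetical and not part of the project:

```python
from typing import Any, Dict, List, Set, Tuple


def key_paths(tree: Dict[str, Any], prefix: str = "") -> Set[str]:
    """Flatten a nested dict into dotted leaf paths such as 'nav.home'."""
    paths: Set[str] = set()
    for key, value in tree.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            paths |= key_paths(value, path)
        else:
            paths.add(path)
    return paths


def compare_packs(
    reference: Dict[str, Any], candidate: Dict[str, Any]
) -> Tuple[List[str], List[str]]:
    """Return (keys missing from candidate, keys not present in reference)."""
    ref, cand = key_paths(reference), key_paths(candidate)
    return sorted(ref - cand), sorted(cand - ref)
```

Loading the two files with json.load and passing the results to compare_packs gives the keys still to translate and any stray keys to remove.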

Review & Normalize gate

Every uploaded file passes through a CSV-normalization gate before any tool sees it. The analyzer flags ~15 issue types (whitespace, NBSP / zero-width chars, BOM, encoding, smart punct, dirty headers, null sentinels, mojibake, …) tagged by confidence (high / medium / low) and fix action. The GUI shows each finding with Auto-fix / Skip / Customize, a live before/after preview, and an encoding-override picker. Tool pages refuse to load until the gate passes.
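
A confidence-tagged finding can be modeled as a small record that the scan emits and the gate checks against. The sketch below covers four of the issue types named above; the Finding shape, the confidence assigned to each issue, and the function names are assumptions for illustration:

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Finding:
    issue: str
    confidence: str  # "high", "medium", or "low"
    count: int


ZERO_WIDTH = ("\u200b", "\u200c", "\u200d")
SMART_PUNCT = ("\u2018", "\u2019", "\u201c", "\u201d")


def scan_text(text: str) -> List[Finding]:
    """Flag BOM, NBSP, zero-width chars, and smart quotes in raw text."""
    findings: List[Finding] = []
    if text.startswith("\ufeff"):
        findings.append(Finding("bom", "high", 1))
    nbsp = text.count("\u00a0")
    if nbsp:
        findings.append(Finding("nbsp", "high", nbsp))
    zero_width = sum(text.count(ch) for ch in ZERO_WIDTH)
    if zero_width:
        findings.append(Finding("zero_width", "high", zero_width))
    smart = sum(text.count(ch) for ch in SMART_PUNCT)
    if smart:
        findings.append(Finding("smart_punct", "medium", smart))
    return findings


def gate_passes(findings: List[Finding], resolved: Set[str]) -> bool:
    """The gate passes once every finding was auto-fixed or skipped."""
    return all(f.issue in resolved for f in findings)
```

Here resolved stands for the issue names the user has already handled with Auto-fix, Skip, or Customize.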

Output

Every run writes:

{input}_<tool>.csv — the cleaned data
{input}_changes.csv (text cleaner) or {input}_match_groups.csv (dedup) — audit trail
logs/<tool>_YYYYMMDD_HHMMSS.log — debug-level run log

The original input file is never modified.
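
The naming scheme above can be expressed with pathlib. A sketch under stated assumptions: {input} is taken to mean the input file's stem, outputs are assumed to land next to the input, and the function name and tool strings are hypothetical:

```python
from datetime import datetime
from pathlib import Path
from typing import Dict, Optional


def output_paths(
    input_path: str, tool: str, audit_suffix: str, now: Optional[datetime] = None
) -> Dict[str, Path]:
    """Derive cleaned-data, audit-trail, and log paths for one run."""
    src = Path(input_path)
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return {
        # Derived names never collide with the input, which stays untouched.
        "cleaned": src.with_name(f"{src.stem}_{tool}.csv"),
        "audit": src.with_name(f"{src.stem}_{audit_suffix}.csv"),
        "log": Path("logs") / f"{tool}_{stamp}.log",
    }
```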

Docs

User Guide — install, GUI workflow, gate
CLI Reference — every flag with recipes
Requirements — file sizes, encodings, detectors, perf targets
Technical — architecture, gate internals, fix registry
Developer Guide — adding fixes / detectors / standardizers

Dependencies

pandas, openpyxl, rapidfuzz, phonenumbers, typer, loguru, charset-normalizer, streamlit. Optional: ftfy for mojibake repair.

License

Proprietary.

Languages

Python 87.3% · HTML 10% · CSS 1.8% · Shell 0.4% · JavaScript 0.2% · Other 0.2%
