datatools-dev

Author	SHA1	Message	Date
Michael	93ccada974	build: bundle Tesseract 5.5.0 + tessdata into every release artifact End users no longer have to install Tesseract separately for OCR on scanned PDFs — the engine ships inside the installer, portable .zip, and AppImage for all three platforms. Per-platform fetch in build/make_release.py (run before PyInstaller): - Windows: download UB-Mannheim installer 5.5.0.20241111, extract with 7-Zip, copy tesseract.exe + required DLLs into the staging dir. - macOS: ``brew install tesseract``, copy binary + every Homebrew- prefixed dylib resolved via otool -L (recurse one level for transitive deps), then install_name_tool rewrites IDs / load paths to @loader_path/... so the bundle is relocatable. - Linux: ``apt-get install tesseract-ocr libtesseract5``, copy binary + every non-system .so from ldd output, patchelf --set-rpath '$ORIGIN'. Wire-up: - build/datatools.spec reads DATATOOLS_TESS_STAGING env var (set by make_release) and adds the staging dir + tessdata + the LICENSE_TESSERACT.txt Apache 2.0 attribution to PyInstaller datas so they land at <bundle>/tesseract/{tesseract[.exe],tessdata/} and the license sits at the bundle root. Soft-warns when staging is empty so dev spec runs still complete. - English tessdata pulled by fetch_tessdata() from tesseract-ocr/tessdata_best (eng.traineddata, ~16 MB). Cached at build/vendor/tessdata/. - .github/workflows/build.yml: actions/cache@v4 step keyed on ``tesseract-${runner.os}-5.5.0-tessdata_best-v1`` caches the staging dir and the vendored tessdata across runs; apt installs patchelf on the Linux runner; PyInstaller step now receives the DATATOOLS_TESS_STAGING env var. - .gitignore: build/_tesseract/ and the .traineddata blob. - TESSERACT_SKIP_FETCH=1 honored for offline / manual stages. - Installer / .dmg / .zip / AppImage scripts: one-line comments confirming Tesseract rides along automatically via PyInstaller's datas (no extra packaging steps required in those scripts). Bundle-size delta: ~50-70 MB on disk per platform, ~25-40 MB post- compression. Net installer size ~250-300 MB (was ~120 MB) — accepted tradeoff for zero end-user OCR setup. Reversal of the prior "don't bundle Tesseract" decision (option A). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-06-02 18:20:33 +00:00
Michael	9c426194b1	build: add single-command release script + portable zip artifacts One-developer workflow: ``python build/make_release.py`` on each target OS produces both the installer and a portable .zip for that platform. Preflight checks PyInstaller / Pillow / iscc / hdiutil / ditto / appimagetool and bails with install hints if anything is missing — no half-built dist/. New scripts: - build/make_release.py — orchestrator, auto-detects host OS. - build/generate_icons.py — icon.ico / icon.icns / icon.png from src/gui/assets/datatools_icon_256.png (Pillow ships ICO + ICNS writers; no platform tooling needed). - build/build_portable_zip.py — Win/Linux portable zip via stdlib. - build/macos/build_zip.sh — Mac portable .app via ditto so bundle metadata survives. installer.iss now adds: Quick Launch task (opt-in, legacy Win 7), App Paths registry entry (Win+R "DataTools" works), SetupIconFile, UninstallDisplayIcon, AppSupportURL, AppUpdatesURL. CI workflow uploads installer + portable per platform and attaches both to GitHub Releases on tag push. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 19:30:17 +00:00
Michael	e1f364f010	feat: Tier B operator scaffolding — bundle, copy SoT, posts, emails Pick up and finish yesterday's cut-off Tier B pass. - build/: PyInstaller scaffold (datatools.spec + launcher.py + hook-streamlit.py + README) — folder-mode bundle, locked 127.0.0.1, per-OS recipe - marketing/COPY.md: single source of truth for every customer-facing string — landing H1/sub/CTAs, demo CTAs, email subjects, Gumroad listing, banned phrases - marketing/community-posts/: 9 drafts (3 posts × 3 niches: bookkeeper, revops, shopify-pet) — story / tip / soft-offer - marketing/emails/: 18 drafts (Gumroad delivery + 5-touch onboarding × 3 niches), per-niche segmentation guidance - docs/NEXT-STEPS.md: flip 2.2 / 2.4 / 3.1 / 3.4 to done with pointers to the new assets; add Phase 0 inventory rows - .gitignore: narrow `build/` ignore so PyInstaller spec + launcher + hooks get tracked, only generated artifacts (build/build/, build/__pycache__/, build/dist/) stay ignored Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 14:04:37 +00:00
Michael	966af8ef94	feat: 3 new tools, format streaming, distribution-ready demo + landing pages Tools shipped this batch (4 → 6 of 9 Ready): 04 Missing Value Handler src/core/missing.py + cli_missing.py + GUI 05 Column Mapper src/core/column_mapper.py + cli_column_map.py + GUI 09 Pipeline Runner src/core/pipeline.py + cli_pipeline.py + GUI with soft tool-dependency graph (recommended, not enforced) and JSON save/load for repeatable weekly cleanups. Format Standardizer reworked for 1 GB international files: • Vectorised dispatch + LRU cache over phone/date/currency/boolean/email • Per-row country / address columns drive parsing • Audit cap (default 10 k rows, ~50 MB RAM) • standardize_file(): chunked streaming entry point (~165 k rows/sec) • currency_decimal="auto" for EU comma-decimal locales • R$ / kr / zł multi-char currency prefixes • cli_format.py with auto-stream above 100 MB inputs Encoding detection arbiter + language-aware probe: Closes the last 4 xfails (cp1250 / mac_iceland / shift_jis_2004 / lying-BOM) via tied-confidence arbiter + Cyrillic / EE-Latin coverage probes. Distribution-readiness assets: • streamlit_app.py — Streamlit Community Cloud entry shim • src/gui/app_demo.py — single-page demo, ?p=<persona> routing, 100-row cap + watermark, free-vs-paid boundary enforced at surface • samples/demo/ — 3 niche datasets + pre-tuned pipeline JSONs • landing/ — 4 static HTML pages (apex chooser + 3 niche), shared CSS, deploy.py URL-substitution script, auto-generated robots.txt + sitemap.xml + 404.html + favicon • docs/PLAN.md, DEMO-PLAN.md, DEPLOYMENT.md, POST-LAUNCH.md, NEXT-STEPS.md — full strategy + measurement + deployment + master checklist Test counts: before: 1,520 passed · 4 skipped · 17 xfailed after: 1,729 passed · 0 skipped · 0 xfailed Tier-1 corpora added: • missing-corpus 3 use cases + 16 edge cases • column-mapper-corpus 3 use cases + 5 edge cases • format-cleaner intl 20-row 13-country stress fixture Engine hardening flushed out by the corpora: • interpolate guards against object-dtype columns • mean/median skip all-NaN columns (silences numpy warning) • fillna runs under future.no_silent_downcasting (silences pandas warning) • mojibake test no longer skips when ftfy installed (monkeypatch path) • drop-row threshold semantics: strict-greater (consistent across rows / cols) • currency_decimal validator allow-set updated for "auto" Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 22:31:26 +00:00
Michael	b23a27d4e3	fix: cross-tool audit findings + alignment with format standardizer Closes 12 bugs and 8 gaps surfaced by parallel audits across all core modules, plus aligns the dedup-side normalizers with the new format_standardize behavior where they had silently diverged. Bugs (data integrity / correctness): - dedup: NaN/None values matched as duplicates because str(None)='None'. Two rows with missing email silently merged. - dedup: removed_df had 0 columns when nothing was removed; downstream code expecting matching schema broke. Now preserves column shape. - dedup: ColumnMatchStrategy threshold accepted any value; out-of-range silently broke matching. Validated to [0, 100] in __post_init__. - dedup: strategy referencing a missing column was silently skipped. Now raises ValueError listing available columns. - fixes: replace_null_sentinels crashed on non-string sentinels (int/None from JSON payload). Coerced to str. - fixes: _vectorized_regex_sub raised raw re.error on bad patterns. Now wraps as ValueError with clear message. - io: detect_header_row mis-identified all-empty and metadata-only rows as headers (all([]) is True). Now requires ≥2 non-empty cells. - config: from_dict crashed when JSON had unknown fields, breaking forward compat. Now filters to known fields. - analyze: mixed-case email detector flagged all-None columns because str(None)='None' contains both N and one. Now drops NaN before stringify. New features and gap closures: - io: _detect_excel_header_row mirrors detect_header_row for Excel via openpyxl read-only; _read_excel uses it when header_row=None. - io: write_file gains delimiter + encoding params; .tsv extension defaults to tab. - normalizers: normalize_phone preserves extensions as ;ext=N suffix. - normalizers: normalize_address folds spelled-out US state names to 2-letter codes (California ≡ CA). - normalizers: normalize_name drops surname particles (van, de, von) so "Charles de Gaulle" ≡ "Charles Gaulle" for matching. - analyze: new _detect_inconsistent_date_format detector flags columns with mixed ISO/US/EU date shapes; routes to format standardizer. - analyze: _NULL_LIKE recognizes "<na>" (pd.NA repr). - analyze: duplicate-row finding renamed count → n_extra (rows that would actually be removed) with clarified description. - dedup: group_confidence no longer falsely 100.0 when transitive group members lack a recorded direct pair; falls back to 100.0 only when truly no pairs were observed. - dedup: MatchResult / DeduplicationResult docstrings clarify that row_indices refer to the input frame's positional index (output index is reset). - text_clean: visualize_hidden_html(None) now returns None (matches visualize_hidden_text); strip_bom strips at most one BOM per call; sentence_case dead elif branch removed. Tests: - tests/test_audit_fixes.py — 28 regression tests, one or more per numbered finding, named after BUG/GAP/NIT tags so future readers can trace each test back to its audit. - tests/test_fixes_unit.py — 26 isolated unit tests for previously integration-only fix functions (trim_whitespace, strip_nbsp, strip_zero_width, normalize_line_endings, clean_headers, repair_mojibake — last skipped if ftfy unavailable). - tests/test_io.py — adds CSV / TSV / semicolon / UTF-8-BOM round-trip tests + Excel auto-header-detection tests. - tests/test_normalizers.py — adds 8 tests for the alignment work above (phone extension, state names, particles). Adds .claude/ to .gitignore (agent worktrees + local settings). Full project suite: 1197 passed, 4 skipped, 17 xfailed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 02:11:57 +00:00
Michael	b871ab24fc	feat: add documentation, Streamlit GUI, and full source tree - Rewrite README.md with project overview, quick-start, and CLI summary - Add docs/CLI-REFERENCE.md with full flag reference and 8 recipe sections - Add docs/DEVELOPER.md with architecture, data flow, and extension guides - Rewrite src/core/__init__.py with public API exports and module docstring - Add Streamlit GUI (src/gui/) with file upload, advanced options, interactive match group review with side-by-side diff, and download buttons - Add .gitignore, requirements.txt, all source code, tests, and sample data - Add streamlit to requirements.txt Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-28 23:06:39 +00:00

6 Commits