datatools-dev/docs
Michael 82d7fef21e feat(gate): CSV-normalization gate with confidence-tiered findings
Adds a Review & Normalize page that sits between upload and every tool
page. The analyzer now tags each finding with confidence (high/medium/low)
and a fix_action; the gate auto-applies high-confidence fixes, surfaces
medium/low ones for user review, and blocks tool pages on error-level
findings until resolved or waived.
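
The tiering described above can be illustrated with a minimal sketch. It is not the code in src/core/: only the field names confidence, fix_action, and pre_applied come from this commit. The code and severity fields, the two registry entries, the triage() function, and the meaning given to pre_applied (set once a fix has been applied automatically) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List, Optional, Tuple


@dataclass
class Finding:
    code: str                        # assumed identifier for the finding
    severity: str                    # assumed levels: "error", "warning", "info"
    confidence: str                  # "high", "medium", or "low"
    fix_action: Optional[str] = None
    pre_applied: bool = False        # assumed: True once auto-applied


# Hypothetical registry keyed by fix_action id, in the spirit of fixes.py.
FIXES: Dict[str, Callable[[str], str]] = {
    "strip_bom": lambda text: text.lstrip("\ufeff"),
    "normalize_line_endings": lambda text: text.replace("\r\n", "\n").replace("\r", "\n"),
}


def triage(
    text: str,
    findings: List[Finding],
    waived: FrozenSet[str] = frozenset(),
) -> Tuple[str, List[Finding], bool]:
    """Apply high-confidence fixes, collect everything else for review,
    and report whether unresolved error-level findings should block."""
    needs_review: List[Finding] = []
    for finding in findings:
        fix = FIXES.get(finding.fix_action) if finding.fix_action else None
        if finding.confidence == "high" and fix is not None:
            text = fix(text)
            finding.pre_applied = True
        else:
            needs_review.append(finding)
    # Error-level findings block until resolved or explicitly waived.
    blocked = any(
        f.severity == "error" and f.code not in waived for f in needs_review
    )
    return text, needs_review, blocked
```

In this sketch a high-confidence finding with no registered fix falls through to review rather than being silently dropped, which keeps the auto-apply path limited to fixes that actually exist in the registry.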

Core (src/core/):
  - analyze.py: Finding gains confidence, fix_action, pre_applied; new
    detectors for encoding_uncertain, encoding_decode_failed; new
    top-level encoding_override parameter.
  - fixes.py: registry of fix algorithms keyed by fix_action id.
  - normalize.py: auto_fix(), apply_decisions(), is_normalized(), and
    the NormalizationResult / Decision dataclasses the gate consumes.
  - io.py: detect_encoding tries strict UTF-8 first; repair_bytes now
    transcodes UTF-16/32 to UTF-8 before NUL-strip (fixes UTF-16 corruption)
    and normalizes line endings (fixes bare-CR parser crash); empty file
    handled gracefully instead of EmptyDataError traceback.
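
The ordering of the io.py repair steps matters, and a rough sketch may make the reason clearer. The function names below carry a _sketch suffix because they are illustrative stand-ins, not the real detect_encoding and repair_bytes; the BOM-based UTF-16/32 detection and the latin-1 fallback are assumptions, since the commit only states that strict UTF-8 is tried first.

```python
import codecs

# BOMs checked longest-first: the UTF-32-LE BOM begins with the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def detect_encoding_sketch(raw: bytes) -> str:
    """Try strict UTF-8 first, then BOM-marked UTF-16/32, then a fallback."""
    try:
        raw.decode("utf-8", errors="strict")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    return "latin-1"  # assumed fallback; it can decode any byte sequence


def repair_bytes_sketch(raw: bytes) -> bytes:
    """Transcode wide encodings, strip NULs, and normalize line endings."""
    if not raw:
        return b""  # empty input is returned as-is instead of raising
    encoding = detect_encoding_sketch(raw)
    if encoding in ("utf-16", "utf-32"):
        # Transcode before NUL-stripping: in UTF-16/32 the zero bytes are
        # part of ordinary characters, so stripping them first corrupts text.
        raw = raw.decode(encoding).encode("utf-8")
    raw = raw.replace(b"\x00", b"")
    # Convert CRLF and bare CR to LF so the parser sees one line-ending style.
    return raw.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
```

For example, repair_bytes_sketch("a,b\r1,2\r\n".encode("utf-16")) yields b"a,b\n1,2\n", whereas stripping NUL bytes from the same UTF-16 input before decoding would leave a stray BOM and mangle any non-ASCII characters.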

GUI (src/gui/):
  - pages/0_Review.py: gate page with per-finding decision controls,
    encoding override picker (16 codepages + custom), and Advanced output
    options (encoding, delimiter, line terminator) on the download.
  - components.py: require_normalization_gate() helper.
  - pages/1-9: gate guard wired on every tool page.
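
The Advanced output options on the download (encoding, delimiter, line terminator) map naturally onto Python's csv module. The helper below is a hypothetical sketch of that mapping, not the helper shipped in the Review page; its name, signature, and defaults are assumptions.

```python
import csv
import io
from typing import Iterable, Sequence


def render_csv_bytes(
    rows: Iterable[Sequence[str]],
    encoding: str = "utf-8",
    delimiter: str = ",",
    line_terminator: str = "\n",
) -> bytes:
    """Serialize rows with the chosen delimiter and line terminator,
    then encode the text with the chosen output encoding."""
    # newline="" keeps the buffer from translating the line terminator.
    buf = io.StringIO(newline="")
    writer = csv.writer(buf, delimiter=delimiter, lineterminator=line_terminator)
    writer.writerows(rows)
    return buf.getvalue().encode(encoding)
```

With the default QUOTE_MINIMAL quoting, csv.writer quotes any field that contains the chosen delimiter, so switching the delimiter to ";" does not break a value such as "x;y". Encoding to a codepage that cannot represent a character raises UnicodeEncodeError in this sketch; a real download path would need to decide how to report that.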

Test corpora:
  - test-cases/encodings-corpus/: 31 encoded CSV fixtures + 9 reference
    UTF-8 files + manifest, synced from Business/DataTools.
  - test-cases/text-cleaner-corpus/test_data/17: synced malformed input
    (unquoted $1,500.00) for the unquoted-delimiter detector.

Tests (94 new):
  - test_normalize.py (48): finding fields, fix registry, auto_fix scope,
    decision paths, gate idempotency, output-options helper.
  - test_encodings_corpus.py (90, 16 xfailed): parametric detection +
    decode + analyzer-no-crash sweep against the manifest.
  - test_analyze.py: encoding override + encoding_uncertain detectors.
  - test_corpus.py: pre-parse repair in the strict reader.

run_tests.py: new aliases --tool normalize, --tool encodings, --tool gate;
encodings corpus added to --fixtures category.

Docs: USER-GUIDE §3.3 covers the gate workflow, encoding override, and
output options; TECHNICAL §10.2.1-10.2.4 documents the analyzer schema,
gate API, Review page, and pre-parse repair pipeline; CLI-REFERENCE adds
the analyzer JSON schema with the new fields; README links to all of it.

Suite: 765 passed, 17 xfailed (was 458 passed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 20:35:27 +00:00

Excel & CSV Data Cleaning Mastery Bundle

A ready-to-sell Python automation product: nine scripts for data cleaning, deduplication, text hygiene, formatting, merging, validation, and reporting.

Each script ships with both a GUI (runs in your browser locally, no internet needed) and a CLI.

Cross-platform: Windows, macOS, Linux.


Quick Start (for buyers)

  1. Download the installer for your operating system.
  2. Run the installer. No Python knowledge required.
  3. Launch via the desktop shortcut "Launch Bundle" (or the app icon on macOS, or the AppImage on Linux).
  4. Your default browser opens to a local page where the data tool runs. Your data never leaves your computer.

Full instructions: see USER-GUIDE.md.


Documentation Index

Ships with the product (buyer-facing)

  • USER-GUIDE.md - Installation, script reference, usage examples for both GUI and CLI.

Creator-only (do not ship to buyers)

  • BUSINESS.md - Business case, market analysis, pricing, marketing strategy (including the hosted browser demo as a conversion lever).
  • TECHNICAL.md - Architecture (dual CLI + Streamlit GUI), build pipeline, dev standards.
  • DECISIONS.md - Locked criteria, scoring rubric, decisions log, rationale for product choices including the GUI framework decision.
  • RECOVERY.md - How to rebuild the entire project from scratch if lost.

Version: 1.6
Last updated: April 28, 2026
Owner: Michael