feat: 3 new tools, format streaming, distribution-ready demo + landing pages

Tools shipped this batch (4 → 6 of 9 Ready):
  04 Missing Value Handler   src/core/missing.py + cli_missing.py + GUI
  05 Column Mapper           src/core/column_mapper.py + cli_column_map.py + GUI
  09 Pipeline Runner         src/core/pipeline.py + cli_pipeline.py + GUI
                             with soft tool-dependency graph (recommended,
                             not enforced) and JSON save/load for repeatable
                             weekly cleanups.
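
A minimal sketch of what a soft dependency check could look like, assuming
the graph is a plain dict from a tool name to the tools it prefers to follow.
The tool names, SOFT_DEPS, and order_warnings are illustrative and not the
actual contents of src/core/pipeline.py. The point is that an out-of-order
step yields a warning rather than an error.

```python
# Hypothetical soft-dependency graph: tool -> tools it is recommended to run after.
SOFT_DEPS = {
    "format": ["text_clean"],
    "missing": ["format"],
    "dedup": ["missing"],
}


def order_warnings(steps):
    """Return advisory messages for steps that run before a recommended predecessor.

    Nothing is raised and nothing is reordered: the caller decides whether
    to show the warnings or ignore them.
    """
    position = {tool: i for i, tool in enumerate(steps)}
    warnings = []
    for tool, i in position.items():
        for dep in SOFT_DEPS.get(tool, []):
            # Only warn when the predecessor is present but scheduled later.
            if dep in position and position[dep] > i:
                warnings.append(f"{tool} is recommended to run after {dep}")
    return warnings
```

A dependency that is absent from the pipeline produces no warning, which is
what keeps the graph "recommended, not enforced".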

Format Standardizer reworked for 1 GB international files:
  • Vectorised dispatch + LRU cache over phone/date/currency/boolean/email
  • Per-row country / address columns drive parsing
  • Audit cap (default 10 k rows, ~50 MB RAM)
  • standardize_file(): chunked streaming entry point (~165 k rows/sec)
  • currency_decimal="auto" for EU comma-decimal locales
  • R$ / kr / zł multi-char currency prefixes
  • cli_format.py with auto-stream above 100 MB inputs
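
One way the "auto" decimal mode and the multi-character prefixes could work,
as a sketch only: parse_amount and SYMBOLS are hypothetical names, and the
rule used here (the rightmost of "," and "." is the decimal separator) is an
assumption, not the shipped heuristic.

```python
from decimal import Decimal

# Hypothetical symbol list; longer symbols are tried first so "R$" wins over "$".
SYMBOLS = ("R$", "kr", "zł", "$", "€", "£")


def parse_amount(text, decimal="auto"):
    """Parse a currency string into a Decimal.

    decimal="auto" treats whichever of "," or "." appears last as the
    decimal separator. A lone separator is ambiguous ("1,234" parses as
    1.234 under this rule), so pass decimal explicitly when the locale
    is known.
    """
    s = text.strip()
    for sym in sorted(SYMBOLS, key=len, reverse=True):
        if s.startswith(sym):
            s = s[len(sym):]
            break
        if s.endswith(sym):
            s = s[: -len(sym)]
            break
    # Drop regular and non-breaking spaces used as thousands separators.
    s = s.replace("\u00a0", "").replace(" ", "")
    if decimal == "auto":
        decimal = "," if s.rfind(",") > s.rfind(".") else "."
    thousands = "." if decimal == "," else ","
    s = s.replace(thousands, "").replace(decimal, ".")
    return Decimal(s)
```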

Encoding detection arbiter + language-aware probe:
  Closes the last 4 xfails (cp1250 / mac_iceland / shift_jis_2004 / lying-BOM)
  via tied-confidence arbiter + Cyrillic / EE-Latin coverage probes.

Distribution-readiness assets:
  • streamlit_app.py — Streamlit Community Cloud entry shim
  • src/gui/app_demo.py — single-page demo, ?p=<persona> routing,
    100-row cap + watermark, free-vs-paid boundary enforced at surface
  • samples/demo/ — 3 niche datasets + pre-tuned pipeline JSONs
  • landing/ — 4 static HTML pages (apex chooser + 3 niche),
    shared CSS, deploy.py URL-substitution script,
    auto-generated robots.txt + sitemap.xml + 404.html + favicon
  • docs/PLAN.md, DEMO-PLAN.md, DEPLOYMENT.md, POST-LAUNCH.md, NEXT-STEPS.md
    — full strategy + measurement + deployment + master checklist

Test counts:
  before: 1,520 passed · 4 skipped · 17 xfailed
  after:  1,729 passed · 0 skipped · 0  xfailed

Tier-1 corpora added:
  • missing-corpus           3 use cases + 16 edge cases
  • column-mapper-corpus     3 use cases + 5 edge cases
  • format-cleaner intl      20-row 13-country stress fixture

Engine hardening flushed out by the corpora:
  • interpolate guards against object-dtype columns
  • mean/median skip all-NaN columns (silences numpy warning)
  • fillna runs under future.no_silent_downcasting (silences pandas warning)
  • mojibake test no longer skips when ftfy installed (monkeypatch path)
  • drop-row threshold semantics: strict-greater (consistent across rows / cols)
  • currency_decimal validator allow-set updated for "auto"
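
Two of the hardening items above, sketched with plain pandas. The function
names are illustrative and the real logic lives in src/core/missing.py; this
only shows the strict-greater comparison and the all-NaN guard.

```python
import pandas as pd


def drop_sparse_rows(df, threshold):
    """Drop rows whose missing fraction is strictly greater than threshold.

    A row sitting exactly at the threshold is kept.
    """
    missing_fraction = df.isna().mean(axis=1)
    return df.loc[~(missing_fraction > threshold)]


def fill_numeric_median(df):
    """Fill numeric columns with their median, skipping all-NaN columns.

    An all-NaN column has no median to fill with, so it is left untouched
    instead of being passed to the reduction.
    """
    out = df.copy()
    for col in out.select_dtypes(include="number").columns:
        if out[col].notna().any():
            out[col] = out[col].fillna(out[col].median())
    return out
```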

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 22:31:26 +00:00
parent d18b95880d
commit 966af8ef94
89 changed files with 12039 additions and 284 deletions


@@ -1,111 +1,368 @@
"""DataTools Missing Value Handler — stub page."""
"""DataTools Missing Value Handler — Streamlit page."""
from __future__ import annotations
import io
import json
import sys
from pathlib import Path
import pandas as pd
import streamlit as st
_project_root = Path(__file__).resolve().parent.parent.parent.parent
if str(_project_root) not in sys.path:
sys.path.insert(0, str(_project_root))
from src.gui.components import hide_streamlit_chrome, require_normalization_gate
from src.gui.components import (
hide_streamlit_chrome,
pickup_or_upload,
require_normalization_gate,
)
from src.core.missing import (
DEFAULT_SENTINELS,
MissingOptions,
PRESETS,
handle_missing,
profile_missing,
)
hide_streamlit_chrome()
require_normalization_gate()
# ---------------------------------------------------------------------------
# Header
# ---------------------------------------------------------------------------
st.title("🕳️ Missing Value Handler")
st.caption("Detect, analyze, and handle missing values in your data.")
st.caption(
"Detect disguised nulls, profile missingness, and apply imputation or "
"drop strategies. Runs locally — your data never leaves this computer."
)
st.info("This tool is under development.")
# ---------------------------------------------------------------------------
# What this tool will do
# File upload
# ---------------------------------------------------------------------------
st.markdown("""
**Features:**
- Detect disguised nulls (empty strings, "N/A", "n/a", "-", "NULL", "None", etc.)
- Missingness analysis: per-column counts, percentages, and patterns
- Visualize missing data heatmap
- Imputation strategies: drop rows/columns, fill with mean/median/mode, forward-fill, backward-fill
- Custom sentinel value replacement
- Before/after comparison
""")
uploaded = pickup_or_upload(
label="Upload CSV or Excel file",
key="missing_file_upload",
types=["csv", "tsv", "xlsx", "xls"],
)
if uploaded is None:
st.info("Upload a CSV, TSV, or Excel file to begin.")
st.stop()
@st.cache_data(show_spinner=False)
def _read_uploaded(name: str, data: bytes) -> pd.DataFrame:
"""Read the uploaded bytes into a DataFrame.
Unlike the text cleaner, we do *not* force ``dtype=str`` here: missing-
value handling is more useful when numeric columns are typed correctly
(so mean / median / interpolate work without manual coercion).
Sentinel strings are still detected because they survive in object
columns where any cell is non-numeric.
"""
suffix = Path(name).suffix.lower()
bio = io.BytesIO(data)
if suffix in (".xlsx", ".xls"):
return pd.read_excel(bio)
for enc in ("utf-8", "utf-8-sig", "latin-1"):
try:
bio.seek(0)
sep = "\t" if suffix == ".tsv" else ","
return pd.read_csv(bio, encoding=enc, sep=sep, on_bad_lines="warn")
except UnicodeDecodeError:
continue
bio.seek(0)
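# Last-resort fallback: keep the same separator as the attempts above.
return pd.read_csv(bio, encoding="latin-1", sep="\t" if suffix == ".tsv" else ",")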
try:
df = _read_uploaded(uploaded.name, uploaded.getvalue())
except Exception as e:
from src.core.errors import format_for_user
st.error(
f"**Could not read `{uploaded.name}`**\n\n"
f"```\n{format_for_user(e)}\n```"
)
st.stop()
st.subheader(f"Preview: {uploaded.name}")
st.caption(f"{len(df)} rows, {len(df.columns)} columns")
st.dataframe(df.head(10), use_container_width=True)
st.divider()
# ---------------------------------------------------------------------------
# File upload (functional)
# Initial profile (read-only)
# ---------------------------------------------------------------------------
uploaded = st.file_uploader(
"Upload CSV or Excel file",
type=["csv", "tsv", "xlsx", "xls"],
help="Upload a file to preview. Processing is not yet available.",
key="missing_file_upload",
)
st.subheader("Missingness profile")
if uploaded is not None:
import pandas as pd
try:
if uploaded.name.endswith((".xlsx", ".xls")):
df = pd.read_excel(uploaded)
else:
df = pd.read_csv(uploaded)
st.subheader(f"Preview: {uploaded.name}")
st.caption(f"{len(df)} rows, {len(df.columns)} columns")
st.dataframe(df.head(10), use_container_width=True)
except Exception as e:
from src.core.errors import format_for_user
st.error(
f"**Could not read `{uploaded.name}`**\n\n"
f"```\n{format_for_user(e)}\n```"
initial_profile = profile_missing(df, MissingOptions())
prof_df = initial_profile.to_dataframe()
m1, m2, m3, m4 = st.columns(4)
m1.metric("Rows", initial_profile.rows_total)
m2.metric("Cells missing", initial_profile.cells_missing)
m3.metric("% cells missing", f"{initial_profile.cells_missing_pct:.1f}%")
m4.metric("Complete rows", initial_profile.rows_complete)
st.dataframe(prof_df, use_container_width=True, hide_index=True)
if initial_profile.cells_missing == 0:
st.success("No missing values or disguised nulls detected. Nothing to handle.")
st.divider()
# ---------------------------------------------------------------------------
# Options
# ---------------------------------------------------------------------------
st.subheader("Strategy")
preset_label = st.radio(
"Preset",
[
"detect-only (standardize sentinels to NaN, no fill or drop)",
"safe-fill (numeric → median, categorical → mode)",
"drop-incomplete (drop any row with missing)",
],
index=0,
help=(
"detect-only: replace 'N/A', '-', 'NULL', etc. with real NaN, then stop. "
"safe-fill: also fill — numeric columns with median, others with mode. "
"drop-incomplete: also drop every row that has any missing cell."
),
)
preset_key = preset_label.split(" ", 1)[0]
options = MissingOptions.from_preset(preset_key)
with st.expander("Advanced options"):
col_a, col_b = st.columns(2)
with col_a:
st.markdown("**Detection**")
options.standardize_sentinels = st.checkbox(
"Standardize disguised nulls to NaN",
value=options.standardize_sentinels,
help="Replace 'N/A', '-', 'NULL', whitespace-only cells, etc. with real NaN.",
)
sentinels_text = st.text_input(
"Sentinel values (comma-separated)",
value=", ".join(options.sentinels),
disabled=not options.standardize_sentinels,
help="Matched case-insensitively after stripping whitespace.",
)
options.sentinels = [
s.strip() for s in sentinels_text.split(",") if s.strip()
]
with col_b:
st.markdown("**Strategy override**")
strat_options = [
"(use preset)",
"none", "drop_row", "drop_col", "drop_both",
"mean", "median", "mode", "constant",
"ffill", "bfill", "interpolate",
]
strat_choice = st.selectbox(
"Global strategy",
strat_options,
index=0,
help=(
"drop_row / drop_col use the thresholds below. "
"mean / median / interpolate are numeric only — non-numeric "
"columns fall back to the categorical strategy."
),
)
if strat_choice != "(use preset)":
options.strategy = strat_choice # type: ignore[assignment]
cat_strat = st.selectbox(
"Categorical fallback (for non-numeric columns)",
["mode", "constant", "ffill", "bfill", "none"],
index=0,
)
options.categorical_strategy = cat_strat # type: ignore[assignment]
if options.strategy == "constant" or cat_strat == "constant":
fill_val = st.text_input(
"Constant fill value",
value="",
help="Used when strategy = constant. Leave blank to fill with empty string.",
)
options.fill_value = fill_val
st.markdown("**Drop thresholds**")
col_c, col_d = st.columns(2)
with col_c:
options.row_drop_threshold = st.slider(
"Row drop threshold (drop rows with > this fraction missing across selected cols)",
0.0, 1.0, options.row_drop_threshold, 0.05,
)
with col_d:
options.col_drop_threshold = st.slider(
"Column drop threshold (drop columns with > this fraction missing)",
0.0, 1.0, options.col_drop_threshold, 0.05,
)
# ---------------------------------------------------------------------------
# Placeholder options
# ---------------------------------------------------------------------------
st.markdown("**Scope**")
selected_cols = st.multiselect(
"Columns to handle (default: all)",
options=list(df.columns),
default=list(df.columns),
)
skip_cols = st.multiselect(
"Columns to skip",
options=list(df.columns),
default=[],
)
options.columns = selected_cols if selected_cols else None
options.skip_columns = list(skip_cols)
st.subheader("Detection Settings")
st.text_input(
"Null patterns (comma-separated)",
value="N/A, n/a, NA, -, NULL, None, empty, .",
disabled=True,
help="Values to treat as missing.",
)
st.subheader("Handling Strategy")
st.selectbox("Strategy", [
"Drop rows with any missing",
"Drop rows above threshold",
"Fill with mean (numeric)",
"Fill with median (numeric)",
"Fill with mode (categorical)",
"Forward-fill",
"Backward-fill",
"Custom value",
], disabled=True)
st.slider("Drop threshold (%)", 0, 100, 50, disabled=True, help="Drop rows missing more than this % of columns.")
st.divider()
st.button("Handle Missing Values", type="primary", use_container_width=True, disabled=True)
st.markdown("**Per-column strategy overrides** (optional)")
st.caption(
"Set a different strategy for specific columns. Leave any row blank to "
"use the global strategy."
)
per_col_overrides: dict[str, str] = {}
only_missing_cols = [
r.column for r in initial_profile.columns if r.has_missing
]
if only_missing_cols:
edit_df = pd.DataFrame({
"column": only_missing_cols,
"strategy": ["" for _ in only_missing_cols],
})
edited = st.data_editor(
edit_df,
use_container_width=True,
hide_index=True,
column_config={
"column": st.column_config.TextColumn("Column", disabled=True),
"strategy": st.column_config.SelectboxColumn(
"Override",
options=[
"", "drop_row", "drop_col",
"mean", "median", "mode", "constant",
"ffill", "bfill", "interpolate",
],
),
},
key="missing_per_col_editor",
)
for _, row in edited.iterrows():
if row["strategy"]:
per_col_overrides[row["column"]] = row["strategy"]
options.column_strategies = per_col_overrides # type: ignore[assignment]
# ---------------------------------------------------------------------------
# Footer
# Run
# ---------------------------------------------------------------------------
st.divider()
st.caption(
"Runs locally. Your data never leaves this computer. "
"| DataTools v3.0"
)
if st.button("Handle Missing Values", type="primary", use_container_width=True):
with st.spinner("Handling..."):
try:
result = handle_missing(df, options)
except (ValueError, OSError) as e:
from src.core.errors import format_for_user
st.error(format_for_user(e))
st.stop()
st.session_state["missing_result"] = result
st.session_state["missing_input_name"] = uploaded.name
st.session_state["missing_options"] = options.to_dict()
result = st.session_state.get("missing_result")
if result is None:
st.info("Choose a strategy and click **Handle Missing Values** to run.")
st.stop()
# ---------------------------------------------------------------------------
# Results
# ---------------------------------------------------------------------------
st.subheader("Results")
m1, m2, m3, m4 = st.columns(4)
m1.metric("Sentinels → NaN", result.sentinels_standardized)
m2.metric("Cells filled", result.cells_filled)
m3.metric("Rows dropped", result.rows_dropped)
m4.metric("Columns dropped", len(result.columns_dropped))
if result.columns_dropped:
st.warning(f"Dropped columns: {', '.join(result.columns_dropped)}")
st.markdown("**Missingness — before vs. after**")
before = result.profile_before.to_dataframe().set_index("column")[
["missing", "missing_pct"]
].rename(columns={"missing": "before_missing", "missing_pct": "before_pct"})
after = result.profile_after.to_dataframe().set_index("column")[
["missing", "missing_pct"]
].rename(columns={"missing": "after_missing", "missing_pct": "after_pct"})
combined = before.join(after, how="outer").fillna(0)
st.dataframe(combined, use_container_width=True)
if result.strategy_per_column:
st.markdown("**Strategy applied per column**")
strat_df = pd.DataFrame(
[{"column": c, "strategy": s} for c, s in result.strategy_per_column.items()]
)
st.dataframe(strat_df, use_container_width=True, hide_index=True)
if not result.changes.empty:
st.markdown("**Audit (first 50 changes)**")
audit_view = result.changes.head(50).copy()
audit_view["row"] = audit_view["row"].apply(lambda x: "" if x == -1 else x + 1)
st.dataframe(audit_view, use_container_width=True, hide_index=True)
if len(result.changes) > 50:
st.caption(f"… and {len(result.changes) - 50} more (download the full audit below).")
st.markdown("**Handled preview (first 10 rows)**")
st.dataframe(result.handled_df.head(10), use_container_width=True)
# ---------------------------------------------------------------------------
# Downloads
# ---------------------------------------------------------------------------
st.divider()
stem = Path(st.session_state.get("missing_input_name", "input")).stem
dl_a, dl_b, dl_c = st.columns(3)
with dl_a:
handled_bytes = result.handled_df.to_csv(index=False).encode("utf-8-sig")
st.download_button(
"Download handled CSV",
data=handled_bytes,
file_name=f"{stem}_missing.csv",
mime="text/csv",
)
with dl_b:
if not result.changes.empty:
changes_bytes = result.changes.to_csv(index=False).encode("utf-8-sig")
st.download_button(
"Download changes audit",
data=changes_bytes,
file_name=f"{stem}_missing_changes.csv",
mime="text/csv",
)
with dl_c:
config_bytes = json.dumps(
st.session_state.get("missing_options", {}), indent=2, default=str,
).encode("utf-8")
st.download_button(
"Download config JSON",
data=config_bytes,
file_name="missing_config.json",
mime="application/json",
)
st.divider()
st.caption("Runs locally. Your data never leaves this computer. | DataTools v3.0")
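
The encoding fallback in _read_uploaded above can be illustrated on raw
bytes without pandas. This is a sketch under assumptions: decode_with_fallback
is a hypothetical helper, and it tries "utf-8-sig" first because that codec
accepts UTF-8 input with or without a BOM. Since latin-1 maps every byte to a
character, the loop always succeeds by the last candidate.

```python
from __future__ import annotations


def decode_with_fallback(
    data: bytes, encodings: tuple[str, ...] = ("utf-8-sig", "latin-1")
) -> tuple[str, str]:
    """Decode data with the first encoding that works.

    Returns the decoded text and the name of the encoding that succeeded.
    """
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Only reachable if the caller passes a list without a catch-all codec.
    raise ValueError("no candidate encoding could decode the input")
```

Returning the encoding alongside the text makes it possible to tell the user
when the lossy latin-1 fallback was used.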


@@ -1,102 +1,413 @@
"""DataTools Column Mapper — stub page."""
"""DataTools Column Mapper — Streamlit page."""
from __future__ import annotations
import io
import json
import sys
from pathlib import Path
import pandas as pd
import streamlit as st
_project_root = Path(__file__).resolve().parent.parent.parent.parent
if str(_project_root) not in sys.path:
sys.path.insert(0, str(_project_root))
from src.gui.components import hide_streamlit_chrome, require_normalization_gate
from src.gui.components import (
hide_streamlit_chrome,
pickup_or_upload,
require_normalization_gate,
)
from src.core.column_mapper import (
MapOptions,
PRESETS,
TargetField,
TargetSchema,
infer_mapping,
map_columns,
)
hide_streamlit_chrome()
require_normalization_gate()
# ---------------------------------------------------------------------------
# Header
# ---------------------------------------------------------------------------
st.title("🗂️ Column Mapper")
st.caption("Rename columns, enforce a target schema, and coerce types.")
st.caption(
"Rename columns, enforce a target schema, and coerce types. Runs locally — "
"your data never leaves this computer."
)
st.info("This tool is under development.")
# ---------------------------------------------------------------------------
# What this tool will do
# File upload
# ---------------------------------------------------------------------------
st.markdown("""
**Features:**
- Rename columns via interactive mapping table
- Load a target schema (JSON/CSV) to auto-map columns
- Fuzzy column name matching for automatic suggestions
- Type coercion (string → int, string → date, etc.)
- Drop unmapped columns or keep as-is
- Reorder columns to match target schema
""")
uploaded = pickup_or_upload(
label="Upload CSV or Excel file",
key="colmap_file_upload",
types=["csv", "tsv", "xlsx", "xls"],
)
if uploaded is None:
st.info("Upload a CSV, TSV, or Excel file to begin.")
st.stop()
@st.cache_data(show_spinner=False)
def _read_uploaded(name: str, data: bytes) -> pd.DataFrame:
suffix = Path(name).suffix.lower()
bio = io.BytesIO(data)
if suffix in (".xlsx", ".xls"):
return pd.read_excel(bio)
for enc in ("utf-8", "utf-8-sig", "latin-1"):
try:
bio.seek(0)
sep = "\t" if suffix == ".tsv" else ","
return pd.read_csv(bio, encoding=enc, sep=sep, on_bad_lines="warn")
except UnicodeDecodeError:
continue
bio.seek(0)
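# Last-resort fallback: keep the same separator as the attempts above.
return pd.read_csv(bio, encoding="latin-1", sep="\t" if suffix == ".tsv" else ",")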
try:
df = _read_uploaded(uploaded.name, uploaded.getvalue())
except Exception as e:
from src.core.errors import format_for_user
st.error(
f"**Could not read `{uploaded.name}`**\n\n"
f"```\n{format_for_user(e)}\n```"
)
st.stop()
st.subheader(f"Preview: {uploaded.name}")
st.caption(f"{len(df)} rows, {len(df.columns)} columns")
st.dataframe(df.head(10), use_container_width=True)
st.divider()
# ---------------------------------------------------------------------------
# Schema input
# ---------------------------------------------------------------------------
st.subheader("Target schema")
schema_mode = st.radio(
"How would you like to define the target schema?",
[
"Build interactively (start from current columns)",
"Upload schema JSON",
"Skip (rename / coerce only — no schema)",
],
index=0,
help=(
"An interactive build is fastest for one-off cleanup. Upload a JSON "
"when you have a fixed contract (a CRM import format, db schema). "
"Skip when you only want to rename or coerce specific columns."
),
)
schema: TargetSchema | None = None
if schema_mode.startswith("Upload"):
schema_file = st.file_uploader(
"Schema JSON",
type=["json"],
key="colmap_schema_upload",
help='Format: {"fields": [{"name": "email", "dtype": "string", "required": true, "aliases": ["EmailAddr"]}, ...]}',
)
if schema_file is not None:
try:
schema = TargetSchema.from_dict(json.loads(schema_file.getvalue()))
st.success(f"Loaded {len(schema.fields)} target field(s).")
except Exception as e:
from src.core.errors import format_for_user
st.error(f"**Could not parse schema**\n\n```\n{format_for_user(e)}\n```")
elif schema_mode.startswith("Build"):
st.caption(
"Edit the table to define your target schema. Add rows for fields the "
"input doesn't have yet (with a default), or remove rows for columns "
"you want to drop."
)
initial = pd.DataFrame({
"name": list(df.columns),
"dtype": ["auto"] * len(df.columns),
"required": [False] * len(df.columns),
"default": [""] * len(df.columns),
"aliases": [""] * len(df.columns),
})
edited = st.data_editor(
initial,
use_container_width=True,
num_rows="dynamic",
column_config={
"name": st.column_config.TextColumn("Target name"),
"dtype": st.column_config.SelectboxColumn(
"Type",
options=[
"auto", "string", "integer", "float",
"boolean", "date", "datetime", "category",
],
),
"required": st.column_config.CheckboxColumn("Required"),
"default": st.column_config.TextColumn("Default (for added cols)"),
"aliases": st.column_config.TextColumn(
"Aliases (comma-sep, helps fuzzy-match)",
),
},
key="colmap_schema_editor",
)
fields: list[TargetField] = []
for _, row in edited.iterrows():
name = str(row.get("name", "")).strip()
if not name:
continue
aliases = [
a.strip() for a in str(row.get("aliases", "") or "").split(",")
if a.strip()
]
default_raw = row.get("default")
# NaN never compares equal to itself, so test for it with pd.isna rather
# than tuple membership.
is_blank = default_raw is None or default_raw == "" or (
isinstance(default_raw, float) and pd.isna(default_raw)
)
default_val = None if is_blank else default_raw
fields.append(TargetField(
name=name,
dtype=str(row.get("dtype", "auto")), # type: ignore[arg-type]
required=bool(row.get("required", False)),
aliases=aliases,
default=default_val,
))
if fields:
schema = TargetSchema(fields=fields)
st.divider()
# ---------------------------------------------------------------------------
# File upload (functional)
# Strategy
# ---------------------------------------------------------------------------
uploaded = st.file_uploader(
"Upload CSV or Excel file",
type=["csv", "tsv", "xlsx", "xls"],
help="Upload a file to preview. Processing is not yet available.",
key="colmap_file_upload",
st.subheader("Strategy")
preset_label = st.radio(
"Preset",
[
"rename-only (just rename, leave types alone, keep extras)",
"lenient-schema (rename + coerce + reorder, keep extras)",
"strict-schema (rename + coerce + reorder, drop extras)",
],
index=0,
)
preset_key = preset_label.split(" ", 1)[0]
options = MapOptions.from_preset(preset_key)
options.schema = schema
if uploaded is not None:
import pandas as pd
try:
if uploaded.name.endswith((".xlsx", ".xls")):
df = pd.read_excel(uploaded)
else:
df = pd.read_csv(uploaded)
st.subheader(f"Preview: {uploaded.name}")
st.caption(f"{len(df)} rows, {len(df.columns)} columns")
st.dataframe(df.head(10), use_container_width=True)
st.subheader("Column Mapping")
st.caption("Map source columns to target names. (Interactive mapping coming soon.)")
mapping_data = pd.DataFrame({
"Source Column": df.columns.tolist(),
"Target Column": df.columns.tolist(),
"Type": ["auto"] * len(df.columns),
})
st.dataframe(mapping_data, use_container_width=True, hide_index=True)
except Exception as e:
from src.core.errors import format_for_user
st.error(
f"**Could not read `{uploaded.name}`**\n\n"
f"```\n{format_for_user(e)}\n```"
with st.expander("Advanced options"):
col_a, col_b = st.columns(2)
with col_a:
options.unmapped = st.selectbox( # type: ignore[assignment]
"Unmapped source columns",
["keep", "drop", "error"],
index=["keep", "drop", "error"].index(options.unmapped),
)
options.coerce_types = st.checkbox(
"Coerce types per schema", value=options.coerce_types,
)
options.reorder_to_schema = st.checkbox(
"Reorder to schema order", value=options.reorder_to_schema,
)
with col_b:
options.auto_infer = st.checkbox(
"Auto-infer mapping (fuzzy match)", value=options.auto_infer,
)
options.fuzzy_threshold = st.slider(
"Fuzzy match threshold", 0.0, 1.0, options.fuzzy_threshold, 0.05,
)
options.enforce_required = st.checkbox(
"Enforce required fields", value=options.enforce_required,
)
# ---------------------------------------------------------------------------
# Placeholder options
# Mapping editor — show inferred and let user override
# ---------------------------------------------------------------------------
st.subheader("Schema Options")
st.subheader("Mapping")
st.file_uploader("Load target schema (JSON)", type=["json"], disabled=True, key="colmap_schema")
st.checkbox("Drop unmapped columns", value=False, disabled=True)
st.checkbox("Reorder to match schema", value=True, disabled=True)
st.divider()
st.button("Apply Column Mapping", type="primary", use_container_width=True, disabled=True)
if schema is None:
st.caption(
"No schema — define explicit renames below (left blank means keep "
"the source name)."
)
rename_initial = pd.DataFrame({
"source": list(df.columns),
"target": list(df.columns),
})
rename_edited = st.data_editor(
rename_initial,
use_container_width=True,
column_config={
"source": st.column_config.TextColumn("Source", disabled=True),
"target": st.column_config.TextColumn("Target"),
},
hide_index=True,
key="colmap_rename_only_editor",
)
explicit_mapping: dict[str, str] = {}
for _, row in rename_edited.iterrows():
src = str(row["source"])
tgt = str(row["target"]).strip()
if tgt and tgt != src:
explicit_mapping[src] = tgt
options.mapping = explicit_mapping
else:
inferred = (
infer_mapping(df, schema, threshold=options.fuzzy_threshold)
if options.auto_infer else {}
)
target_options = ["(unmapped)"] + schema.field_names()
map_initial = pd.DataFrame({
"source": list(df.columns),
"target": [inferred.get(c, "(unmapped)") for c in df.columns],
"auto": [c in inferred for c in df.columns],
})
map_edited = st.data_editor(
map_initial,
use_container_width=True,
column_config={
"source": st.column_config.TextColumn("Source", disabled=True),
"target": st.column_config.SelectboxColumn(
"Target", options=target_options,
),
"auto": st.column_config.CheckboxColumn("Auto-suggested", disabled=True),
},
hide_index=True,
key="colmap_schema_mapping_editor",
)
explicit_mapping = {}
for _, row in map_edited.iterrows():
src = str(row["source"])
tgt = str(row["target"])
if tgt and tgt != "(unmapped)":
explicit_mapping[src] = tgt
options.mapping = explicit_mapping
# Disable auto-infer for the actual run since the editor already shows
# the user's resolved choices (they can manually re-select to add).
options.auto_infer = False
# ---------------------------------------------------------------------------
# Footer
# Run
# ---------------------------------------------------------------------------
st.divider()
st.caption(
"Runs locally. Your data never leaves this computer. "
"| DataTools v3.0"
if st.button("Apply Column Mapping", type="primary", use_container_width=True):
with st.spinner("Mapping..."):
try:
result = map_columns(df, options)
except (ValueError, OSError) as e:
from src.core.errors import format_for_user
st.error(format_for_user(e))
st.stop()
st.session_state["colmap_result"] = result
st.session_state["colmap_input_name"] = uploaded.name
st.session_state["colmap_options"] = options.to_dict()
result = st.session_state.get("colmap_result")
if result is None:
st.info("Configure a mapping and click **Apply Column Mapping** to run.")
st.stop()
# ---------------------------------------------------------------------------
# Results
# ---------------------------------------------------------------------------
st.subheader("Results")
m1, m2, m3, m4 = st.columns(4)
m1.metric("Renamed", result.columns_renamed)
m2.metric("Dropped", len(result.columns_dropped))
m3.metric("Added", len(result.columns_added))
m4.metric(
"Coerce fails",
sum(result.coercion_failures.values()) if result.coercion_failures else 0,
)
if result.columns_dropped:
st.warning(f"Dropped columns: {', '.join(result.columns_dropped)}")
if result.columns_added:
st.info(f"Added (with defaults): {', '.join(result.columns_added)}")
if result.coercion_failures:
st.warning(
"Some cells could not be coerced and were left as NaN: "
+ ", ".join(f"{c} ({n})" for c, n in result.coercion_failures.items())
)
if result.mapping:
st.markdown("**Resolved mapping**")
map_df = pd.DataFrame(
[
{"source": s, "target": t, "auto": s in result.inferred_pairs}
for s, t in result.mapping.items()
],
)
st.dataframe(map_df, use_container_width=True, hide_index=True)
st.markdown("**Mapped preview (first 10 rows)**")
st.dataframe(result.mapped_df.head(10), use_container_width=True)
# ---------------------------------------------------------------------------
# Downloads
# ---------------------------------------------------------------------------
st.divider()
stem = Path(st.session_state.get("colmap_input_name", "input")).stem
dl_a, dl_b, dl_c = st.columns(3)
with dl_a:
mapped_bytes = result.mapped_df.to_csv(index=False).encode("utf-8-sig")
st.download_button(
"Download mapped CSV",
data=mapped_bytes,
file_name=f"{stem}_mapped.csv",
mime="text/csv",
)
with dl_b:
audit_bytes = json.dumps({
"mapping": result.mapping,
"inferred_pairs": result.inferred_pairs,
"columns_renamed": result.columns_renamed,
"columns_dropped": result.columns_dropped,
"columns_added": result.columns_added,
"coercion_failures": result.coercion_failures,
"unmapped_kept": result.unmapped_kept,
"missing_required_targets": result.missing_required_targets,
}, indent=2, default=str).encode("utf-8")
st.download_button(
"Download mapping audit",
data=audit_bytes,
file_name=f"{stem}_mapping.json",
mime="application/json",
)
with dl_c:
config_bytes = json.dumps(
st.session_state.get("colmap_options", {}), indent=2, default=str,
).encode("utf-8")
st.download_button(
"Download config JSON",
data=config_bytes,
file_name="column_map_config.json",
mime="application/json",
)
st.divider()
st.caption("Runs locally. Your data never leaves this computer. | DataTools v3.0")
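
The page above calls infer_mapping(df, schema, threshold=...) to pre-fill the
mapping editor. Its implementation is not part of this diff, so the following
is only a sketch of how threshold-based fuzzy matching could work, using the
standard library's difflib. infer_mapping_sketch and _norm are hypothetical
names, and alias handling is left out.

```python
import re
from difflib import SequenceMatcher


def _norm(name):
    """Lowercase and strip everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def infer_mapping_sketch(sources, targets, threshold=0.8):
    """Map each source column to its best-scoring unused target.

    A pair is accepted only when the similarity ratio reaches threshold;
    each target is assigned at most once, in source order.
    """
    mapping = {}
    taken = set()
    for src in sources:
        best_score, best_target = 0.0, None
        for tgt in targets:
            if tgt in taken:
                continue
            score = SequenceMatcher(None, _norm(src), _norm(tgt)).ratio()
            if score > best_score:
                best_score, best_target = score, tgt
        if best_target is not None and best_score >= threshold:
            mapping[src] = best_target
            taken.add(best_target)
    return mapping
```

Sources that score below the threshold are simply absent from the result,
which corresponds to the "(unmapped)" choice in the editor.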


@@ -1,104 +1,370 @@
"""DataTools Pipeline Runner — stub page."""
"""DataTools Pipeline Runner — Streamlit page."""
from __future__ import annotations
import io
import json
import sys
from pathlib import Path
import pandas as pd
import streamlit as st
_project_root = Path(__file__).resolve().parent.parent.parent.parent
if str(_project_root) not in sys.path:
sys.path.insert(0, str(_project_root))
from src.gui.components import hide_streamlit_chrome, require_normalization_gate
from src.gui.components import (
hide_streamlit_chrome,
pickup_or_upload,
require_normalization_gate,
)
from src.core.pipeline import (
Pipeline,
SOFT_DEPENDENCIES,
Step,
TOOL_NAMES,
recommended_pipeline,
run_pipeline,
validate_pipeline,
)
hide_streamlit_chrome()
require_normalization_gate()
# ---------------------------------------------------------------------------
# Header
# ---------------------------------------------------------------------------
st.title("⚙️ Pipeline Runner")
st.caption("Chain tools in sequence and pass output between steps automatically.")
st.info("This tool is under development.")
# ---------------------------------------------------------------------------
# What this tool will do
# ---------------------------------------------------------------------------
st.markdown("""
**Features:**
- Select tools to run in sequence
- Recommended order: Text Cleaner → Format Standardizer → Missing Values → Deduplicator → Validator
- Each step's output feeds into the next step's input
- Per-step configuration overrides
- Progress tracking across all steps
- Final combined report
""")
st.divider()
# ---------------------------------------------------------------------------
# File upload (functional)
# ---------------------------------------------------------------------------
uploaded = st.file_uploader(
"Upload CSV or Excel file",
type=["csv", "tsv", "xlsx", "xls"],
help="Upload a file to preview. Processing is not yet available.",
key="pipeline_file_upload",
st.caption(
"Chain DataTools cleaning steps into one repeatable workflow. The "
"pipeline recommends an order; you stay in control."
)
if uploaded is not None:
import pandas as pd
# ---------------------------------------------------------------------------
# File upload
# ---------------------------------------------------------------------------
uploaded = pickup_or_upload(
    label="Upload CSV or Excel file",
    key="pipeline_file_upload",
    types=["csv", "tsv", "xlsx", "xls"],
)
if uploaded is None:
    st.info("Upload a CSV, TSV, or Excel file to begin.")
    st.stop()
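@st.cache_data(show_spinner=False)
def _read_uploaded(name: str, data: bytes) -> pd.DataFrame:
    """Parse an uploaded CSV, TSV, or Excel file into a DataFrame."""
    suffix = Path(name).suffix.lower()
    bio = io.BytesIO(data)
    if suffix in (".xlsx", ".xls"):
        return pd.read_excel(bio)
    sep = "\t" if suffix == ".tsv" else ","
    # Try strict encodings first. latin-1 maps every byte to a character,
    # so it serves as the last resort.
    for enc in ("utf-8", "utf-8-sig", "latin-1"):
        try:
            bio.seek(0)
            return pd.read_csv(bio, encoding=enc, sep=sep, on_bad_lines="warn")
        except UnicodeDecodeError:
            continue
    # Defensive fallback; keep the same separator so TSV files still parse.
    bio.seek(0)
    return pd.read_csv(bio, encoding="latin-1", sep=sep)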
try:
    df = _read_uploaded(uploaded.name, uploaded.getvalue())
except Exception as e:
    from src.core.errors import format_for_user
    st.error(
        f"**Could not read `{uploaded.name}`**\n\n"
        f"```\n{format_for_user(e)}\n```"
    )
    st.stop()
st.subheader(f"Preview: {uploaded.name}")
st.caption(f"{len(df)} rows, {len(df.columns)} columns")
st.dataframe(df.head(10), use_container_width=True)
st.divider()
# ---------------------------------------------------------------------------
# Pipeline builder
# ---------------------------------------------------------------------------
st.subheader("Pipeline")
mode = st.radio(
    "How would you like to define the pipeline?",
    [
        "Use the recommended default (text-clean → format → missing → dedup)",
        "Build interactively",
        "Upload a saved pipeline JSON",
    ],
    index=0,
)


def _steps_to_rows(steps) -> pd.DataFrame:
    """Flatten pipeline steps into the rows shown in the step editor."""
    return pd.DataFrame([
        {
            "tool": s.tool, "enabled": s.enabled,
            "options_json": json.dumps(s.options),
        }
        for s in steps
    ])


if "pipeline_rows" not in st.session_state:
    st.session_state["pipeline_rows"] = _steps_to_rows(recommended_pipeline().steps)
if mode.startswith("Use the recommended"):
    st.session_state["pipeline_rows"] = _steps_to_rows(recommended_pipeline().steps)
elif mode.startswith("Upload"):
    pipeline_file = st.file_uploader(
        "Pipeline JSON", type=["json"], key="pipeline_upload",
    )
    if pipeline_file is not None:
        try:
            data = json.loads(pipeline_file.getvalue())
            uploaded_pipe = Pipeline.from_dict(data)
            st.session_state["pipeline_rows"] = _steps_to_rows(uploaded_pipe.steps)
            st.success(f"Loaded {len(uploaded_pipe.steps)} step(s).")
        except Exception as e:
            from src.core.errors import format_for_user
            st.error(f"**Could not parse pipeline**\n\n```\n{format_for_user(e)}\n```")
st.caption(
"Edit the table to add, remove, reorder (drag the row index), enable, "
"or configure each step. Tool order is recommended, not enforced — "
"violations surface as warnings below the table."
)
edited = st.data_editor(
    st.session_state["pipeline_rows"],
    use_container_width=True,
    num_rows="dynamic",
    column_config={
        "tool": st.column_config.SelectboxColumn(
            "Tool", options=TOOL_NAMES, required=True,
        ),
        "enabled": st.column_config.CheckboxColumn("Enabled"),
        "options_json": st.column_config.TextColumn(
            "Options (JSON)",
            help='e.g. {"column_types": {"phone": "phone"}}',
        ),
    },
    key="pipeline_editor",
)
st.session_state["pipeline_rows"] = edited
# Build a Pipeline object from the editor state.
steps_list: list[Step] = []
parse_errors: list[str] = []
for i, row in edited.iterrows():
    tool = row.get("tool")
    if not tool or pd.isna(tool):
        continue
    raw_opts = row.get("options_json") or "{}"
    if pd.isna(raw_opts):
        raw_opts = "{}"
    # Parse the per-step options; a bad cell is reported and the row skipped.
    try:
        opts = json.loads(raw_opts) if isinstance(raw_opts, str) else dict(raw_opts)
        if not isinstance(opts, dict):
            raise ValueError("options must be a JSON object")
    except Exception as e:
        parse_errors.append(f"Step {i + 1}: {e}")
        continue
    try:
        steps_list.append(Step(
            tool=str(tool),
            options=opts,
            enabled=bool(row.get("enabled", True)),
        ))
    except Exception as e:
        parse_errors.append(f"Step {i + 1}: {e}")
if parse_errors:
    for err in parse_errors:
        st.error(err)
current_pipeline = Pipeline(steps=steps_list) if steps_list else None
if current_pipeline is not None:
    # Named order_warnings to avoid shadowing the stdlib `warnings` module.
    order_warnings = validate_pipeline(current_pipeline)
    if order_warnings:
        st.warning(
            "Pipeline is out of recommended order:\n\n"
            + "\n".join(f"- {w}" for w in order_warnings)
            + "\n\nThe pipeline will still run; these are recommendations only."
        )
with st.expander("Recommended tool order: why each step belongs where it does"):
    st.markdown(
        "\n".join(
            f"- **{earlier}** before **{later}**: {why}"
            for earlier, later, why in SOFT_DEPENDENCIES
        )
    )
st.divider()
# ---------------------------------------------------------------------------
# Run
# ---------------------------------------------------------------------------
run_disabled = current_pipeline is None or not current_pipeline.steps
if st.button(
    "Run Pipeline",
    type="primary",
    use_container_width=True,
    disabled=run_disabled,
):
    progress = st.progress(0.0, text="Starting...")
    log_box = st.empty()
    log_lines: list[str] = []
    total_enabled = sum(1 for s in current_pipeline.steps if s.enabled)
    completed = [0]  # one-element list so the callback can mutate the counter

    def _on_step(sr) -> None:
        completed[0] += 1
        if sr.skipped:
            log_lines.append(f"{sr.step.display_name()} (skipped)")
        elif sr.error:
            log_lines.append(
                f"{sr.step.display_name()}: {sr.error.splitlines()[0]}"
            )
        else:
            log_lines.append(
                f"{sr.step.display_name()}: {sr.elapsed_seconds * 1000:.0f} ms"
            )
        # Render one bullet per step so each entry gets its own line.
        log_box.markdown("\n".join(f"- {line}" for line in log_lines))
        # Clamp to 1.0: skipped steps also advance the counter, and
        # st.progress rejects values above 1.0.
        progress.progress(
            min(completed[0] / max(total_enabled, 1), 1.0),
            text=f"Step {completed[0]}/{total_enabled}",
        )

    try:
        result = run_pipeline(
            df, current_pipeline,
            on_step_complete=_on_step,
            stop_on_error=False,
        )
    except Exception as e:
        from src.core.errors import format_for_user
        st.error(f"**Pipeline halted**\n\n```\n{format_for_user(e)}\n```")
        st.stop()
    progress.progress(1.0, text="Done")
    st.session_state["pipeline_result"] = result
    st.session_state["pipeline_input_name"] = uploaded.name
result = st.session_state.get("pipeline_result")
if result is None:
    st.info(
        "Configure the pipeline above and click **Run Pipeline** to "
        "execute it on your file."
    )
    st.stop()
# ---------------------------------------------------------------------------
# Results
# ---------------------------------------------------------------------------
st.subheader("Results")
m1, m2, m3, m4 = st.columns(4)
m1.metric("Initial rows", result.initial_rows)
m2.metric("Final rows", result.final_rows)
m3.metric("Steps run", sum(1 for s in result.step_results if not s.skipped))
m4.metric("Elapsed", f"{result.total_elapsed:.2f} s")
st.markdown("**Per-step summary**")
step_df = pd.DataFrame([
    {
        "step": sr.step.display_name(),
        "status": (
            "skipped" if sr.skipped
            else "error" if sr.error
            else "ok"
        ),
        "elapsed_ms": int(sr.elapsed_seconds * 1000),
        "summary": json.dumps(sr.summary, default=str)[:200],
        "error": sr.error or "",
    }
    for sr in result.step_results
])
st.dataframe(step_df, use_container_width=True, hide_index=True)
st.markdown("**Output preview (first 10 rows)**")
st.dataframe(result.final_df.head(10), use_container_width=True)
# ---------------------------------------------------------------------------
# Downloads
# ---------------------------------------------------------------------------
st.divider()
stem = Path(st.session_state.get("pipeline_input_name", "input")).stem
dl_a, dl_b, dl_c = st.columns(3)
with dl_a:
    # utf-8-sig writes a BOM so Excel opens the CSV with the right encoding.
    bytes_csv = result.final_df.to_csv(index=False).encode("utf-8-sig")
    st.download_button(
        "Download cleaned CSV",
        data=bytes_csv,
        file_name=f"{stem}_pipeline.csv",
        mime="text/csv",
    )
with dl_b:
    pipeline_bytes = json.dumps(
        current_pipeline.to_dict() if current_pipeline else {"steps": []},
        indent=2, default=str,
    ).encode("utf-8")
    st.download_button(
        "Download pipeline JSON",
        data=pipeline_bytes,
        file_name="pipeline.json",
        mime="application/json",
        help="Save this and pass --pipeline pipeline.json to the CLI to re-run on next week's file.",
    )
with dl_c:
    audit_bytes = json.dumps({
        "warnings": result.warnings,
        "initial_rows": result.initial_rows,
        "final_rows": result.final_rows,
        "total_elapsed_seconds": result.total_elapsed,
        "steps": [
            {
                "tool": sr.step.tool,
                "name": sr.step.display_name(),
                "enabled": sr.step.enabled,
                "skipped": sr.skipped,
                "elapsed_seconds": sr.elapsed_seconds,
                "summary": sr.summary,
                "error": sr.error,
            }
            for sr in result.step_results
        ],
    }, indent=2, default=str).encode("utf-8")
    st.download_button(
        "Download run audit",
        data=audit_bytes,
        file_name=f"{stem}_pipeline_audit.json",
        mime="application/json",
    )
st.divider()
st.caption("Runs locally. Your data never leaves this computer. | DataTools v3.0")