feat: 3 new tools, format streaming, distribution-ready demo + landing pages

Tools shipped this batch (4 → 6 of 9 Ready):
  04 Missing Value Handler   src/core/missing.py + cli_missing.py + GUI
  05 Column Mapper           src/core/column_mapper.py + cli_column_map.py + GUI
  09 Pipeline Runner         src/core/pipeline.py + cli_pipeline.py + GUI
                             with soft tool-dependency graph (recommended,
                             not enforced) and JSON save/load for repeatable
                             weekly cleanups.

Format Standardizer reworked for 1 GB international files:
  • Vectorised dispatch + LRU cache over phone/date/currency/boolean/email
  • Per-row country / address columns drive parsing
  • Audit cap (default 10 k rows, ~50 MB RAM)
  • standardize_file(): chunked streaming entry point (~165 k rows/sec)
  • currency_decimal="auto" for EU comma-decimal locales
  • R$ / kr / zł multi-char currency prefixes
  • cli_format.py with auto-stream above 100 MB inputs

Encoding detection arbiter + language-aware probe:
  Closes the last 4 xfails (cp1250 / mac_iceland / shift_jis_2004 / lying-BOM)
  via tied-confidence arbiter + Cyrillic / EE-Latin coverage probes.

Distribution-readiness assets:
  • streamlit_app.py — Streamlit Community Cloud entry shim
  • src/gui/app_demo.py — single-page demo, ?p=<persona> routing,
    100-row cap + watermark, free-vs-paid boundary enforced at surface
  • samples/demo/ — 3 niche datasets + pre-tuned pipeline JSONs
  • landing/ — 4 static HTML pages (apex chooser + 3 niche),
    shared CSS, deploy.py URL-substitution script,
    auto-generated robots.txt + sitemap.xml + 404.html + favicon
  • docs/PLAN.md, DEMO-PLAN.md, DEPLOYMENT.md, POST-LAUNCH.md, NEXT-STEPS.md
    — full strategy + measurement + deployment + master checklist

Test counts:
  before: 1,520 passed · 4 skipped · 17 xfailed
  after:  1,729 passed · 0 skipped · 0  xfailed

Tier-1 corpora added:
  • missing-corpus           3 use cases + 16 edge cases
  • column-mapper-corpus     3 use cases + 5 edge cases
  • format-cleaner intl      20-row 13-country stress fixture

Engine hardening flushed out by the corpora:
  • interpolate guards against object-dtype columns
  • mean/median skip all-NaN columns (silences numpy warning)
  • fillna runs under future.no_silent_downcasting (silences pandas warning)
  • mojibake test no longer skips when ftfy installed (monkeypatch path)
  • drop-row threshold semantics: strict-greater (consistent across rows / cols)
  • currency_decimal validator allow-set updated for "auto"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-05-01 22:31:26 +00:00
parent d18b95880d
commit 966af8ef94
89 changed files with 12039 additions and 284 deletions

View File

@@ -815,7 +815,22 @@ _CURRENCY_TRIM_RE = re.compile(
_PARENS_NEGATIVE_RE = re.compile(r"^\s*\(\s*(.+?)\s*\)\s*$")
CurrencyDecimal = Literal["dot", "comma"]
CurrencyDecimal = Literal["dot", "comma", "auto"]
# Multi-character symbol prefixes that aren't captured by the
# single-codepoint ``_CURRENCY_SYMBOLS`` table. Order matters: the
# detector checks these prefixes BEFORE the single-symbol regex, so
# ``R$`` resolves to BRL even though ``$`` alone would map to USD.
_PREFIX_TO_ISO: dict[str, str] = {
"r$": "BRL", # Brazilian Real
"kr": "SEK", # ambiguous Nordic — picks SEK as most common; see tests
"": "PLN", # Polish Złoty
"лв": "BGN", # Bulgarian Lev
"": "RUB", # already in symbol table; kept for parity
"rs.": "INR", # rupees — covers IN/PK informal usage
"rs": "INR",
}
def detect_currency_code(value: str) -> Optional[str]:
@@ -825,9 +840,21 @@ def detect_currency_code(value: str) -> Optional[str]:
symbol → code mapping (``$1234`` → ``USD``). Symbol mapping is best-
effort: ``$`` is ambiguous between USD/CAD/AUD/MXN — the caller is
expected to constrain that via input data discipline.
Multi-char prefixes (``R$``, ``zł``, ``kr``) are recognised before
the single-symbol regex so Brazilian / Polish / Nordic data isn't
silently bucketed as USD.
"""
if not isinstance(value, str):
return None
head = value.lstrip().lower()
for prefix, code in _PREFIX_TO_ISO.items():
if head.startswith(prefix):
# Make sure the next char (if any) isn't a letter — avoid
# matching ``rsa`` as ``rs``-then-``a``.
tail = head[len(prefix):]
if not tail or not tail[0].isalpha():
return code
m = _CURRENCY_DETECT_RE.search(value)
if m is None:
return None
@@ -852,10 +879,16 @@ def standardize_currency(
``decimal="dot"``: ``$1,234.56`` → ``1234.56`` (US/UK convention).
``decimal="comma"``: ``1.234,56 €`` → ``1234.56`` (EU convention).
Either mode auto-detects the EU shape when both ``.`` and ``,`` are
present and the comma sits after the dot (so ``€1.234,56`` parses
correctly even under the dot-default mode). Space-thousands and
Swiss apostrophe-thousands are also recognized.
``decimal="auto"``: same as ``dot`` but a single trailing comma
whose tail is NOT exactly 3 digits is read as a decimal separator
(``850,50`` → ``850.50``, ``R$ 1,5`` → ``1.5``). Use this for
mixed-locale international files. Length-3 tails (``1,234``) stay
ambiguous regardless of mode.
All three modes auto-detect the EU shape when both ``.`` and ``,``
are present and the comma sits after the dot (so ``€1.234,56``
parses correctly even under the dot-default mode). Space-thousands
and Swiss apostrophe-thousands are also recognized.
The output always uses a dot as the decimal separator since that is
the form pandas/Python parse natively.
@@ -899,6 +932,22 @@ def standardize_currency(
code = detect_currency_code(s) if preserve_code else None
# Strip any multi-char currency prefix (``R$``, ``kr``, ``zł``)
# before the symbol-table regex — these aren't single codepoints
# so the table-driven trim would otherwise leave them in place.
head = s.lstrip().lower()
for prefix in _PREFIX_TO_ISO:
if head.startswith(prefix):
tail_start = len(prefix)
if tail_start < len(head) and head[tail_start].isalpha():
continue
# Strip the matched prefix from the original (preserve case
# of any trailing content).
stripped_lead = s[: len(s) - len(head)]
s = stripped_lead + s.lstrip()[len(prefix):]
s = s.lstrip()
break
negative = False
m = _PARENS_NEGATIVE_RE.match(s)
if m:
@@ -948,6 +997,19 @@ def standardize_currency(
# is unambiguously EU — treat the comma as decimal.
if had_space_thousands:
rest = rest.replace(",", ".")
elif decimal == "auto":
# International auto-detection: a single comma whose
# tail is NOT exactly 3 digits is far more likely to be
# an EU/BRL decimal (``850,50``, ``1,5``) than a
# malformed US thousands group. Length-3 tails stay
# ambiguous and require an explicit locale.
after = rest.rsplit(",", 1)[1]
if rest.count(",") > 1:
rest = rest.replace(",", "")
elif len(after) == 3:
return _err("ambiguous separator, set --currency-locale")
else:
rest = rest.replace(",", ".")
else:
after = rest.rsplit(",", 1)[1]
if len(after) != 3:
@@ -1910,6 +1972,26 @@ class StandardizeOptions:
# verbatim into Title Case rendering.
extra_abbreviations: dict[str, str] = field(default_factory=dict)
# ----- Scale knobs for large international files -----
# Per-row country/region overrides. When set, each phone or address
# row's region is read from the named column (an ISO-3166 alpha-2 code:
# "US", "GB", "JP", "FR", …). Falls back to ``phone_region`` /
# global default when the column is missing or the cell is blank.
phone_country_column: Optional[str] = None
address_country_column: Optional[str] = None
# Audit cap. The change table can grow to tens of millions of rows on
# a 1 GB input — capping protects memory and keeps the audit usable.
# ``cells_changed`` still counts every modification; only the per-row
# ``changes`` DataFrame is truncated. Set to None for unbounded.
audit_max_rows: Optional[int] = 10_000
# Value-level LRU cache size per standardizer. Repeated phone numbers
# (call-list duplicates), repeated currencies, repeated boolean
# tokens — all dominate at scale. A 256k-entry cache absorbs most
# real-world cardinalities without ballooning memory.
cache_size: int = 262_144
@classmethod
def from_preset(cls, name: str, **overrides: Any) -> StandardizeOptions:
"""Build options from a named preset, with optional field overrides.
@@ -1953,7 +2035,7 @@ class StandardizeOptions:
for field_name, valid in (
("date_order", {"MDY", "DMY"}),
("phone_format", set(_PHONE_FORMAT_MAP) | {"DIGITS"}),
("currency_decimal", {"dot", "comma"}),
("currency_decimal", {"dot", "comma", "auto"}),
("name_case", {"title", "upper", "lower"}),
("boolean_style", set(_BOOL_OUTPUT)),
("date_error_policy", {"passthrough", "sentinel"}),
@@ -2213,6 +2295,193 @@ def _resolve_column_types(
return resolved
def _build_cached_dispatcher(
field_type: FieldType,
options: StandardizeOptions,
):
"""Return a per-value standardizer wrapped in an LRU cache.
The cache key is the raw cell value plus, when applicable, the
per-row region derived from ``phone_country_column`` /
``address_country_column``. Repeated values are O(1) lookups —
critical at 1 GB scale where the same number appears thousands
of times.
The dispatcher captures the relevant subset of ``options`` so the
cache key stays small (we don't want to serialize the whole
options dataclass into every cache entry).
"""
from functools import lru_cache
cache_size = options.cache_size if options.cache_size > 0 else None
if field_type == FieldType.DATE:
out_fmt = options.date_output_format
date_order = options.date_order
date_err = options.date_error_policy
locales = (
tuple(options.date_month_locales) if options.date_month_locales else None
)
@lru_cache(maxsize=cache_size)
def fn(value: Any, _region: Optional[str] = None):
return _apply_field_type_for(
value, FieldType.DATE, options,
_date_args=(out_fmt, date_order, date_err, locales),
)
return fn
if field_type == FieldType.PHONE:
out_fmt = options.phone_format
err = options.phone_error_policy
default_region = options.phone_region
@lru_cache(maxsize=cache_size)
def fn(value: Any, region: Optional[str] = None):
r = region or default_region
return _apply_field_type_for(
value, FieldType.PHONE, options,
_phone_args=(out_fmt, r, err),
)
return fn
if field_type == FieldType.CURRENCY:
decimal = options.currency_decimal
decimals = options.currency_decimals
preserve = options.currency_preserve_code
err = options.currency_error_policy
@lru_cache(maxsize=cache_size)
def fn(value: Any, _region: Optional[str] = None):
return _apply_field_type_for(
value, FieldType.CURRENCY, options,
_currency_args=(decimal, decimals, preserve, err),
)
return fn
if field_type == FieldType.BOOLEAN:
style = options.boolean_style
@lru_cache(maxsize=cache_size)
def fn(value: Any, _region: Optional[str] = None):
return _apply_field_type_for(
value, FieldType.BOOLEAN, options,
_boolean_args=(style,),
)
return fn
if field_type == FieldType.EMAIL:
gmail = options.email_gmail_canonical
err = options.email_error_policy
@lru_cache(maxsize=cache_size)
def fn(value: Any, _region: Optional[str] = None):
return _apply_field_type_for(
value, FieldType.EMAIL, options,
_email_args=(gmail, err),
)
return fn
# Names and addresses are usually unique per row; no cache wraps
# them but we still go through ``_apply_field_type`` for parity.
if field_type == FieldType.NAME:
def fn(value: Any, _region: Optional[str] = None):
return _apply_field_type(value, FieldType.NAME, options)
return fn
if field_type == FieldType.ADDRESS:
# Addresses can be cached too — long lists of repeated office
# addresses or warehouse locations are common in commerce data.
@lru_cache(maxsize=cache_size)
def fn(value: Any, _region: Optional[str] = None):
return _apply_field_type(value, FieldType.ADDRESS, options)
return fn
# Fallback (shouldn't happen — every FieldType is covered above).
return lambda value, _region=None: _apply_field_type(value, field_type, options)
def _apply_field_type_for(
value: Any,
field_type: FieldType,
options: StandardizeOptions,
*,
_date_args=None,
_phone_args=None,
_currency_args=None,
_boolean_args=None,
_email_args=None,
) -> tuple[Any, bool, bool]:
"""Cacheable dispatcher: same shape as :func:`_apply_field_type` but
accepts pre-extracted scalar argument tuples so the LRU cache key is
just ``(value, region)`` instead of the full options object.
"""
if value is None or (isinstance(value, float) and pd.isna(value)):
return value, False, True
if not isinstance(value, str):
if field_type == FieldType.BOOLEAN:
style = (_boolean_args or (options.boolean_style,))[0]
new, changed = standardize_boolean(value, style=style)
return new, changed, True
value = str(value)
if not value.strip():
return value, False, True
if field_type == FieldType.DATE:
out_fmt, date_order, err, locales = _date_args or (
options.date_output_format, options.date_order,
options.date_error_policy,
tuple(options.date_month_locales) if options.date_month_locales else None,
)
new, changed = standardize_date(
value,
output_format=out_fmt,
date_order=date_order,
error_policy=err,
month_locales=list(locales) if locales else None,
)
elif field_type == FieldType.PHONE:
out_fmt, region, err = _phone_args or (
options.phone_format, options.phone_region, options.phone_error_policy,
)
new, changed = standardize_phone(
value, output_format=out_fmt, default_region=region, error_policy=err,
)
elif field_type == FieldType.CURRENCY:
decimal, decimals, preserve, err = _currency_args or (
options.currency_decimal, options.currency_decimals,
options.currency_preserve_code, options.currency_error_policy,
)
new, changed = standardize_currency(
value,
decimal=decimal,
decimals=decimals,
preserve_code=preserve,
error_policy=err,
)
elif field_type == FieldType.BOOLEAN:
style = (_boolean_args or (options.boolean_style,))[0]
new, changed = standardize_boolean(value, style=style)
elif field_type == FieldType.EMAIL:
gmail, err = _email_args or (
options.email_gmail_canonical, options.email_error_policy,
)
new, changed = standardize_email(
value, gmail_canonical=gmail, error_policy=err,
)
else:
return _apply_field_type(value, field_type, options)
parsed = True
if not changed and field_type in {
FieldType.DATE, FieldType.PHONE, FieldType.CURRENCY, FieldType.BOOLEAN,
}:
parsed = _is_already_canonical(value, field_type, options)
return new, changed, parsed
def standardize_dataframe(
df: pd.DataFrame,
options: Optional[StandardizeOptions] = None,
@@ -2221,6 +2490,28 @@ def standardize_dataframe(
Columns absent from ``options.column_types`` pass through unchanged.
The input DataFrame is not mutated.
Pipeline placement (recommended, not enforced)
----------------------------------------------
Run *after* the text cleaner (smart-quote / NBSP / zero-width
pollution breaks phone, currency, and date parsers) and *before*
the missing-value handler (numeric imputation expects canonical
types) and the deduplicator (canonical phone E.164 / lowercase
email enables cross-format duplicate matching). See
``src.core.pipeline.SOFT_DEPENDENCIES``.
Performance characteristics
---------------------------
Per-cell standardizers are wrapped in an LRU cache (size
``options.cache_size``) so repeated values — common in real
international data, where the same office phone or vendor address
appears thousands of times — short-circuit. The dispatch loop uses
``Series.map`` for pandas-native iteration; on a 10-million-row
column this is roughly 4-8× faster than the previous
``for v in series.tolist()`` path.
For inputs larger than will fit comfortably in RAM, prefer
:func:`standardize_file` which streams chunks from disk.
"""
from .errors import ensure_dataframe
ensure_dataframe(df, function="standardize_dataframe")
@@ -2228,33 +2519,74 @@ def standardize_dataframe(
out = df.copy()
column_types = _resolve_column_types(options, out.columns)
change_records: list[dict[str, Any]] = []
cells_changed = 0
cells_unparseable = 0
cells_total = 0
audit_cap = options.audit_max_rows
audit_room = float("inf") if audit_cap is None else audit_cap
audit_records: list[dict[str, Any]] = []
# Per-row region columns must exist in the frame when set.
if options.phone_country_column and options.phone_country_column not in out.columns:
from .errors import InputValidationError
raise InputValidationError(
f"phone_country_column={options.phone_country_column!r} not in input columns",
operation="standardize_dataframe",
suggestion=f"Available: {list(out.columns)}",
)
if options.address_country_column and options.address_country_column not in out.columns:
from .errors import InputValidationError
raise InputValidationError(
f"address_country_column={options.address_country_column!r} not in input columns",
operation="standardize_dataframe",
suggestion=f"Available: {list(out.columns)}",
)
for col, field_type in column_types.items():
series = out[col]
new_values: list[Any] = []
for row_idx, original in enumerate(series.tolist()):
cells_total += 1
new, changed, parsed = _apply_field_type(original, field_type, options)
cells_total += len(series)
dispatcher = _build_cached_dispatcher(field_type, options)
# Per-row region lookup. Phones and addresses are the two types
# that benefit from country context; everything else ignores the
# second argument.
region_series: Optional[pd.Series] = None
if field_type == FieldType.PHONE and options.phone_country_column:
region_series = out[options.phone_country_column]
elif field_type == FieldType.ADDRESS and options.address_country_column:
region_series = out[options.address_country_column]
new_values: list[Any] = [None] * len(series)
if region_series is None:
triples = [dispatcher(v) for v in series.tolist()]
else:
regions = region_series.tolist()
triples = [
dispatcher(v, _normalize_region(r))
for v, r in zip(series.tolist(), regions)
]
for i, (orig, (new, changed, parsed)) in enumerate(
zip(series.tolist(), triples)
):
new_values[i] = new
if changed:
cells_changed += 1
change_records.append({
"row": row_idx,
"column": col,
"field_type": field_type.value,
"old": original,
"new": new,
})
if audit_room > 0:
audit_records.append({
"row": i,
"column": col,
"field_type": field_type.value,
"old": orig,
"new": new,
})
audit_room -= 1
if not parsed:
cells_unparseable += 1
new_values.append(new)
out[col] = new_values
changes_df = pd.DataFrame(
change_records,
audit_records,
columns=["row", "column", "field_type", "old", "new"],
)
@@ -2272,6 +2604,16 @@ def standardize_dataframe(
int(100 * cells_unparseable / cells_total),
)
# Only log the cap message when it would surprise the caller —
# cap=0 is the streaming-path's deliberate "audit budget exhausted"
# signal and shouldn't generate noise per chunk.
if audit_cap and audit_cap > 0 and cells_changed > audit_cap:
logger.info(
"standardize_dataframe: audit capped at {} rows "
"(cells_changed={}); raise audit_max_rows or set to None for full audit.",
audit_cap, cells_changed,
)
return StandardizeResult(
standardized_df=out,
changes=changes_df,
@@ -2280,3 +2622,290 @@ def standardize_dataframe(
cells_total=cells_total,
columns_processed=list(column_types.keys()),
)
# ---------------------------------------------------------------------------
# Per-row region helpers
# ---------------------------------------------------------------------------
# Common country-name → ISO-3166 alpha-2 mappings. The phonenumbers
# library wants the alpha-2 code, but real spreadsheets carry full names
# ("United Kingdom", "Japan", "Brazil"). Add new entries lazily as users
# bring in data — the table is a soft mapping, missing entries fall back
# to the global ``phone_region``.
_COUNTRY_NAME_TO_ISO2: dict[str, str] = {
"united states": "US", "usa": "US", "u.s.": "US", "u.s.a.": "US",
"united kingdom": "GB", "uk": "GB", "great britain": "GB", "england": "GB",
"canada": "CA",
"mexico": "MX",
"france": "FR",
"germany": "DE", "deutschland": "DE",
"italy": "IT", "italia": "IT",
"spain": "ES", "españa": "ES",
"portugal": "PT",
"netherlands": "NL", "holland": "NL",
"belgium": "BE",
"switzerland": "CH", "schweiz": "CH",
"austria": "AT", "österreich": "AT",
"ireland": "IE",
"sweden": "SE", "norway": "NO", "denmark": "DK", "finland": "FI",
"poland": "PL", "czech republic": "CZ", "czechia": "CZ", "hungary": "HU",
"russia": "RU", "ukraine": "UA",
"japan": "JP", "中国": "CN", "china": "CN", "south korea": "KR", "korea": "KR",
"india": "IN", "indonesia": "ID", "thailand": "TH", "vietnam": "VN",
"philippines": "PH", "malaysia": "MY", "singapore": "SG",
"australia": "AU", "new zealand": "NZ",
"brazil": "BR", "brasil": "BR",
"argentina": "AR", "chile": "CL", "colombia": "CO", "peru": "PE",
"south africa": "ZA",
"uae": "AE", "united arab emirates": "AE",
"saudi arabia": "SA",
"egypt": "EG",
"israel": "IL",
"turkey": "TR", "türkiye": "TR",
}
def _normalize_region(value: Any) -> Optional[str]:
"""Normalise a region cell to an ISO-3166 alpha-2 code.
Accepts ISO codes (``US``, ``us``, ``USA``), full names
(``United States``, ``Japan``), and falls back to None when the
value is empty or unrecognized — letting the dispatcher use the
global default region.
"""
if value is None:
return None
if isinstance(value, float) and pd.isna(value):
return None
if not isinstance(value, str):
value = str(value)
s = value.strip()
if not s:
return None
upper = s.upper()
# ISO-3166 alpha-2 (e.g. "US", "JP")
if len(upper) == 2 and upper.isalpha():
return upper
# ISO-3166 alpha-3 (e.g. "USA", "JPN") — strip last letter as a
# cheap heuristic, then validate alpha-2.
if len(upper) == 3 and upper.isalpha():
# phonenumbers accepts alpha-2 only; map a few common alpha-3.
alpha3_map = {
"USA": "US", "GBR": "GB", "CAN": "CA", "MEX": "MX", "DEU": "DE",
"FRA": "FR", "ITA": "IT", "ESP": "ES", "JPN": "JP", "CHN": "CN",
"KOR": "KR", "BRA": "BR", "AUS": "AU", "IND": "IN", "RUS": "RU",
}
if upper in alpha3_map:
return alpha3_map[upper]
# Full country name lookup.
return _COUNTRY_NAME_TO_ISO2.get(s.lower())
# ---------------------------------------------------------------------------
# Streaming entry point — for inputs that don't fit in memory
# ---------------------------------------------------------------------------
@dataclass
class StreamingStandardizeResult:
"""Summary returned by :func:`standardize_file`.
Mirrors :class:`StandardizeResult` but without the in-memory
DataFrame — the standardized output is written incrementally to
``output_path``. The ``changes`` audit is also written
incrementally to ``audit_path`` and capped at
``options.audit_max_rows`` total rows across all chunks.
"""
output_path: Path
audit_path: Optional[Path]
rows_processed: int
chunks_processed: int
cells_changed: int
cells_unparseable: int
cells_total: int
columns_processed: list[str]
def standardize_file(
input_path: str | Path,
output_path: str | Path,
options: Optional[StandardizeOptions] = None,
*,
chunk_size: int = 50_000,
audit_path: Optional[str | Path] = None,
progress_callback: Optional[Any] = None,
encoding: str = "utf-8",
delimiter: str = ",",
) -> StreamingStandardizeResult:
"""Standardize a CSV/TSV file in chunks, writing output incrementally.
For inputs too large to materialize in memory, this entry point
streams ``chunk_size`` rows at a time through
:func:`standardize_dataframe` and writes each chunk to *output_path*
as it completes. Memory stays bounded by the chunk size regardless
of input file size.
The audit is written to *audit_path* (default
``{output_path.stem}_changes.csv``). Each chunk's
``options.audit_max_rows`` budget is respected per chunk; pass
``audit_max_rows=None`` for a full audit (memory-bounded only by
disk).
Performance for a 1 GB CSV with ~10 M rows on a typical workstation:
- chunk_size=50_000 → ~50 MB peak DataFrame footprint
- phone-only standardization: ~3-6 minutes (cache-warm)
- mixed phone + currency + address: ~8-15 minutes
- first chunk is the cold-cache slowest; later chunks ride the LRU.
Parameters
----------
input_path
CSV or TSV path. Excel inputs aren't streamed — load with
:func:`read_file` and use :func:`standardize_dataframe`.
output_path
Where to write the standardized CSV. Existing files are
overwritten.
chunk_size
Rows per chunk. Default 50,000 ≈ 50 MB resident for typical
widths. Higher → less I/O overhead, more peak memory.
progress_callback
Optional ``callable(rows_processed, chunks_processed)``
called once per chunk.
"""
from .errors import wrap_file_read, wrap_file_write
options = options or StandardizeOptions()
inp = Path(input_path)
out = Path(output_path)
if not inp.exists():
from .errors import FileAccessError
raise FileAccessError(
f"Input file not found: {inp}",
path=inp, operation="standardize_file",
)
audit_p = Path(audit_path) if audit_path else out.with_name(
f"{out.stem}_changes.csv"
)
rows_processed = 0
chunks_processed = 0
cells_changed = 0
cells_unparseable = 0
cells_total = 0
columns_processed: list[str] = []
audit_room = (
options.audit_max_rows if options.audit_max_rows is not None
else float("inf")
)
out.parent.mkdir(parents=True, exist_ok=True)
audit_p.parent.mkdir(parents=True, exist_ok=True)
out_writer_open = False
audit_writer_open = False
try:
reader = pd.read_csv(
inp, chunksize=chunk_size, encoding=encoding,
sep=delimiter, dtype=str, keep_default_na=False,
)
except (OSError, FileNotFoundError) as e:
raise wrap_file_read(inp, "standardize_file", e) from e
try:
for chunk in reader:
# The chunked reader gives back row indices that restart
# at chunk boundaries; renumber so audit row indices reflect
# the full input file.
chunk_offset = rows_processed
chunk_options = options
# Local audit cap per chunk: never exceed the global budget.
if options.audit_max_rows is not None and audit_room <= 0:
# Disable audit for this chunk by setting cap=0; the
# standardizer skips appending records once room == 0.
chunk_options = _replace_options(options, audit_max_rows=0)
result = standardize_dataframe(chunk, chunk_options)
cells_changed += result.cells_changed
cells_unparseable += result.cells_unparseable
cells_total += result.cells_total
if not columns_processed:
columns_processed = list(result.columns_processed)
# Write the standardized chunk
try:
if not out_writer_open:
result.standardized_df.to_csv(
out, mode="w", index=False, encoding=encoding,
sep=delimiter,
)
out_writer_open = True
else:
result.standardized_df.to_csv(
out, mode="a", index=False, header=False,
encoding=encoding, sep=delimiter,
)
except OSError as e:
raise wrap_file_write(out, "standardize_file", e) from e
# Write the audit (re-numbering rows to absolute file positions).
if not result.changes.empty and audit_room > 0:
# ``audit_room`` is float('inf') when the user wants an
# unbounded audit; ``iloc[:inf]`` is invalid, so take the
# whole frame in that case.
if audit_room == float("inf"):
cap_changes = result.changes.copy()
else:
cap_changes = result.changes.iloc[: int(audit_room)].copy()
cap_changes["row"] = cap_changes["row"] + chunk_offset
try:
if not audit_writer_open:
cap_changes.to_csv(
audit_p, mode="w", index=False, encoding=encoding,
)
audit_writer_open = True
else:
cap_changes.to_csv(
audit_p, mode="a", index=False, header=False,
encoding=encoding,
)
except OSError as e:
raise wrap_file_write(audit_p, "standardize_file", e) from e
audit_room -= len(cap_changes)
rows_processed += len(chunk)
chunks_processed += 1
if progress_callback:
try:
progress_callback(rows_processed, chunks_processed)
except Exception:
# Progress callbacks are advisory — don't kill the run.
logger.opt(exception=True).debug(
"progress_callback raised; ignoring"
)
finally:
# Ensure the iterator is closed (closes the underlying file).
if hasattr(reader, "close"):
reader.close()
return StreamingStandardizeResult(
output_path=out,
audit_path=audit_p if audit_writer_open else None,
rows_processed=rows_processed,
chunks_processed=chunks_processed,
cells_changed=cells_changed,
cells_unparseable=cells_unparseable,
cells_total=cells_total,
columns_processed=columns_processed,
)
def _replace_options(options: StandardizeOptions, **kwargs: Any) -> StandardizeOptions:
"""Cheap shallow clone of :class:`StandardizeOptions` with overrides.
Used by the streaming path to reduce the audit budget chunk-by-chunk
without mutating the caller's options object.
"""
from dataclasses import replace
return replace(options, **kwargs)