Commit Graph

215 Commits

Author SHA1 Message Date
64452dd783 perf: dedup blocking, column-parallel scaffolding, lazy-copy pipelines
Three follow-on wins from the audit, each with shape-pinning tests.

1. Dedup blocking
   - Exact-only strategies (every column EXACT @ 100 — covers strong-
     key dedup like email/phone, the drop-duplicates fallback, and
     explicit "match on this exact column" calls) now route through
     an O(n) groupby fast path. Lossless; no API change required.
     Measured: 10k-row email-exact dedup → 73 ms (was ~30 minutes
     via the O(n²) pair compare).
   - Fuzzy strategies still pair-compare, with opt-in prefix blocking
     via deduplicate(..., blocking_columns=[...], blocking_prefix_len=1).
     Measured: 5k-row fuzzy-name → 25.6s with blocking vs 179s
     without (7x). Trade-off: cross-block matches missed.

2. Column-parallel standardize
   - StandardizeOptions.parallel_columns (default 1) lands a
     ThreadPoolExecutor over the column loop. Output order and
     audit-record order are preserved deterministically via a merge
     step keyed off column_types order. Honest doc: under CPython
     3.12's GIL the win is roughly neutral (phonenumbers/dateutil
     hold the GIL); the API is ready for free-threaded Python 3.13+.

3. Lazy-copy in missing / column_mapper
   - _standardize_sentinels now builds per-column changes in a dict
     and only materialises the output frame when at least one column
     actually changed. On a clean 1 GB file this skips a 1 GB
     allocation.
   - handle_missing carries an out_is_owned flag, copying on demand
     before any mutating step. No-op runs return the input frame.
   - map_columns drops the unconditional upfront df.copy(); rename
     and drop both return fresh frames already, and schema-add /
     coerce trigger _ensure_owned() lazily.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 15:54:25 +00:00
e5f632bcd6 docs(perf): publish 1.5 GB target and the new measured throughputs
REQUIREMENTS §10 reflects the post-optimisation numbers and the
known O(n²) dedup match step (flagged for a future blocking pass).
en/es upload-limit copy and uploader help now say 1.5 GB.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 15:37:26 +00:00
5b672370a6 perf: cache hot paths, drop wasted allocations, lift 1 GB → 1.5 GB
Five targeted wins driven by an end-to-end audit, with shape-pinning
regression tests so reverts are loud:

- format_standardize: fuse the dispatcher loop into one pass — was
  calling Series.tolist() three times per typed column and materialising
  an intermediate triples list; now one tolist, one walk. On a
  synthetic 1M-row phone+email frame this measures ~2.7M rows/sec
  (vs. the previous 150k/sec doc target).
- dedup: wrap normalizers in a per-call lru_cache so repeat phones /
  emails / addresses skip re-parsing. phonenumbers.parse is the
  expensive call; ~2–5x faster on the normalisation step for realistic
  workloads.
- analyze: _detect_near_duplicates no longer copies the full input
  frame; builds only the normalised string columns via a dict and
  references non-string columns by view. Skips the redundant
  astype(str) when a column is already pandas string dtype.
- text_clean: hoist _build_pipeline out of the per-cell loop and add a
  per-call string cache so 100k repeats of "Active" only run the
  pipeline once. ~1M rows/sec on repetition-heavy columns.
- io.repair_bytes: the non-UTF-8 smart-quote fold path used a
  Python-level zip walk over the entire decoded string to count
  replacements — replaced with sum(text.count(c) ...) which runs in
  C at ~GB/s. Was a latent ~100s on a 1 GB cp1252 file; now <1s.

Updates REQUIREMENTS §10 with measured numbers and bumps the buyer-
facing upload limit from 1 GB to 1.5 GB across the i18n packs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 15:37:26 +00:00
318b9b45dc docs(i18n): ship Spanish translations of buyer-facing docs
Adds README.es.md, docs/README.es.md, docs/USER-GUIDE.es.md, and
docs/CLI-REFERENCE.es.md mirroring the English client-facing set.
Each English doc gains a one-line language-switch banner pointing at
its Spanish counterpart; the docs index advertises both language sets
in the buyer-facing section. Internal docs (TECHNICAL, DECISIONS,
REQUIREMENTS, BUSINESS, RECOVERY) stay English-only by design — they
don't ship with the product.

The CLI itself emits English only, so CLI-REFERENCE.es.md notes that
flags and values are language-invariant while translating the prose.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 15:21:18 +00:00
38011872e1 docs(i18n): document language packs across user, dev, and marketing docs
README + USER-GUIDE describe the sidebar picker and current coverage
(home + shared chrome, per-tool bodies pending). DEVELOPER gains a
how-to for adding packs and keys with the parity-test guarantee.
TECHNICAL §10b records the in-house-JSON architecture and locks in the
no-gettext decision (also logged in DECISIONS). REQUIREMENTS reflects
the new interface surface and updated test count. COPY.md adds a
"Language claim" slot so landing/email work can pick it up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 15:16:24 +00:00
c4ce86bd64 feat(i18n): add language-pack scaffold with English and Spanish
Introduces ``src/i18n`` with a tiny JSON-backed t() lookup, an in-session
language preference, and a sidebar selector wired through
``hide_streamlit_chrome`` so every page picks up the same picker. Covers
home, tool cards, findings panel, gate, shutdown, and pickup banner
strings. Tests pin pack parity and the farewell-overlay JS escape so
future packs can't silently regress.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 15:11:30 +00:00
4706ed571e build: wire desktop-bundle pipeline (CI matrix + per-platform installers)
Stand up the seamless-download path for non-technical buyers:

* .github/workflows/build.yml — matrix CI (mac/win/linux) that builds
  PyInstaller bundles and packages them per platform on tag push,
  attaching the resulting installers to a GitHub Release.
* build/installer.iss — Inno Setup script for the Windows installer
  (per-user install, optional desktop shortcut, runs on finish).
* build/macos/build_dmg.sh — wraps DataTools.app into a .dmg with a
  drag-to-/Applications layout.
* build/appimage/{AppRun,datatools.desktop,build.sh} — AppImage recipe.
* src/__init__.py — single source of truth for __version__; the spec
  reads it (was hardcoded), CI passes it through to all packagers.

Buyer download path now lives in the top-level README. Per-build
README documents the Phase 2 step (signing/notarization) that needs
the owner's Apple Developer + Windows code-signing credentials —
those are intentionally not in CI yet because they require setup
outside this repo.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 13:58:43 +00:00
ea89c4d399 ui(gui): say 'window' instead of 'browser tab' in shutdown copy
Update the Close page intro, the shutdown overlay, and the toast so
they all read "you can close this window" — clearer for users running
the app in a dedicated browser window rather than a tab.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 13:51:32 +00:00
701108c9d5 fix(gui): inject farewell overlay into parent DOM on shutdown
Replaces the data:-URL navigation (blocked by Chrome since v60 for
top-frame navigation) with a direct DOM-append of a full-screen
overlay onto the parent document. Uses z-index 2147483647 so it sits
above Streamlit's connection-error banner when the websocket drops.

Note: still doesn't fully suppress the connection-error banner in
testing — the next iteration will render the overlay through
Streamlit's own page rather than via a component iframe.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 13:49:48 +00:00
340614e642 feat(gui): promote Quit to a 'Close' menu item in the sidebar nav
Move the shutdown control out of the inline sidebar widget and into
its own page (pages/99_Close.py), so it appears in the sidebar nav
alongside the tool pages. An explicit confirm button on the page
prevents accidental nav clicks from killing a live session.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 13:38:02 +00:00
58c0195def fix(gui): make Quit button actually terminate the server
Signalling the process with SIGTERM/SIGINT didn't reliably shut Streamlit
down — its tornado/asyncio loop swallowed or deferred the signal, so the
browser saw the websocket drop ("Connection error") while the python
process kept running. Replace the signal with a daemon-thread
``os._exit(0)`` after a short delay so the current rerun can paint the
"shutting down" message before the process is hard-killed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 13:36:36 +00:00
30e257cc44 fix(gui): move Quit button to sidebar so it shows on every page
The footer placement was easy to miss (below all tool cards) and only
rendered on the home page. Hook the button into hide_streamlit_chrome()
so every page that hides default chrome — home + all 9 tool pages — gets
the Quit button at the bottom of the sidebar without per-page edits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 13:33:32 +00:00
0c25d80146 fix(gui): keep sidebar reopenable + add clean Quit button
The chrome-hiding CSS was removing the Streamlit header wholesale,
which also took the sidebar's expand chevron with it — a collapsed
sidebar became unreopenable. Make the header transparent instead and
explicitly preserve the sidebar collapsed-control.

Also add a Quit button in the app footer that signals the Streamlit
server (SIGTERM, falling back to SIGINT) so closing the GUI returns
the shell prompt cleanly instead of leaving Python hung.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 13:30:10 +00:00
e1f364f010 feat: Tier B operator scaffolding — bundle, copy SoT, posts, emails
Pick up and finish yesterday's cut-off Tier B pass.

- build/: PyInstaller scaffold (datatools.spec + launcher.py +
  hook-streamlit.py + README) — folder-mode bundle, locked
  127.0.0.1, per-OS recipe
- marketing/COPY.md: single source of truth for every customer-facing
  string — landing H1/sub/CTAs, demo CTAs, email subjects, Gumroad
  listing, banned phrases
- marketing/community-posts/: 9 drafts (3 posts × 3 niches:
  bookkeeper, revops, shopify-pet) — story / tip / soft-offer
- marketing/emails/: 18 drafts (Gumroad delivery + 5-touch
  onboarding × 3 niches), per-niche segmentation guidance
- docs/NEXT-STEPS.md: flip 2.2 / 2.4 / 3.1 / 3.4 to done with
  pointers to the new assets; add Phase 0 inventory rows
- .gitignore: narrow `build/` ignore so PyInstaller spec + launcher
  + hooks get tracked, only generated artifacts (build/build/,
  build/__pycache__/, build/dist/) stay ignored

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 14:04:37 +00:00
966af8ef94 feat: 3 new tools, format streaming, distribution-ready demo + landing pages
Tools shipped this batch (4 → 6 of 9 Ready):
  04 Missing Value Handler   src/core/missing.py + cli_missing.py + GUI
  05 Column Mapper           src/core/column_mapper.py + cli_column_map.py + GUI
  09 Pipeline Runner         src/core/pipeline.py + cli_pipeline.py + GUI
                             with soft tool-dependency graph (recommended,
                             not enforced) and JSON save/load for repeatable
                             weekly cleanups.

Format Standardizer reworked for 1 GB international files:
  • Vectorised dispatch + LRU cache over phone/date/currency/boolean/email
  • Per-row country / address columns drive parsing
  • Audit cap (default 10 k rows, ~50 MB RAM)
  • standardize_file(): chunked streaming entry point (~165 k rows/sec)
  • currency_decimal="auto" for EU comma-decimal locales
  • R$ / kr / zł multi-char currency prefixes
  • cli_format.py with auto-stream above 100 MB inputs

Encoding detection arbiter + language-aware probe:
  Closes the last 4 xfails (cp1250 / mac_iceland / shift_jis_2004 / lying-BOM)
  via tied-confidence arbiter + Cyrillic / EE-Latin coverage probes.

Distribution-readiness assets:
  • streamlit_app.py — Streamlit Community Cloud entry shim
  • src/gui/app_demo.py — single-page demo, ?p=<persona> routing,
    100-row cap + watermark, free-vs-paid boundary enforced at surface
  • samples/demo/ — 3 niche datasets + pre-tuned pipeline JSONs
  • landing/ — 4 static HTML pages (apex chooser + 3 niche),
    shared CSS, deploy.py URL-substitution script,
    auto-generated robots.txt + sitemap.xml + 404.html + favicon
  • docs/PLAN.md, DEMO-PLAN.md, DEPLOYMENT.md, POST-LAUNCH.md, NEXT-STEPS.md
    — full strategy + measurement + deployment + master checklist

Test counts:
  before: 1,520 passed · 4 skipped · 17 xfailed
  after:  1,729 passed · 0 skipped · 0  xfailed

Tier-1 corpora added:
  • missing-corpus           3 use cases + 16 edge cases
  • column-mapper-corpus     3 use cases + 5 edge cases
  • format-cleaner intl      20-row 13-country stress fixture

Engine hardening flushed out by the corpora:
  • interpolate guards against object-dtype columns
  • mean/median skip all-NaN columns (silences numpy warning)
  • fillna runs under future.no_silent_downcasting (silences pandas warning)
  • mojibake test no longer skips when ftfy installed (monkeypatch path)
  • drop-row threshold semantics: strict-greater (consistent across rows / cols)
  • currency_decimal validator allow-set updated for "auto"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 22:31:26 +00:00
d18b95880d feat(format-i18n): broaden international coverage across all domains
Closes ~17 high-value international gaps surfaced by parallel review.
Adds 93 regression tests; full project suite now 1323 / 0 / 17 (passed
/ failed / xfailed).

DATES
- Adds Portuguese, Italian, Dutch, Russian month dictionaries to the
  opt-in ``month_locales`` set (now: en, fr, de, es, pt, it, nl, ru).
- Adds localized weekday recognition for those locales — "Lundi",
  "Montag", "lunedì", "понедельник", etc. all strip cleanly before
  format matching.
- New CJK separator normalization: Japanese ``2024年01月15日`` and
  fullwidth digits ``2024/01/15`` fold to ASCII before parsing.
- New named-timezone resolution: EST/PST/JST/CET/IST/GMT/etc. map to
  fixed UTC offsets via ``_NAMED_TZ_OFFSETS`` so the trailing TZ
  doesn't block format matching.
- New ISO 8601 extended formats: week date (``2024-W03-1``) and
  ordinal date (``2024-015``), plus RFC 2822 mail-header form
  (``Mon, 15 Jan 2024 10:30:00``).
- New ``two_digit_year_cutoff`` parameter on ``standardize_date()`` —
  defaults to Python's stdlib 69; lower it for birth-year columns
  where most subjects were born ≤ 1999.

NAMES
- Particles set extended with Arabic patronymic markers (bin, ibn,
  bint, abu, abd, al, al-, el-) and Hebrew (ben, bat, ha, ha-).
- Title set extended with German (Herr, Frau), French (M., Mme,
  Mlle), Spanish (Sr., Sra., Srta., Don, Doña), Italian (Sig., Sig.ra,
  Dott.), Portuguese.
- Acronym map extended with international academic credentials
  (Dipl, Ing, Mag, Habil, MSc, BSc, LLB, LLM).
- New East Asian honorific suffix handler: ``Tanaka-san``,
  ``Lee-ssi``, ``Park-nim`` keep the suffix lowercase after the
  hyphen instead of being title-cased into ``Tanaka-San``.
- Hyphenated-segment handler now keeps Arabic prefixes ``al-`` /
  ``el-`` lowercase per Arabic transliteration convention.
- New ``family_first`` parameter on ``standardize_name()`` and matching
  ``name_family_first`` field on ``StandardizeOptions`` — set
  per-column for East Asian data to skip Western comma-format reversal
  (``Kim, Min-jae`` stays ``Kim, …`` instead of becoming ``Min-jae Kim``).

CURRENCY
- Symbol map extended: ฿(THB), ₫(VND), ₮(MNT), ₴(UAH), ₦(NGN),
  ₱(PHP), ₲(PYG), ﷼(SAR), ₨(PKR), ₵(GHS) — covers SE Asia, Africa,
  Eastern Europe, Latin America gaps.
- ISO 4217 code list extended from 23 to ~50: SAR, AED, QAR, KWD,
  BHD, OMR, ARS, CLP, COP, EGP, IDR, MYR, PHP, THB, VND, NGN, GHS,
  KES, HUF, CZK, RON, UAH, KZT, etc.

EMAIL
- New BIDI / RTL override stripping (``standardize_email``):
  U+202A-U+202E and U+2066-U+2069 stripped from every email. These
  are a known phishing vector — ``alice‮@example.com`` displays as
  ``alice@elpmaxe.com`` to RTL-aware renderers.

ADDRESS
- Canadian provinces: 13 codes + names → 2-letter (Ontario → ON).
- UK postcode pattern recognition (``SW1A 2AA`` shape).
- Australian states: 8 codes + names (NSW, VIC, QLD, … + full names).
- German Bundesland: 16 codes + names (Bayern → BY, etc.).
- International PO Box variants: ``Postfach`` (DE), ``Boîte postale``
  (FR), ``Apartado`` (ES), ``Casella postale`` (IT), ``Caixa postal``
  (PT) — all fold to canonical ``PO Box``.
- ``_INTL_STATE_CODES`` now combines US/CA/AU/DE codes; the position
  check that preserves state codes regardless of input case applies
  to all four jurisdictions.
- ``_is_state_code_position`` postal pattern broadened to recognize
  US ZIP, AU 4-digit, CA first half, and UK outward code.

CONSTANTS
- ``src/core/_constants.py`` gains: ``CA_PROVINCE_CODES`` /
  ``CA_PROVINCE_NAMES``, ``AU_STATE_CODES`` / ``AU_STATE_NAMES``,
  ``DE_STATE_CODES`` / ``DE_STATE_NAMES``, ``POSTAL_PATTERNS``
  (us/ca/uk/de/au/fr), ``INTL_PO_BOX_PATTERNS`` (per-language regex),
  ``INTL_STREET_SUFFIXES`` (de/fr/es/it/uk dictionaries — ready for
  use when address takes a `country_hint` parameter in a future pass).

DOCS
- TECHNICAL.md §11.3 domain table updated with the new handling per
  domain plus a new "International coverage" sub-section listing the
  supported locales / symbols / jurisdictions.

DEFERRED (out of scope or rare)
- Alternative calendars (Japanese era, Hijri, Hebrew, Buddhist) —
  corpus § 3.5 marks out of scope.
- Persian/Arabic-Indic digit conversion — rare in tabular data.
- Trailing-minus RTL currency convention.
- Punycode ↔ Unicode IDN normalization.
- Mixed-country phone column auto-detection (user can override
  ``default_region`` per column).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 03:06:03 +00:00
abb720997e docs: tight, scannable rewrite — every item earns its place
Refactors all 10 docs (README, USER-GUIDE, CLI-REFERENCE, REQUIREMENTS,
TECHNICAL, DEVELOPER, BUSINESS, DECISIONS, RECOVERY, docs/README) from
prose-heavy to bullet-heavy + table-heavy. Same information density,
significantly less reading load.

Net: 2600 → 1652 lines (~37% reduction) WHILE adding the new content
that landed since v1.6:

- Format Standardizer (3rd Ready tool)
- 199-row buyer corpus
- src/core/errors.py structured hierarchy + ensure_dataframe /
  ensure_choice / wrap_file_read|write / format_for_user helpers
- src/core/_constants.py shared USPS/state lookup tables
- Cross-tool audit fixes (NaN matching, removed_df schema, validation,
  enum-bounds checks, forward-compat config)
- Per-domain error_policy across format standardizers
- Inconsistent-date-format detector
- Excel header-row auto-detection + write_file delimiter param

Per-doc changes:

- README.md (175 → 71): 9-tool table at top, status column, 3 CLI
  entry points listed, dropped repeated marketing prose.
- docs/README.md (38 → 27): pure index — buyer-facing vs creator-only
  split + version footer.
- USER-GUIDE.md (208 → 118): tool table replaces script descriptions,
  troubleshooting compressed to bullets, gate explanation tightened.
- CLI-REFERENCE.md (451 → 235): collapsed flag tables, removed
  redundant intro text, kept full recipes section.
- REQUIREMENTS.md (146 → 129): 18 numbered sections (was 17), added
  §18 Error Handling, formatting tightened to single-line entries.
- TECHNICAL.md (570 → 350): collapsed §3 build pipeline tables, merged
  redundant §3.5-3.7 OS sections, added §7 (Error handling) +
  §11.3 (Format Standardizer spec) + §11.4-11.7 (analyzer / gate /
  Review page / repair_bytes promoted from §10.2.x sub-numbering).
- DEVELOPER.md (285 → 161): module map table replaces per-file prose,
  extension recipes condensed, new §Errors covers when to use each
  hierarchy class.
- BUSINESS.md (278 → 225): collapsed prose to tables (use cases,
  competitive landscape, costs, risks); honest-status updated.
- DECISIONS.md (269 → 189): scoring rubric + GUI matrix preserved,
  decision log compressed to single-line entries, added v1.6 entries
  (Format Standardizer Ready, errors module).
- RECOVERY.md (180 → 147): rebuild steps as numbered + tabular,
  external dependencies as one table, recovery priorities tightened.

No information removed; redundancy compressed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 02:49:29 +00:00
26b9771625 feat(errors): structured error hierarchy + helpful messages everywhere
Introduces src/core/errors.py with a small structured error hierarchy
that every public entry point now uses. Each error carries the
context a user needs to fix it and the context a maintainer needs to
trace it.

The hierarchy:
  DataToolsError  (base — formats path, column, operation, suggestion)
    InputValidationError  (extends ValueError — bad arg / wrong type)
    ConfigError           (extends ValueError — bad config / options)
    FileFormatError       (extends ValueError — file is not what we expected)
    FileAccessError       (extends OSError   — file I/O failure)

Subclassing the stdlib bases means existing `except OSError` /
`except ValueError` handlers still catch them — no breaking change.

Helpers:
- ensure_dataframe(value, function=...)  — uniform DataFrame guard
- ensure_choice(value, name=, choices=)  — uniform enum/literal guard
- wrap_file_read(path, op, exc)          — tag OSError with hint + path
- wrap_file_write(path, op, exc)         — same, with Windows-aware tip
- format_for_user(exc, context=)         — user-facing string for st.error / stderr

Library hardening:
- io.read_file: missing files surface FileAccessError listing whether
  the parent directory exists, and the suggestion to check the path.
- io.read_file: chunk_size <= 0 now raises InputValidationError with
  a positive-integer suggestion.
- io._read_excel: openpyxl BadZipFile / InvalidFileException / pandas
  ValueError ("sheet not found") wrapped as FileFormatError listing
  the path and a "list sheets with list_sheets()" hint.
- io._detect_excel_header_row: bare except narrowed to specific
  openpyxl exceptions; falls back gracefully and logs at debug so
  the real error surfaces from pd.read_excel.
- io.write_file: OSError / PermissionError on to_csv/to_excel wrapped
  with file path and Windows-aware "file may be open in another
  program" hint.
- dedup._parse_date: bare `except Exception` narrowed to
  (TypeError, ValueError, OutOfBoundsDatetime); failed values
  logged at debug for survivor-selection forensics.
- dedup._select_survivor: KEEP_MOST_RECENT now raises
  InputValidationError instead of silently falling back to keep_first.
- dedup.deduplicate: input validation errors are InputValidationError
  with operation/column/suggestion fields.
- format_standardize.from_dict: invalid FieldType for a column raises
  ConfigError naming the column AND the bad value AND listing valid
  values; same for date_order / phone_format / etc.
- format_standardize.from_file: OSError / JSON decode wrapped with
  path AND line/column where parsing failed.
- format_standardize.to_file: TypeError on json.dumps wrapped as
  ConfigError with the suspected source (extra_abbreviations).
- format_standardize._apply_field_type: dispatcher's "unknown field
  type" branch now raises AssertionError (it's an internal invariant,
  not user error — a new enum value was added without a branch).
- format_standardize._resolve_column_types: missing-column error now
  InputValidationError with a "check for typos / unparsed header"
  suggestion.
- format_standardize.standardize_dataframe: ensure_dataframe at entry.
- text_clean.clean_dataframe: ensure_dataframe at entry.
- config.to_strategies: invalid Algorithm/NormalizerType wrapped as
  ConfigError naming the strategy index AND the column.
- config.to_survivor_rule: invalid SurvivorRule wrapped as ConfigError
  listing valid values.
- config.from_file: OSError / JSON decode wrapped (mirror of
  StandardizeOptions.from_file).
- fixes.repair_mojibake: ImportError on ftfy now logged at info level
  with the underlying ImportError so a corrupt-package vs not-installed
  distinction is visible in the logs.
- normalizers.normalize_phone: phonenumbers.NumberParseException now
  logged at debug when the digits-only fallback drops extension /
  country-code information — gives a trail when matching results
  look wrong.

GUI / CLI surfaces:
- All 9 page handlers (`except Exception as e: st.error(...)`) now
  use format_for_user(), which renders DataToolsError fields nicely
  and falls back to "ClassName: message" for unrecognized errors.
- 2_Text_Cleaner and 3_Format_Standardizer additionally distinguish
  UnicodeDecodeError with an "re-save as UTF-8" suggestion before
  the generic handler.
- cli.py's "Error reading file" handler now uses format_for_user()
  and includes the input path in the prefix.

Tests:
- tests/test_errors.py — 22 new tests covering: base class formatting,
  stdlib inheritance, ensure_dataframe / ensure_choice helpers,
  wrap_file_read / wrap_file_write, format_for_user behavior, and
  end-to-end integration (missing file, missing dir, bad JSON, bad
  algorithm, bad enum, missing column).
- tests/test_audit_fixes.py + tests/test_io.py — updated 4 tests for
  the new exception types (InputValidationError replaces TypeError,
  FileAccessError extends OSError).

Full project suite: 1230 passed, 4 skipped, 17 xfailed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 02:35:42 +00:00
2eece6467d refactor: dedup, consolidate, harden public APIs across core modules
Closes 16 high-value findings from a parallel cross-module review.

Refactors:
- New src/core/_constants.py centralizes USPS street-suffix
  abbreviations, US state names, and 2-letter postal codes — one source
  of truth for both normalize_address (matching keys) and
  standardize_address (display formatting). Eliminates ~80 lines of
  duplicated dicts across normalizers.py and format_standardize.py.
- format_standardize.py: collapse 4 identical nested _err() helpers
  into one shared _err_or_passthrough() module function; drop a dead
  duplicate `return _err("not a phone number")` branch in
  standardize_phone.
- format_standardize.py: precompile per-locale month-name regexes
  (_MONTH_LOCALE_PATTERNS) and per-state-name regexes
  (_STATE_NAME_PATTERNS) at import time — they were rebuilt on every
  cell, a measurable hot path on million-row inputs.
- dedup.py: extract _is_missing(value) helper; one definition of
  "this cell is None / NaN / pd.NA" instead of two.
- fixes.py: extract _is_string_column(ser) helper; one dtype check
  instead of three duplicates across _apply_to_strings,
  _vectorized_translate, _vectorized_regex_sub.

Production-readiness:
- format_standardize.standardize_dataframe now logs a warning when
  more than 10% of typed cells are unparseable — surfaces the
  silently-broken-pipeline failure mode.
- StandardizeOptions.from_dict validates date_order / phone_format /
  currency_decimal / name_case / boolean_style / *_error_policy
  enum values up front, with a clear error message instead of a deep
  crash inside the per-cell function.
- StandardizeOptions.from_file and DeduplicationConfig.from_file wrap
  read + json.loads with descriptive OSError / ValueError messages
  including the file path.
- standardize_date(month_locales=...) validates locale codes against
  the available set instead of silently passing through unknown ones.
- io.read_file rejects chunk_size <= 0 (was silently failing inside
  pandas) and logs the resolved suffix + chunk_size at info level so
  data-pipeline runs are debuggable.
- io.read_file's FileNotFoundError gains explanatory context.
- io.write_file, text_clean.clean_dataframe, and dedup.deduplicate
  now reject non-DataFrame inputs with clear TypeError instead of
  cryptic pandas tracebacks downstream.
- dedup.deduplicate validates that survivor_rule=KEEP_MOST_RECENT has
  a usable date_column up front; the helper _select_survivor now
  raises (instead of silently falling back to keep_first) when called
  directly with bad arguments.
- dedup.deduplicate gains a structured no-op return when strategies
  is empty after auto-detection — preserves schema instead of crashing.
- analyze._detect_inconsistent_date_format narrows its bare except to
  (TypeError, ValueError) and logs a debug line so genuine bugs don't
  hide behind silent skip.

Tests:
- tests/test_audit_fixes.py grows by 11 cases covering the new
  validation paths (chunk_size, DataFrame guards, KEEP_MOST_RECENT
  date_column, enum validation, locale validation, JSON error wrapping).

Full project suite: 1208 passed, 4 skipped, 17 xfailed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 02:23:09 +00:00
b23a27d4e3 fix: cross-tool audit findings + alignment with format standardizer
Closes 12 bugs and 8 gaps surfaced by parallel audits across all core
modules, plus aligns the dedup-side normalizers with the new
format_standardize behavior where they had silently diverged.

Bugs (data integrity / correctness):
- dedup: NaN/None values matched as duplicates because str(None)='None'.
  Two rows with missing email silently merged.
- dedup: removed_df had 0 columns when nothing was removed; downstream
  code expecting matching schema broke. Now preserves column shape.
- dedup: ColumnMatchStrategy threshold accepted any value; out-of-range
  silently broke matching. Validated to [0, 100] in __post_init__.
- dedup: strategy referencing a missing column was silently skipped.
  Now raises ValueError listing available columns.
- fixes: replace_null_sentinels crashed on non-string sentinels (int/None
  from JSON payload). Coerced to str.
- fixes: _vectorized_regex_sub raised raw re.error on bad patterns. Now
  wraps as ValueError with clear message.
- io: detect_header_row mis-identified all-empty and metadata-only rows
  as headers (all([]) is True). Now requires ≥2 non-empty cells.
- config: from_dict crashed when JSON had unknown fields, breaking
  forward compat. Now filters to known fields.
- analyze: mixed-case email detector flagged all-None columns because
  str(None)='None' contains both N and one. Now drops NaN before stringify.

New features and gap closures:
- io: _detect_excel_header_row mirrors detect_header_row for Excel via
  openpyxl read-only; _read_excel uses it when header_row=None.
- io: write_file gains delimiter + encoding params; .tsv extension
  defaults to tab.
- normalizers: normalize_phone preserves extensions as ;ext=N suffix.
- normalizers: normalize_address folds spelled-out US state names to
  2-letter codes (California ≡ CA).
- normalizers: normalize_name drops surname particles (van, de, von)
  so "Charles de Gaulle" ≡ "Charles Gaulle" for matching.
- analyze: new _detect_inconsistent_date_format detector flags columns
  with mixed ISO/US/EU date shapes; routes to format standardizer.
- analyze: _NULL_LIKE recognizes "<na>" (pd.NA repr).
- analyze: duplicate-row finding renamed count → n_extra (rows that
  would actually be removed) with clarified description.
- dedup: group_confidence no longer falsely 100.0 when transitive group
  members lack a recorded direct pair; falls back to 100.0 only when
  truly no pairs were observed.
- dedup: MatchResult / DeduplicationResult docstrings clarify that
  row_indices refer to the input frame's positional index (output index
  is reset).
- text_clean: visualize_hidden_html(None) now returns None (matches
  visualize_hidden_text); strip_bom strips at most one BOM per call;
  sentence_case dead elif branch removed.

Tests:
- tests/test_audit_fixes.py — 28 regression tests, one or more per
  numbered finding, named after BUG/GAP/NIT tags so future readers
  can trace each test back to its audit.
- tests/test_fixes_unit.py — 26 isolated unit tests for previously
  integration-only fix functions (trim_whitespace, strip_nbsp,
  strip_zero_width, normalize_line_endings, clean_headers,
  repair_mojibake — last skipped if ftfy unavailable).
- tests/test_io.py — adds CSV / TSV / semicolon / UTF-8-BOM round-trip
  tests + Excel auto-header-detection tests.
- tests/test_normalizers.py — adds 8 tests for the alignment work
  above (phone extension, state names, particles).

Adds .claude/ to .gitignore (agent worktrees + local settings).

Full project suite: 1197 passed, 4 skipped, 17 xfailed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 02:11:57 +00:00
4adeb5c7f3 feat(format): per-cell standardizers + 199-row buyer corpus
Adds src/core/format_standardize.py — a per-cell standardizer for dates,
phones, emails, addresses, names, currencies, booleans — wired through
StandardizeOptions / standardize_dataframe with FieldType registry.

Includes:
- Date parser handles ISO/US/EU/longform/excel-serial/unix-timestamp/
  partial-precision/quarter notation; opt-in French/German/Spanish month
  dictionaries via month_locales.
- Phone via libphonenumber with extension preservation (;ext=N), 001
  international prefix handling, error sentinels for placeholders /
  multi-number cells.
- Email lowercase/trim/mailto/angle-bracket strip with optional
  --gmail-canonical mode.
- Address USPS abbreviation expansion or compression (expand=False per
  corpus § 6.3), state-name → 2-letter conversion, multi-line collapse,
  PO Box normalization, state-code preservation regardless of input case.
- Name handler: Mc/Mac/O'/D' inner caps, hyphen segments, particle
  lowercasing (von/van/de/da), comma-format reversal, period stripping
  for titles/suffixes/initials, PhD/MD acronym preservation, conservative
  mode for mixed-case input.
- Currency: auto-detect EU vs US separators, space-thousands, Swiss
  apostrophe, accounting parens, optional ISO code preservation, error
  sentinels for percentages/ranges/word-values/ambiguous separators.
- Per-domain error_policy ("passthrough" | "sentinel") for surfacing
  malformed values as <error: reason> per corpus § 0.3.

Test corpus from Business/DataTools/test-cases-format-cleaner copied to
test-cases/format-cleaner-corpus/ — 7 fixtures plus FORMATS-CASES.md.
tests/test_format_standardize_corpus.py drives all 199 rows through the
per-cell standardizers; 0 xfailed.

Wires the GUI page (3_Format_Standardizer.py) to "Ready" status.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 02:11:24 +00:00
3f007ef3d6 feat(gui): 1 GB upload cap + delimiter / encoding diversity caption
Streamlit's default file_uploader footer reads "Limit 200MB per file —
CSV, TSV, XLSX, XLS" which contradicts the 1 GB efficiency target shipped
in 438bc0f and codified in docs/REQUIREMENTS.md §1.1.

Three changes:

1. .streamlit/config.toml — set [server] maxUploadSize = 1024. Footer
   now reads "Limit 1024MB per file".

2. upload_and_analyze_section (home page) — adds an explicit caption
   above the uploader stating size limit, supported formats, the four
   auto-detected delimiters, and the 13 auto-detected encodings (with
   the Review-page override as the safety net).

3. pickup_or_upload (every tool page that falls back to its own
   uploader when no home-page upload is present) — same caption,
   only rendered when the upload accepts CSV/TSV/XLSX/XLS so JSON
   schema / config uploaders aren't decorated.

Test suite: 765 passed, 17 xfailed (no regressions). Home + Review +
Deduplicator pages all serve HTTP 200 under the new config.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 21:23:21 +00:00
3dd924474c docs: short-form numbered requirements list
New docs/REQUIREMENTS.md catalogs every shipped capability in 17 numbered
categories — file handling, input/output encodings, delimiters, line
endings, detectors, finding schema, confidence tiers, decisions,
performance targets (1 GB), tools, gate behavior, interfaces, platforms,
deps, test coverage, privacy. Linked from README and USER-GUIDE so a
buyer / integrator can scan compliance in under a minute.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 21:19:21 +00:00
ec56b1994b chore: remove one-time 1.25GB stress harness
The stress benchmark served its purpose — perf findings shipped in
438bc0f (1 GB-class file efficiency for the analyzer + gate pipeline).
Removing the script and the (already auto-deleted) test fixture so the
repo doesn't carry one-time scaffolding. Future ad-hoc benchmarks can
resurrect this from git history.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 21:15:58 +00:00
438bc0f84d perf: 1 GB-class file efficiency for the analyzer + gate pipeline
Six targeted changes that drop the user-visible analyzer scan time from
"go for coffee" to sub-second on 1 GB inputs and reduce peak RSS by ~10×.

src/core/io.py
  - detect_encoding: open + read sample bytes instead of read_bytes()[:N].
    Was allocating the full file in memory just to slice the head; on a
    1 GB input this saves a 1 GB intermediate allocation.
  - repair_bytes: byte-level smart-quote fold via bytes.replace when the
    input is UTF-8. The probe (b"\\xe2\\x80" / b"\\xc2\\xab" / b"\\xc2\\xbb")
    is a single C-implemented contains check that skips the entire fold
    stage on files with no smart quotes — most of them.
  - repair_bytes: skip the per-row csv.reader walk unless a cheap byte
    scan finds a currency sigil ($/€/£), the delimiter is non-comma, the
    decoder substituted U+FFFD, or _has_field_count_mismatch detects an
    unquoted-delimiter row. csv.reader was the dominant cost in
    repair_bytes on big files (materializes a list of every row).
  - _has_field_count_mismatch: hand-rolled quote-state walker; one pass,
    no allocation, returns True at first mismatch. False positives just
    fall through to the slower _repair_rows pass.

src/core/analyze.py
  - _load_for_analysis: read only ~max(4KB, sample_rows × 256B × 2) head
    bytes for the analyzer's sample-mode scan. Drops analyze(sample_rows
    =1000) from "read + repair full file" to "read + repair 500KB" —
    150× faster on a 1.25 GB file. Falls back to a single full-file
    retry if pandas reports fewer rows than the cap.
  - Compiled regex character classes for hot-path detectors and a
    _vec_match_count helper that runs Series.str.contains in C instead
    of Python per-cell loops. Detectors converted: smart_punctuation,
    invisible_chars (NBSP + zero-width), whitespace_padding,
    null_like_sentinels, mojibake, encoding_uncertainty,
    mixed_case_email, leading_zero_ids.

src/core/fixes.py
  - _vectorized_translate / _vectorized_regex_sub: pandas-native string
    transforms for the fixes that are pure character maps (strip_nbsp,
    fold_smart_punctuation, strip_zero_width). Series.str.translate
    runs in C — 10-50× faster than per-cell Python.
  - _apply_to_strings: replaced inner per-cell loops with Series.map +
    boolean-mask diff for the count.
  - All fix entry points read an "inplace" flag from payload and thread
    it through the helpers.

src/core/normalize.py
  - apply_decisions: takes a single working copy at the top, then sets
    payload["inplace"] = True so each chained fix mutates that copy.
    Previously every fix did df.copy(); N fixes × 6 GB DataFrame =
    30+ GB peak. Now: one 6 GB allocation.

Validation: 765 passed, 17 xfailed (no regressions). 100 MB benchmark:

  stage                              before       after
  ------------------------------     -------      --------
  detect_encoding                    0.97s+1.3GB  ~0s + 0 MB
  analyze (sample_rows=1000)         235.76s      0.08s
  _load_for_analysis (1000 rows)     148.17s      0.01s
  repair_bytes (full file)           150s/1.25GB  2.91s/100MB

The user-visible analyzer scan dropped from minutes to sub-second on
1 GB-class files. Full-DataFrame analyze + auto_fix improvements are
more modest (~25%) because trim_whitespace and replace_null_sentinels
still need per-cell Python for the structural-shape checks, but the
hot path through these is now bounded by pandas' .map rather than a
manual for loop.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 21:13:47 +00:00
f891c6116d refactor(gui): tool registry + components package for per-tool builds
Two low-risk seam moves to enable selling per-tool subsets without
breaking the existing all-in-one bundle. Behaviour identical; every
existing import still resolves; full pytest suite + every page returns
HTTP 200.

1. **Tool registry** (src/gui/tools_registry.py) — replaces the
   inline dict-of-dicts in app.py with a Tool dataclass and a TOOLS
   list. Adds a tier field ("core" today, "pro" / "enterprise" later)
   and tools_for_tier() / tool_by_id() / display_name() helpers. A
   per-tool build slices TOOLS at import time without code changes.

2. **components package** (src/gui/components/) — converts the former
   single components.py into a package with:
     _legacy.py        — original file, unchanged.
     __init__.py       — re-exports the legacy surface; existing
                         "from src.gui.components import …" calls
                         continue to work.
     shared.py         — hide_streamlit_chrome, pickup_or_upload
                         (every build needs these).
     gate.py           — require_normalization_gate (Pro / Suite SKUs).
     findings.py       — analyzer-finding widgets (drops out of a
                         standalone-Dedup build).
     dedup_review.py   — match-group cards + apply pipeline (drops out
                         of a non-dedup build).

   The seam modules are narrow re-exports today. As code migrates out
   of _legacy.py into the focused modules, the public import path
   stays stable via the shim.

E2E: 765 passed, 17 xfailed (unchanged); home page + all 9 tool pages
+ Review page render HTTP 200; full pipeline (analyze → auto_fix →
apply_decisions → output bytes) round-trips on the kitchen-sink
fixture with zero high-confidence findings remaining post-fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 20:56:21 +00:00
70ed695027 test(scripts): one-shot 1.25GB stress harness for the gate pipeline
Generates a synthetic messy CSV at the target size, then runs every
pipeline stage end-to-end (detect_encoding, repair_bytes, analyze,
auto_fix on sample + full file) capturing wall-clock and peak RSS at
each stage. Not part of the automated suite — invoke directly via
``python scripts/stress_1_25gb.py``. ``--keep`` to preserve the file
between runs, ``--target-gb`` to tune the size.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 20:52:27 +00:00
82d7fef21e feat(gate): CSV-normalization gate with confidence-tiered findings
Adds a Review & Normalize page that sits between upload and every tool
page. The analyzer now tags each finding with confidence (high/medium/low)
and a fix_action; the gate auto-applies high-confidence fixes, surfaces
medium/low ones for user review, and blocks tool pages on error-level
findings until resolved or waived.

Core (src/core/):
  - analyze.py: Finding gains confidence, fix_action, pre_applied; new
    detectors for encoding_uncertain, encoding_decode_failed; new top-
    level encoding_override parameter.
  - fixes.py: registry of fix algorithms keyed by fix_action id.
  - normalize.py: auto_fix(), apply_decisions(), is_normalized(), and
    the NormalizationResult / Decision dataclasses the gate consumes.
  - io.py: detect_encoding tries strict UTF-8 first; repair_bytes now
    transcodes UTF-16/32 to UTF-8 before NUL-strip (fixes UTF-16 corruption)
    and normalizes line endings (fixes bare-CR parser crash); empty file
    handled gracefully instead of EmptyDataError traceback.

GUI (src/gui/):
  - pages/0_Review.py: gate page with per-finding decision controls,
    encoding override picker (16 codepages + custom), and Advanced output
    options (encoding, delimiter, line terminator) on the download.
  - components.py: require_normalization_gate() helper.
  - pages/1-9: gate guard wired on every tool page.

Test corpora:
  - test-cases/encodings-corpus/: 31 encoded CSV fixtures + 9 reference
    UTF-8 files + manifest, synced from Business/DataTools.
  - test-cases/text-cleaner-corpus/test_data/17: synced malformed input
    (unquoted $1,500.00) for the unquoted-delimiter detector.

Tests (94 new):
  - test_normalize.py (48): finding fields, fix registry, auto_fix scope,
    decision paths, gate idempotency, output-options helper.
  - test_encodings_corpus.py (90, 16 xfailed): parametric detection +
    decode + analyzer-no-crash sweep against the manifest.
  - test_analyze.py: encoding override + encoding_uncertain detectors.
  - test_corpus.py: pre-parse repair in the strict reader.

run_tests.py: new aliases --tool normalize, --tool encodings, --tool gate;
encodings corpus added to --fixtures category.

Docs: USER-GUIDE §3.3 covers the gate workflow, encoding override, and
output options; TECHNICAL §10.2.1-10.2.4 documents the analyzer schema,
gate API, Review page, and pre-parse repair pipeline; CLI-REFERENCE adds
the analyzer JSON schema with the new fields; README links to all of it.

Suite: 765 passed, 17 xfailed (was 458 passed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 20:35:27 +00:00
e9c490ae1b feat(gui): hidden-char-aware preview tables in Text Cleaner
The Text Cleaner had two st.dataframe previews — the initial upload
preview ("Preview: filename") and the post-clean "Cleaned preview"
table — that both rendered cells with the same browser-collapses-
whitespace, hides-invisibles problem the analyzer findings panel had
before commit 1049c03.

components.render_hidden_aware_preview(df, n_rows, caption) renders a
DataFrame as an HTML table where:
  - every cell uses visualize_hidden_html(mark_outer_whitespace=True),
    so leading/trailing ASCII spaces appear as per-character "·" badges
  - white-space: pre-wrap on every cell preserves internal multi-space
    runs and embedded newlines visually
  - headers route through the same visualizer so dirty column names
    (NBSP padding, ZWSP, smart quotes) show their badges too
  - NaN cells render as a faint "NaN" placeholder
  - rows are sticky-headed and scrollable inside a 26rem capped
    container so a 10-row preview doesn't push the rest of the UI off
    screen

2_Text_Cleaner.py wires it into both previews:
  - The upload preview gains its own "Show hidden characters in preview"
    toggle (default on).
  - The cleaned preview reuses the existing show_hidden toggle that
    already governs the Examples changes table, so one switch controls
    the whole results section.

Either toggle off falls back to the original st.dataframe view.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 16:26:30 +00:00
1049c033cb feat(gui): visualize leading/trailing whitespace in analyzer findings
The analyzer's "Run Analysis" panel rendered sample cells via st.dataframe,
which (a) silently collapses leading/trailing ASCII whitespace and (b)
displays NBSP/ZWSP/control chars as nothing. The user couldn't see the
exact pollution they were being told about.

visualize_hidden_html gains a mark_outer_whitespace=True option that
wraps each leading and trailing ASCII space/tab in its own badge with a
"SP LEAD" / "SP TRAIL" tooltip. The badges are per-character so the
user can count exactly how much padding the cleaner will strip.

components.render_findings_panel now:
  - injects hidden_char_css() once at the top of the panel
  - replaces st.dataframe(samples) with a custom HTML table
  - renders the value column with mark_outer_whitespace=True
  - applies white-space: pre-wrap on value cells so any internal ASCII
    whitespace also stays visible (browsers collapse runs by default)

Four new tests cover: leading+trailing badge counts, default-off
behaviour, leading tab badge, all-whitespace string treated entirely
as leading.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 16:21:39 +00:00
e12615357d fix(gui): use page paths relative to streamlit entrypoint
st.page_link resolves paths from the directory of the entrypoint file
(src/gui/app.py), so the existing "src/gui/{page_slug}" prefix doubled
up and produced StreamlitPageNotFoundError on first upload + analysis
(reproducible on Windows; the stack trace from a Windows install
surfaced the bug).

The _TOOL_PAGE_PATHS map already stores the correct relative form
("pages/2_Text_Cleaner.py"); just pass the slug straight to
st.page_link.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 16:17:50 +00:00
90ceada2d1 feat(text_clean): visualize hidden characters in the cleaner GUI
The whole point of the cleaner is to remove characters the user can't
see — which makes the "before / after" preview nearly useless by default.
A cell with NBSP padding looks identical to a cell with regular spaces.

Two new helpers in src.core.text_clean:

  visualize_hidden_text(s)
    Plain-text rendering: each invisible/control/smart character is
    replaced by a glyph + [LABEL] (e.g. "·[NBSP]", "→[TAB]", "∅[ZWSP]",
    """[L DQUOTE]"). Suitable for terminal output, CSV exports, anywhere
    HTML is wrong. Unmapped C0 controls render as [U+XXXX].

  visualize_hidden_html(s) + hidden_char_css()
    HTML rendering: every flagged character is wrapped in a <span> with
    a CSS class and a tooltip showing the codepoint and label. Pair with
    hidden_char_css() to inject the matching styles. Three colour bands
    (whitespace, special, control) so the user can scan an audit table
    and spot what's being changed at a glance.

Mapping covers: ASCII tab/LF/CR, every NBSP variant (U+00A0, U+202F,
U+2009, …), zero-width family (ZWSP/ZWNJ/ZWJ/WJ/BOM/SHY), bidi marks
(LRM/RLM), all smart quotes, en/em dashes, ellipsis, prime/double-prime,
and guillemets. ASCII printable text passes through; HTML output also
escapes &/</> .

GUI wiring (src/gui/pages/2_Text_Cleaner.py)
  The "Examples" changes table now defaults to a hidden-char-rendered
  HTML view: every NBSP/ZWSP/smart-quote/control char is shown with its
  badge and codepoint tooltip. A "Show hidden characters" toggle lets
  the user fall back to the raw st.dataframe view if they prefer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 16:14:14 +00:00
794d4cda94 feat(gui): tool pages pick up the home-page upload via session_state
Closes the last UX gap from the analyzer review: each tool page had its
own st.file_uploader, so users had to upload the same file twice (once
on the home page for analysis, once on each tool page).

components.pickup_or_upload(label, key, types) returns either:
  - a _StashedUpload shim wrapping the home-page bytes (when present and
    the user hasn't asked for a different file on this page), or
  - the standard st.file_uploader (when nothing is stashed or the user
    clicked "Use a different file").

_StashedUpload duck-types Streamlit's UploadedFile (.name, .size,
.getvalue(), .read()) so existing tool-page code consumes it without
changes. A "Use a different file" button per page sets a session-state
override flag; a "Switch back to upload-screen file" button clears it.

Wired into 2_Text_Cleaner.py and 1_Deduplicator.py — the two pages with
working uploaders today. The remaining stub pages adopt it when they're
implemented; the helper is the public surface they'll use.

Verified by smoke-launching streamlit headless and curling the home,
text-cleaner, and deduplicator routes — all return 200 with no errors
in the server log.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 16:09:51 +00:00
8dfc6ad8ae feat(analyze): add mixed_line_endings + near_duplicate_rows detectors
Two more detectors close the analyzer gap list:

mixed_line_endings (warn, tool=02): scans raw bytes for combinations of
  CRLF / LF / bare CR. Disaster pattern after multi-source concat
  (Windows + macOS + Linux exports stitched together). Operates on raw
  bytes only — DataFrame-mode analyze() skips it because raw bytes
  aren't available. _load_for_analysis now returns the raw bytes
  alongside the DataFrame and repair result so the detector has them.

near_duplicate_rows (info, tool=01): cheap dedup signal — strip and
  lowercase every string column, then count df.duplicated(). Catches the
  most common case (same customer entered twice with subtle formatting
  differences) without paying for fuzzy matching. Anything more
  sophisticated stays in tool 01.

Six new tests cover both detectors plus the dataframe-mode skip path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 16:09:42 +00:00
0671ef277e feat(io): route read_file through pre-parse repair by default
Previously only analyze() and direct read_csv_repaired() callers got the
byte-level repair pass (BOM strip, NUL strip, smart-double-quote fold,
unquoted-delimiter merge). The dedup CLI and any other read_file consumer
silently missed it.

read_file gains a repair=True default. CSV/TSV inputs run through
repair_bytes before pandas sees them; Excel inputs still pass through
unchanged. Chunked reads (chunk_size set) bypass repair because the pre-
parse pass loads the whole file — preserving streaming behavior on huge
files. Repair actions and unrepairable lines are logged at INFO/WARNING.

cli_text_clean opts out (repair=False): the cleaner offers fine-grained
control via --preset and per-op flags, and a byte-level smart-quote fold
under the user's "minimal" preset would violate that contract. The
cell-level cleaner does the equivalent work itself when its options ask
for it.

Tests: read_file default strips BOM and folds curly double quotes;
repair=False preserves smart quotes; chunked reads still work and skip
repair as documented.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 16:09:35 +00:00
0b959dee93 feat(text_clean): preserve internal whitespace in numeric/date/phone cells
Closes the §4.17 spec gap that test_gap_coverage.py was tracking via xfail:
collapse_whitespace must NOT touch cells whose shape carries meaningful
internal whitespace.

Adds _looks_structured(s) — returns True when s matches:
  - numeric (currency optional, thousand-grouping by , . or single space)
  - date (ISO/slash/dot separator, or 'Mon DD YYYY' / 'DD Mon YYYY')
  - phone (digits + parens/dots/dashes/+/spaces, >= 7 digits, no letters)

The pipeline uses a new _smart_collapse_whitespace wrapper that defers to
collapse_whitespace only when _looks_structured returns False. The raw
collapse_whitespace function is unchanged so direct callers and existing
unit tests remain valid.

Five new positive tests replace the xfail:
  - "(555)  123-4567" preserved (phone, double space inside)
  - "1 234" preserved (European thousands)
  - "2024-01-15" preserved (ISO date)
  - "Jan 15 2024" preserved (textual date)
  - "hello  world" still collapsed to "hello world" (free-text negative case)

Conservative on purpose: a false negative just collapses (existing
behavior); a false positive leaves intentional double spaces in prose.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 16:09:25 +00:00
4687cf87b4 test: single-command runner, cross-platform automation, fixture auto-discovery
Adds a top-level test infrastructure layer addressing four needs at once:
a single command to run anything, cross-platform automation, install/e2e
sanity, and zero-config pickup of new fixtures dropped into test-cases/.

Top-level runner — run_tests.py
  python run_tests.py                # everything (default)
  python run_tests.py --tool dedup   # one tool's tests
  python run_tests.py --unit         # category scopes
  python run_tests.py --e2e          # end-to-end CLI
  python run_tests.py --install      # import / dependency sanity
  python run_tests.py --fixtures     # corpus + dropped-file sweep
  python run_tests.py --coverage     # term-missing report
  python run_tests.py --quick        # skip @pytest.mark.slow
Tools: analyze, cli, config, dedup, io, normalizers, text_clean.

Cross-platform — tox.ini
  Envs for py310-py313 plus install / e2e / fixtures / coverage / lint.
  Forces UTF-8 (PYTHONUTF8=1, PYTHONIOENCODING=utf-8) so identical fixture
  bytes parse the same on Linux/macOS/Windows.

Shared config — pytest.ini
  testpaths, python_files conventions, custom markers (slow, e2e, install,
  fixture_sweep), warning filters that fail on our own DeprecationWarnings
  while tolerating third-party ones.

New test layers
  tests/test_install.py — required deps import; project modules import;
    src.core public API surface; CLI --help exits 0; streamlit app.py
    parses as valid Python; run_tests.py --help works.
  tests/test_e2e.py — CLI roundtrips: cli_analyze table + JSON, cli_text_clean
    --apply writes a real file with NBSP/smart-quote folded, dedup CLI
    removes duplicates, run_tests.py self-tests.
  tests/test_fixtures_sweep.py — parametrizes over every CSV/TSV/XLSX
    inside test-cases/ (excluding text-cleaner-corpus/, which has its own
    suite). Each fixture must: load through repair_bytes, run analyze()
    cleanly, and survive clean_dataframe() with row/col counts unchanged
    plus idempotency. Drop a CSV in, re-run — no test code changes needed.
  tests/test_gap_coverage.py — closes audit gaps: clean_headers=False
    toggle, repair_bytes with tab/semicolon delimiters, BOM+NUL+smart-
    quote combined-fix scenario, analyze() over an XLSX path, sample_rows
    larger than the DataFrame, mid-cell BOM, findings_by_tool edges, plus
    a strict xfail documenting the known §4.17 numeric/phone whitespace
    heuristic gap.

Test count
  Before: 288 passed + 1 xfailed
  After:  475 passed + 2 xfailed (the second xfail is the documented
          collapse_whitespace gap on phone-shaped cells; spec §4.17 calls
          for a heuristic that hasn't been implemented yet).

Functional gaps surfaced (not fixed in this commit):
  - Text cleaner: collapse_whitespace runs unconditionally on every string
    cell, including phone/numeric/date-shaped ones. Spec §4.17 requires a
    skip heuristic. Captured as strict xfail so the gap stays visible.
  - io.read_file does not run pre-parse repair; only analyze() and direct
    callers of read_csv_repaired() get it. CLI tool pages and the dedup
    CLI miss the safety net.
  - Analyzer has no mixed_line_endings detector or near_duplicate_rows
    detector; both planned but require additional plumbing.
  - GUI tool pages each have their own uploader instead of picking up the
    home-page upload through session_state.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 16:01:06 +00:00
a8943f29eb feat(gui): wire analyzer into home page with findings panel and tool badges
Home page (src/gui/app.py) gains an upload + analyze section above the tool
grid: file uploader, "Run analysis" / "Skip" buttons, and a findings panel
grouped by destination tool. Tool cards now carry a "N findings" badge
when the active session's findings reference that tool, so the user sees
at a glance which tools their just-uploaded file would benefit from.

src/gui/components.py adds the shared GUI surface:
  - TOOL_DISPLAY_NAMES + tool_display_name() — single source of truth for
    GUI labels, keeping detector tool ids decoupled from the UI.
  - render_findings_panel(findings) — severity icons, expander per tool,
    open-tool page link, sample-cells dataframe.
  - upload_and_analyze_section() — the home-page widget; stashes file
    bytes and findings in session_state so future tool pages can pick up
    the existing upload instead of re-prompting.
  - findings_count_for_tool(tool_id) — used by app.py to badge cards.

CSV/TSV uploads run through repair_bytes() before analysis, so the user
also sees csv_bom_stripped / csv_smart_quotes_folded findings synthesized
from the pre-parse repair pass. Excel uploads skip that step.

The Text Cleaner tool card flips from "Coming Soon" to "Ready" — that has
been true since the v3.0 implementation and the home page just hadn't been
updated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 15:53:22 +00:00
5c62fb6117 feat(cli): src.cli_analyze — Typer CLI for the analyzer
python -m src.cli_analyze input.csv             # rich table per tool
python -m src.cli_analyze input.csv --json      # array of finding dicts
python -m src.cli_analyze input.csv --strict    # exit 1 on warn/error
python -m src.cli_analyze input.csv -n 50000    # cap rows scanned

Findings are grouped by destination tool so the user can see at a glance
which tool to open next. Read-only; exit code 0 unless --strict is set.
The CLI keeps its own tool-id -> display-name map so it doesn't depend on
the GUI module.

7 tests cover: clean-file passthrough, dirty-file table, --json round-trip,
missing-file (exit 2), --strict exit code, --sample-rows cap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 15:53:11 +00:00
edf6ccf90b feat(analyze): upload-time data quality analyzer
Pure, advisory scan over an uploaded file or DataFrame that returns a list of
Finding objects naming each issue, the affected count, and which downstream
tool can fix it. The GUI uses this to badge tool nav items at upload; the CLI
will print findings as a table or JSON.

src/core/analyze.py:
  Finding dataclass (id, severity, tool, count, description, column, samples)
  analyze(source, *, sample_rows=1000, repair_result=None) -> list[Finding]
    - source: DataFrame, path, or str. Path scans first 1000 rows.
    - When source is a path, runs the same pre-parse repair the tool pages
      will use; the resulting RepairResult is auto-surfaced as csv_*
      findings. A caller-supplied repair_result wins so non-default repair
      flags are respected.
  Detectors (each independent, samples capped at 5):
    - smart_punctuation_in_data        -> 02
    - nbsp_or_unicode_whitespace       -> 02
    - zero_width_or_invisible          -> 02
    - dirty_column_headers             -> 02
    - whitespace_padding               -> 02
    - null_like_sentinels              -> 04
    - suspected_mojibake               -> 02 (Tier 2)
    - mixed_case_email_column          -> 02 case op
    - leading_zero_ids                 -> informational, no tool
  Helpers: findings_by_tool() for sidebar grouping, to_dict() for JSON.

Detectors are decoupled from the GUI display layer — they emit stable tool
ids ("02_text_cleaner") and the GUI maps those to display names.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 15:41:36 +00:00
b8a9fa1b09 feat(io): pre-parse CSV repair (BOM/NUL/smart-quotes/unquoted-delim)
Some pollution patterns block pandas before the cell-level cleaner can run.
Add a pre-parse pass on raw bytes that fixes only what breaks parsing, and
returns a structured action log the GUI/CLI can surface to the user.

repair_bytes(raw, *, encoding, delimiter, fold_quotes, strip_nul, repair_delims):
  1. Strip leading UTF-8 BOM.
  2. Strip embedded NUL bytes (the C parser truncates fields at NUL).
  3. Fold smart double quotes (curly, guillemet, double-prime) to ASCII '"'.
     Curly singles are NOT folded here; they don't conflict with CSV and the
     cell-level cleaner handles them more accurately.
  4. Per-row repair when one rogue delimiter is embedded in a field that
     looks like currency or thousands-grouped digits. Tiered scoring keeps
     "  $1,500.00  ,7" unambiguous: the strict currency regex match wins
     over the loose digit/sigil heuristic.

read_csv_repaired(path) -> (DataFrame, RepairResult). RepairResult exposes
.actions, .unrepairable_lines, and a summary() grouped by kind.

Out of scope for this pass: encoding repair, delimiter conversion, multi-
delimiter merges (k>1) — logged as unrepairable so callers can see what was
left alone instead of silently parsing wrong.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 15:37:49 +00:00
c349a90e18 test: add text-cleaner corpus and close gaps surfaced by it
The 21-fixture corpus (test-cases/text-cleaner-corpus/) exercises the cleaner
end-to-end against the spec in TEST-CASES.md. Closing the failing cases drove
five small cleaner fixes plus two fixture-generation fixes:

- _SMART_CHARS: add prime, double prime, guillemets (case 03)
- _ZERO_WIDTH: add soft hyphen U+00AD (case 05)
- clean_dataframe: clean column headers via the same pipeline (cases 16/19/20),
  with a clean_headers toggle on CleanOptions
- smart_title_case: title-case full-shout strings ("ALICE SMITH" -> "Alice
  Smith") while still preserving embedded acronyms; preserve uppercase after
  apostrophe in names ("O'CONNOR" -> "O'Connor", "o'neil" -> "O'neil")
- test_corpus.py reader: pre-strip NUL bytes (C parser truncates at NUL,
  python engine is too strict about embedded literal "), per spec case 06
- generate_test_data.py: properly CSV-escape literal-quote cells in case 03
  expected; quote the rogue-comma price field in case 17 input

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 15:37:35 +00:00
54f92ae47e feat: implement text cleaner (script 02) with CLI, GUI, and tests
Builds 02_text_cleaner.py from stub to working: character-level hygiene
for CSV/Excel inputs covering trim, whitespace collapse, smart-character
folding, Unicode NFC/NFKC, BOM strip, zero-width strip, control-char
strip, line-ending normalization, and per-column case conversion. Three
presets (minimal/excel-hygiene/paranoid) keep the buyer surface small.

- src/core/text_clean.py: pure helpers + CleanOptions/CleanResult +
  clean_dataframe with dtype-safe column selection
- src/cli_text_clean.py: Typer CLI mirroring the dedup CLI shape
  (dry-run by default, --apply writes cleaned + changes audit, JSON
  config save/load)
- src/gui/pages/2_Text_Cleaner.py: real Streamlit page with preset
  picker, advanced toggles, preview, before/after metrics, and three
  download buttons
- tests/test_text_clean.py + test_cli_text_clean.py: 92 new tests
  covering edge cases E1-E50 from the spec
- samples/messy_text.csv: demo dataset surfacing UC1, UC3, UC6, UC10
  in 10 rows
- test-cases/uc16-uc26 + ec05-ec09: per-use-case and per-edge-case
  fixtures

Docs: TECHNICAL.md §10.2 (full Tier 1/2/3 spec), DECISIONS.md v1.7
entry locking the spec, CLI-REFERENCE.md gains the text cleaner
section, README.md gains a top-level Text Cleaner block, USER-GUIDE.md
status row 02 promoted Skeleton -> Working.

200/200 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 15:14:15 +00:00
b2ca04e6f4 fix: scale app content to 85% zoom
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 01:30:58 +00:00
223148283d revert: remove 75% zoom, 100% fits correctly with chrome hidden
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 01:29:42 +00:00
1c609214b0 fix: scale app content to 75% to fit window
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 01:28:12 +00:00
dc48578c7e feat: launch Chrome in app mode for chromeless window
python -m src.gui now opens Chrome with --app flag, hiding the address
bar, tabs, and bookmarks bar. Falls back to default browser if Chrome
is not found. Headless flag passed via CLI so streamlit run directly
still auto-opens normally.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 01:24:54 +00:00
28bda8d624 fix: remove headless=true so browser opens on launch
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 01:23:07 +00:00
35ea21ad33 feat: hide Streamlit chrome for app-like appearance
Add shared hide_streamlit_chrome() helper that removes header bar,
hamburger menu, footer, and deploy button via CSS injection. Called
on every page. Add .streamlit/config.toml with minimal toolbar mode.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 01:20:54 +00:00
f2fdc10af7 feat: refactor GUI to multi-page Streamlit app with 9 tool pages
Convert single-page deduplicator into a multi-page suite. Home page shows
tool card grid. Deduplicator extracted to its own page (fully working).
8 stub pages added for Text Cleaner, Format Standardizer, Missing Values,
Column Mapper, Outlier Detector, Multi-File Merger, Validator & Reporter,
and Pipeline Runner — each with functional file upload and coming-soon UI.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-29 01:16:12 +00:00