Commit Graph

10 Commits

Author SHA1 Message Date
5b672370a6 perf: cache hot paths, drop wasted allocations, lift 1 GB → 1.5 GB
Five targeted wins driven by an end-to-end audit, with shape-pinning
regression tests so reverts are loud:

- format_standardize: fuse the dispatcher loop into one pass — was
  calling Series.tolist() three times per typed column and materialising
  an intermediate triples list; now one tolist, one walk. On a
  synthetic 1M-row phone+email frame this measures ~2.7M rows/sec
  (vs. the previous 150k/sec doc target).
- dedup: wrap normalizers in a per-call lru_cache so repeat phones /
  emails / addresses skip re-parsing. phonenumbers.parse is the
  expensive call; ~2–5x faster on the normalisation step for realistic
  workloads.
- analyze: _detect_near_duplicates no longer copies the full input
  frame; builds only the normalised string columns via a dict and
  references non-string columns by view. Skips the redundant
  astype(str) when a column is already pandas string dtype.
- text_clean: hoist _build_pipeline out of the per-cell loop and add a
  per-call string cache so 100k repeats of "Active" only run the
  pipeline once. ~1M rows/sec on repetition-heavy columns.
- io.repair_bytes: the non-UTF-8 smart-quote fold path used a
  Python-level zip walk over the entire decoded string to count
  replacements — replaced with sum(text.count(c) ...) which runs in
  C at ~GB/s. Was a latent ~100s on a 1 GB cp1252 file; now <1s.
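
The per-call cache described in the dedup bullet can be sketched as follows. This is a minimal illustration, not the project's code: `normalize_column` and `digits_only` are hypothetical names, and `digits_only` stands in for an expensive parser such as phonenumbers.parse. It assumes the values are hashable.

```python
from functools import lru_cache


def normalize_column(values, normalizer):
    """Apply `normalizer` to every value, parsing each distinct input once."""
    # The cache is created inside the call, so it is discarded when the
    # call returns and cannot grow across unrelated datasets.
    cached = lru_cache(maxsize=None)(normalizer)
    return [cached(v) for v in values]


def digits_only(phone: str) -> str:
    # Hypothetical stand-in for an expensive normalizer.
    return "".join(ch for ch in phone if ch.isdigit())


phones = ["(555) 010-0199", "555.010.0199", "(555) 010-0199"]
print(normalize_column(phones, digits_only))
```

Scoping the cache to one call trades some cross-call reuse for a bounded lifetime, which is one plausible reading of "per-call" in the message above.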

Updates REQUIREMENTS §10 with measured numbers and bumps the buyer-
facing upload limit from 1 GB to 1.5 GB across the i18n packs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 15:37:26 +00:00
966af8ef94 feat: 3 new tools, format streaming, distribution-ready demo + landing pages
Tools shipped this batch (4 → 6 of 9 Ready):
  04 Missing Value Handler   src/core/missing.py + cli_missing.py + GUI
  05 Column Mapper           src/core/column_mapper.py + cli_column_map.py + GUI
  09 Pipeline Runner         src/core/pipeline.py + cli_pipeline.py + GUI
                             with soft tool-dependency graph (recommended,
                             not enforced) and JSON save/load for repeatable
                             weekly cleanups.

Format Standardizer reworked for 1 GB international files:
  • Vectorised dispatch + LRU cache over phone/date/currency/boolean/email
  • Per-row country / address columns drive parsing
  • Audit cap (default 10 k rows, ~50 MB RAM)
  • standardize_file(): chunked streaming entry point (~165 k rows/sec)
  • currency_decimal="auto" for EU comma-decimal locales
  • R$ / kr / zł multi-char currency prefixes
  • cli_format.py with auto-stream above 100 MB inputs

Encoding detection arbiter + language-aware probe:
  Closes the last 4 xfails (cp1250 / mac_iceland / shift_jis_2004 / lying-BOM)
  via tied-confidence arbiter + Cyrillic / EE-Latin coverage probes.

Distribution-readiness assets:
  • streamlit_app.py — Streamlit Community Cloud entry shim
  • src/gui/app_demo.py — single-page demo, ?p=<persona> routing,
    100-row cap + watermark, free-vs-paid boundary enforced at surface
  • samples/demo/ — 3 niche datasets + pre-tuned pipeline JSONs
  • landing/ — 4 static HTML pages (apex chooser + 3 niche),
    shared CSS, deploy.py URL-substitution script,
    auto-generated robots.txt + sitemap.xml + 404.html + favicon
  • docs/PLAN.md, DEMO-PLAN.md, DEPLOYMENT.md, POST-LAUNCH.md, NEXT-STEPS.md
    — full strategy + measurement + deployment + master checklist

Test counts:
  before: 1,520 passed · 4 skipped · 17 xfailed
  after:  1,729 passed · 0 skipped · 0  xfailed

Tier-1 corpora added:
  • missing-corpus           3 use cases + 16 edge cases
  • column-mapper-corpus     3 use cases + 5 edge cases
  • format-cleaner intl      20-row 13-country stress fixture

Engine hardening flushed out by the corpora:
  • interpolate guards against object-dtype columns
  • mean/median skip all-NaN columns (silences numpy warning)
  • fillna runs under future.no_silent_downcasting (silences pandas warning)
  • mojibake test no longer skips when ftfy installed (monkeypatch path)
  • drop-row threshold semantics: strict-greater (consistent across rows / cols)
  • currency_decimal validator allow-set updated for "auto"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 22:31:26 +00:00
26b9771625 feat(errors): structured error hierarchy + helpful messages everywhere
Introduces src/core/errors.py with a small structured error hierarchy
that every public entry point now uses. Each error carries the
context a user needs to fix it and the context a maintainer needs to
trace it.

The hierarchy:
  DataToolsError  (base — formats path, column, operation, suggestion)
    InputValidationError  (extends ValueError — bad arg / wrong type)
    ConfigError           (extends ValueError — bad config / options)
    FileFormatError       (extends ValueError — file is not what we expected)
    FileAccessError       (extends OSError   — file I/O failure)

Subclassing the stdlib bases means existing `except OSError` /
`except ValueError` handlers still catch them — no breaking change.
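
A minimal sketch of how such a hierarchy could look. The class names and the four context fields come from the message above; the constructor signature and the rendering format are assumptions made for illustration, and only two of the four subclasses are shown.

```python
class DataToolsError(Exception):
    """Base error carrying optional context fields."""

    def __init__(self, message, *, path=None, column=None,
                 operation=None, suggestion=None):
        super().__init__(message)
        self.message = message
        self.path = path
        self.column = column
        self.operation = operation
        self.suggestion = suggestion

    def __str__(self):
        # Rendering format is illustrative only.
        parts = [self.message]
        for label in ("path", "column", "operation", "suggestion"):
            value = getattr(self, label)
            if value is not None:
                parts.append(f"{label}: {value}")
        return " | ".join(parts)


class InputValidationError(DataToolsError, ValueError):
    """Bad argument or wrong type."""


class FileAccessError(DataToolsError, OSError):
    """File I/O failure."""


# A pre-existing `except ValueError` handler still catches the new type.
try:
    raise InputValidationError(
        "chunk_size must be positive",
        operation="read_file",
        suggestion="pass a positive integer",
    )
except ValueError as exc:
    caught = str(exc)
print(caught)
```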

Helpers:
- ensure_dataframe(value, function=...)  — uniform DataFrame guard
- ensure_choice(value, name=, choices=)  — uniform enum/literal guard
- wrap_file_read(path, op, exc)          — tag OSError with hint + path
- wrap_file_write(path, op, exc)         — same, with Windows-aware tip
- format_for_user(exc, context=)         — user-facing string for st.error / stderr
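
As an illustration of the guard helpers, here is a hedged sketch of ensure_choice. The keyword names follow the list above; the return value, the message wording, and the choice of InputValidationError (defined locally as a stand-in) are assumptions.

```python
class InputValidationError(ValueError):
    """Local stand-in for the error class named in this commit."""


def ensure_choice(value, *, name, choices):
    """Return `value` if it is one of `choices`, else raise with the valid set."""
    if value not in choices:
        allowed = ", ".join(repr(c) for c in choices)
        raise InputValidationError(
            f"{name}={value!r} is not valid; expected one of: {allowed}"
        )
    return value


print(ensure_choice("DMY", name="date_order", choices=("DMY", "MDY", "YMD")))
```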

Library hardening:
- io.read_file: missing files surface FileAccessError stating whether
  the parent directory exists and suggesting that the user check the path.
- io.read_file: chunk_size <= 0 now raises InputValidationError with
  a positive-integer suggestion.
- io._read_excel: openpyxl BadZipFile / InvalidFileException / pandas
  ValueError ("sheet not found") wrapped as FileFormatError listing
  the path and a "list sheets with list_sheets()" hint.
- io._detect_excel_header_row: bare except narrowed to specific
  openpyxl exceptions; falls back gracefully and logs at debug so
  the real error surfaces from pd.read_excel.
- io.write_file: OSError / PermissionError on to_csv/to_excel wrapped
  with file path and Windows-aware "file may be open in another
  program" hint.
- dedup._parse_date: bare `except Exception` narrowed to
  (TypeError, ValueError, OutOfBoundsDatetime); failed values
  logged at debug for survivor-selection forensics.
- dedup._select_survivor: KEEP_MOST_RECENT now raises
  InputValidationError instead of silently falling back to keep_first.
- dedup.deduplicate: input validation errors are InputValidationError
  with operation/column/suggestion fields.
- format_standardize.from_dict: invalid FieldType for a column raises
  ConfigError naming the column AND the bad value AND listing valid
  values; same for date_order / phone_format / etc.
- format_standardize.from_file: OSError / JSON decode wrapped with
  path AND line/column where parsing failed.
- format_standardize.to_file: TypeError on json.dumps wrapped as
  ConfigError with the suspected source (extra_abbreviations).
- format_standardize._apply_field_type: dispatcher's "unknown field
  type" branch now raises AssertionError (it's an internal invariant,
  not user error — a new enum value was added without a branch).
- format_standardize._resolve_column_types: missing-column error now
  InputValidationError with a "check for typos / unparsed header"
  suggestion.
- format_standardize.standardize_dataframe: ensure_dataframe at entry.
- text_clean.clean_dataframe: ensure_dataframe at entry.
- config.to_strategies: invalid Algorithm/NormalizerType wrapped as
  ConfigError naming the strategy index AND the column.
- config.to_survivor_rule: invalid SurvivorRule wrapped as ConfigError
  listing valid values.
- config.from_file: OSError / JSON decode wrapped (mirror of
  StandardizeOptions.from_file).
- fixes.repair_mojibake: ImportError on ftfy now logged at info level
  with the underlying ImportError so a corrupt-package vs not-installed
  distinction is visible in the logs.
- normalizers.normalize_phone: phonenumbers.NumberParseException now
  logged at debug when the digits-only fallback drops extension /
  country-code information — gives a trail when matching results
  look wrong.

GUI / CLI surfaces:
- All 9 page handlers (`except Exception as e: st.error(...)`) now
  use format_for_user(), which renders DataToolsError fields nicely
  and falls back to "ClassName: message" for unrecognized errors.
- 2_Text_Cleaner and 3_Format_Standardizer additionally distinguish
  UnicodeDecodeError with a "re-save as UTF-8" suggestion before
  the generic handler.
- cli.py's "Error reading file" handler now uses format_for_user()
  and includes the input path in the prefix.

Tests:
- tests/test_errors.py — 22 new tests covering: base class formatting,
  stdlib inheritance, ensure_dataframe / ensure_choice helpers,
  wrap_file_read / wrap_file_write, format_for_user behavior, and
  end-to-end integration (missing file, missing dir, bad JSON, bad
  algorithm, bad enum, missing column).
- tests/test_audit_fixes.py + tests/test_io.py — updated 4 tests for
  the new exception types (InputValidationError replaces TypeError,
  FileAccessError extends OSError).

Full project suite: 1230 passed, 4 skipped, 17 xfailed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 02:35:42 +00:00
2eece6467d refactor: dedup, consolidate, harden public APIs across core modules
Closes 16 high-value findings from a parallel cross-module review.

Refactors:
- New src/core/_constants.py centralizes USPS street-suffix
  abbreviations, US state names, and 2-letter postal codes — one source
  of truth for both normalize_address (matching keys) and
  standardize_address (display formatting). Eliminates ~80 lines of
  duplicated dicts across normalizers.py and format_standardize.py.
- format_standardize.py: collapse 4 identical nested _err() helpers
  into one shared _err_or_passthrough() module function; drop a dead
  duplicate `return _err("not a phone number")` branch in
  standardize_phone.
- format_standardize.py: precompile per-locale month-name regexes
  (_MONTH_LOCALE_PATTERNS) and per-state-name regexes
  (_STATE_NAME_PATTERNS) at import time — they were rebuilt on every
  cell, a measurable hot path on million-row inputs.
- dedup.py: extract _is_missing(value) helper; one definition of
  "this cell is None / NaN / pd.NA" instead of two.
- fixes.py: extract _is_string_column(ser) helper; one dtype check
  instead of three duplicates across _apply_to_strings,
  _vectorized_translate, _vectorized_regex_sub.

Production-readiness:
- format_standardize.standardize_dataframe now logs a warning when
  more than 10% of typed cells are unparseable — surfaces the
  silently-broken-pipeline failure mode.
- StandardizeOptions.from_dict validates date_order / phone_format /
  currency_decimal / name_case / boolean_style / *_error_policy
  enum values up front, with a clear error message instead of a deep
  crash inside the per-cell function.
- StandardizeOptions.from_file and DeduplicationConfig.from_file wrap
  read + json.loads with descriptive OSError / ValueError messages
  including the file path.
- standardize_date(month_locales=...) validates locale codes against
  the available set instead of silently passing through unknown ones.
- io.read_file rejects chunk_size <= 0 (was silently failing inside
  pandas) and logs the resolved suffix + chunk_size at info level so
  data-pipeline runs are debuggable.
- io.read_file's FileNotFoundError gains explanatory context.
- io.write_file, text_clean.clean_dataframe, and dedup.deduplicate
  now reject non-DataFrame inputs with clear TypeError instead of
  cryptic pandas tracebacks downstream.
- dedup.deduplicate validates that survivor_rule=KEEP_MOST_RECENT has
  a usable date_column up front; the helper _select_survivor now
  raises (instead of silently falling back to keep_first) when called
  directly with bad arguments.
- dedup.deduplicate gains a structured no-op return when strategies
  is empty after auto-detection — preserves schema instead of crashing.
- analyze._detect_inconsistent_date_format narrows its bare except to
  (TypeError, ValueError) and logs a debug line so genuine bugs don't
  hide behind silent skip.

Tests:
- tests/test_audit_fixes.py grows by 11 cases covering the new
  validation paths (chunk_size, DataFrame guards, KEEP_MOST_RECENT
  date_column, enum validation, locale validation, JSON error wrapping).

Full project suite: 1208 passed, 4 skipped, 17 xfailed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 02:23:09 +00:00
b23a27d4e3 fix: cross-tool audit findings + alignment with format standardizer
Closes 12 bugs and 8 gaps surfaced by parallel audits across all core
modules, plus aligns the dedup-side normalizers with the new
format_standardize behavior where they had silently diverged.

Bugs (data integrity / correctness):
- dedup: NaN/None values matched as duplicates because str(None)='None'.
  Two rows with missing email silently merged.
- dedup: removed_df had 0 columns when nothing was removed; downstream
  code expecting matching schema broke. Now preserves column shape.
- dedup: ColumnMatchStrategy threshold accepted any value; out-of-range
  silently broke matching. Validated to [0, 100] in __post_init__.
- dedup: strategy referencing a missing column was silently skipped.
  Now raises ValueError listing available columns.
- fixes: replace_null_sentinels crashed on non-string sentinels (int/None
  from JSON payload). Coerced to str.
- fixes: _vectorized_regex_sub raised raw re.error on bad patterns. Now
  wraps as ValueError with clear message.
- io: detect_header_row mis-identified all-empty and metadata-only rows
  as headers (all([]) is True). Now requires ≥2 non-empty cells.
- config: from_dict crashed when JSON had unknown fields, breaking
  forward compat. Now filters to known fields.
- analyze: mixed-case email detector flagged all-None columns because
  str(None)='None' contains both an uppercase letter ("N") and lowercase
  letters ("one"). Now drops NaN before stringifying.
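
Two of the pitfalls above are easy to reproduce in plain Python. The sketch below is illustrative only: `is_missing`, `match_key`, and `looks_like_header` are hypothetical helpers, and `is_missing` covers just None and float NaN (the real code would also need to handle pd.NA).

```python
import math


def is_missing(value) -> bool:
    """True for None and float NaN (pd.NA is not handled in this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def match_key(value):
    """Key used for duplicate matching; missing values get no key at all."""
    return None if is_missing(value) else str(value).strip().lower()


def looks_like_header(cells, min_non_empty=2):
    """Require enough non-empty cells before testing that all are strings."""
    non_empty = [c for c in cells if not is_missing(c) and str(c).strip()]
    if len(non_empty) < min_non_empty:
        return False  # without this guard, all([]) below would be True
    return all(isinstance(c, str) for c in non_empty)


# The two traps named in the bug list:
print(str(None) == "None")  # stringified missing values compare equal
print(all([]))              # vacuously true on an empty row
```

Rows whose match_key is None can then be excluded from pairing instead of being merged with each other.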

New features and gap closures:
- io: _detect_excel_header_row mirrors detect_header_row for Excel via
  openpyxl read-only; _read_excel uses it when header_row=None.
- io: write_file gains delimiter + encoding params; .tsv extension
  defaults to tab.
- normalizers: normalize_phone preserves extensions as ;ext=N suffix.
- normalizers: normalize_address folds spelled-out US state names to
  2-letter codes (California ≡ CA).
- normalizers: normalize_name drops surname particles (van, de, von)
  so "Charles de Gaulle" ≡ "Charles Gaulle" for matching.
- analyze: new _detect_inconsistent_date_format detector flags columns
  with mixed ISO/US/EU date shapes; routes to format standardizer.
- analyze: _NULL_LIKE recognizes "<na>" (pd.NA repr).
- analyze: duplicate-row finding renamed count → n_extra (rows that
  would actually be removed) with clarified description.
- dedup: group_confidence no longer falsely 100.0 when transitive group
  members lack a recorded direct pair; falls back to 100.0 only when
  truly no pairs were observed.
- dedup: MatchResult / DeduplicationResult docstrings clarify that
  row_indices refer to the input frame's positional index (output index
  is reset).
- text_clean: visualize_hidden_html(None) now returns None (matches
  visualize_hidden_text); strip_bom strips at most one BOM per call;
  sentence_case dead elif branch removed.
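
The surname-particle rule can be sketched like this. The particle set is limited to the three examples given above, and lowercasing plus whitespace splitting are assumptions; `name_match_key` is a hypothetical name, not the project's normalize_name.

```python
PARTICLES = frozenset({"van", "de", "von"})


def name_match_key(name: str) -> str:
    """Build a matching key that ignores surname particles."""
    tokens = name.lower().split()
    kept = [t for t in tokens if t not in PARTICLES]
    # Fall back to the original tokens if the name was only particles.
    return " ".join(kept or tokens)


print(name_match_key("Charles de Gaulle") == name_match_key("Charles Gaulle"))
```

The key is meant for matching only; the displayed name would keep its particles.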

Tests:
- tests/test_audit_fixes.py — 28 regression tests, one or more per
  numbered finding, named after BUG/GAP/NIT tags so future readers
  can trace each test back to its audit.
- tests/test_fixes_unit.py — 26 isolated unit tests for previously
  integration-only fix functions (trim_whitespace, strip_nbsp,
  strip_zero_width, normalize_line_endings, clean_headers,
  repair_mojibake; the last is skipped if ftfy is unavailable).
- tests/test_io.py — adds CSV / TSV / semicolon / UTF-8-BOM round-trip
  tests + Excel auto-header-detection tests.
- tests/test_normalizers.py — adds 8 tests for the alignment work
  above (phone extension, state names, particles).

Adds .claude/ to .gitignore (agent worktrees + local settings).

Full project suite: 1197 passed, 4 skipped, 17 xfailed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 02:11:57 +00:00
438bc0f84d perf: 1 GB-class file efficiency for the analyzer + gate pipeline
Six targeted changes that drop the user-visible analyzer scan time from
"go for coffee" to sub-second on 1 GB inputs and reduce peak RSS by ~10×.

src/core/io.py
  - detect_encoding: open + read sample bytes instead of read_bytes()[:N].
    Was allocating the full file in memory just to slice the head; on a
    1 GB input this saves a 1 GB intermediate allocation.
  - repair_bytes: byte-level smart-quote fold via bytes.replace when the
    input is UTF-8. The probe (b"\xe2\x80" / b"\xc2\xab" / b"\xc2\xbb")
    is a single C-implemented contains check that skips the entire fold
    stage on files with no smart quotes — most of them.
  - repair_bytes: skip the per-row csv.reader walk unless a cheap byte
    scan finds a currency sigil ($/€/£), the delimiter is non-comma, the
    decoder substituted U+FFFD, or _has_field_count_mismatch detects an
    unquoted-delimiter row. csv.reader was the dominant cost in
    repair_bytes on big files (materializes a list of every row).
  - _has_field_count_mismatch: hand-rolled quote-state walker; one pass,
    no allocation, returns True at first mismatch. False positives just
    fall through to the slower _repair_rows pass.
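
A quote-state walker of the kind described for _has_field_count_mismatch might look like the sketch below. It is an approximation under stated assumptions: it works on decoded text rather than bytes, assumes "\n" line endings, uses the first row's delimiter count as the expected value, and skips blank lines. The function name is hypothetical.

```python
def has_field_count_mismatch(text: str, delimiter: str = ",") -> bool:
    """Return True at the first row whose delimiter count differs from row 1."""
    expected = None      # delimiter count of the first non-blank row
    count = 0            # delimiters seen outside quotes in the current row
    in_quotes = False
    row_has_content = False

    def row_mismatch() -> bool:
        nonlocal expected
        if expected is None:
            expected = count
            return False
        return count != expected

    for ch in text:
        if ch == '"':
            # A doubled quote ("") toggles twice, so the state is preserved.
            in_quotes = not in_quotes
            row_has_content = True
        elif in_quotes:
            row_has_content = True  # delimiters and newlines inside quotes are data
        elif ch == "\n":
            if row_has_content and row_mismatch():
                return True
            count = 0
            row_has_content = False
        else:
            row_has_content = True
            if ch == delimiter:
                count += 1
    # Final row without a trailing newline.
    return row_has_content and row_mismatch()


print(has_field_count_mismatch("amount,qty\n$1,500.00,7\n"))
```

The walk keeps only a few scalars of state, which matches the "one pass, no allocation" description; a True result only means the slower per-row pass should run.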

src/core/analyze.py
  - _load_for_analysis: read only ~max(4KB, sample_rows × 256B × 2) head
    bytes for the analyzer's sample-mode scan. Drops
    analyze(sample_rows=1000) from "read + repair full file" to
    "read + repair 500KB", 150× faster on a 1.25 GB file. Falls back to
    a single full-file retry if pandas reports fewer rows than the cap.
  - Compiled regex character classes for hot-path detectors and a
    _vec_match_count helper that runs Series.str.contains in C instead
    of Python per-cell loops. Detectors converted: smart_punctuation,
    invisible_chars (NBSP + zero-width), whitespace_padding,
    null_like_sentinels, mojibake, encoding_uncertainty,
    mixed_case_email, leading_zero_ids.
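
The head-byte budget for the sample-mode scan works out as follows. The formula is the one quoted above; reading "4KB" as 4096 bytes is an assumption, and `head_byte_budget` and `read_head` are hypothetical helper names.

```python
def head_byte_budget(sample_rows: int, bytes_per_row: int = 256,
                     safety_factor: int = 2, floor: int = 4096) -> int:
    """max(4KB, sample_rows x 256B x 2), with 4KB taken as 4096 bytes."""
    return max(floor, sample_rows * bytes_per_row * safety_factor)


def read_head(path, sample_rows: int) -> bytes:
    """Read only the head of the file instead of loading all of it."""
    with open(path, "rb") as fh:
        return fh.read(head_byte_budget(sample_rows))


# 1000 rows x 256 B x 2 = 512,000 B, i.e. exactly 500 KiB.
print(head_byte_budget(1000), head_byte_budget(1000) / 1024)
```

If the sampled head yields fewer rows than requested, the commit falls back to one full-file retry; that path is not shown here.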

src/core/fixes.py
  - _vectorized_translate / _vectorized_regex_sub: pandas-native string
    transforms for the fixes that are pure character maps (strip_nbsp,
    fold_smart_punctuation, strip_zero_width). Series.str.translate
    runs in C — 10-50× faster than per-cell Python.
  - _apply_to_strings: replaced inner per-cell loops with Series.map +
    boolean-mask diff for the count.
  - All fix entry points read an "inplace" flag from payload and thread
    it through the helpers.
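
A translation table of the kind these character-map fixes rely on can be built with the standard library. The sketch below applies it with plain str.translate; pandas' Series.str.translate accepts the same kind of table. Which characters each fix actually covers is an assumption here.

```python
NBSP = "\u00a0"
ZERO_WIDTH = "\u200b\u200c\u200d\ufeff"

# One table: map NBSP to a space, fold curly quotes, delete zero-width chars.
CLEAN_TABLE = str.maketrans({
    NBSP: " ",
    "\u201c": '"', "\u201d": '"',
    "\u2018": "'", "\u2019": "'",
})
# Mapping an ordinal to None tells translate() to delete that character.
CLEAN_TABLE.update(dict.fromkeys(map(ord, ZERO_WIDTH)))


def clean_cell(text: str) -> str:
    return text.translate(CLEAN_TABLE)


print(clean_cell("\u201cActive\u00a0user\u200b\u201d"))
```

With a pandas Series `ser` of strings, the equivalent call would be `ser.str.translate(CLEAN_TABLE)`.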

src/core/normalize.py
  - apply_decisions: takes a single working copy at the top, then sets
    payload["inplace"] = True so each chained fix mutates that copy.
    Previously every fix did df.copy(); N fixes × 6 GB DataFrame =
    30+ GB peak. Now: one 6 GB allocation.

Validation: 765 passed, 17 xfailed (no regressions). 100 MB benchmark:

  stage                              before       after
  ------------------------------     -------      --------
  detect_encoding                    0.97s+1.3GB  ~0s + 0 MB
  analyze (sample_rows=1000)         235.76s      0.08s
  _load_for_analysis (1000 rows)     148.17s      0.01s
  repair_bytes (full file)           150s/1.25GB  2.91s/100MB

The user-visible analyzer scan dropped from minutes to sub-second on
1 GB-class files. Full-DataFrame analyze + auto_fix improvements are
more modest (~25%) because trim_whitespace and replace_null_sentinels
still need per-cell Python for the structural-shape checks, but the
hot path through these is now bounded by pandas' .map rather than a
manual for loop.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 21:13:47 +00:00
82d7fef21e feat(gate): CSV-normalization gate with confidence-tiered findings
Adds a Review & Normalize page that sits between upload and every tool
page. The analyzer now tags each finding with confidence (high/medium/low)
and a fix_action; the gate auto-applies high-confidence fixes, surfaces
medium/low ones for user review, and blocks tool pages on error-level
findings until resolved or waived.
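
The triage rule can be sketched as below. `confidence` and `fix_action` are fields named in this commit; the `severity` field, its "error" value, and the `triage` function are assumptions introduced to illustrate "error-level findings", and waiving is not modelled.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Finding:
    id: str
    severity: str                     # assumed field: "error" | "warning" | "info"
    confidence: str                   # "high" | "medium" | "low"
    fix_action: Optional[str] = None  # id of a registered fix, if any


def triage(findings: List[Finding]) -> Tuple[list, list, list]:
    """Split findings into auto-applied, needs-review, and blocking."""
    auto, review = [], []
    for f in findings:
        if f.confidence == "high" and f.fix_action:
            auto.append(f)
        else:
            review.append(f)
    # Unresolved error-level findings keep the tool pages blocked.
    blocking = [f for f in review if f.severity == "error"]
    return auto, review, blocking


findings = [
    Finding("bom", "warning", "high", "strip_bom"),
    Finding("encoding_uncertain", "warning", "medium"),
    Finding("encoding_decode_failed", "error", "low"),
]
auto, review, blocking = triage(findings)
print([f.id for f in auto], [f.id for f in review], [f.id for f in blocking])
```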

Core (src/core/):
  - analyze.py: Finding gains confidence, fix_action, pre_applied; new
    detectors for encoding_uncertain, encoding_decode_failed; new top-
    level encoding_override parameter.
  - fixes.py: registry of fix algorithms keyed by fix_action id.
  - normalize.py: auto_fix(), apply_decisions(), is_normalized(), and
    the NormalizationResult / Decision dataclasses the gate consumes.
  - io.py: detect_encoding tries strict UTF-8 first; repair_bytes now
    transcodes UTF-16/32 to UTF-8 before NUL-strip (fixes UTF-16 corruption)
    and normalizes line endings (fixes bare-CR parser crash); empty file
    handled gracefully instead of EmptyDataError traceback.

GUI (src/gui/):
  - pages/0_Review.py: gate page with per-finding decision controls,
    encoding override picker (16 codepages + custom), and Advanced output
    options (encoding, delimiter, line terminator) on the download.
  - components.py: require_normalization_gate() helper.
  - pages/1-9: gate guard wired on every tool page.

Test corpora:
  - test-cases/encodings-corpus/: 31 encoded CSV fixtures + 9 reference
    UTF-8 files + manifest, synced from Business/DataTools.
  - test-cases/text-cleaner-corpus/test_data/17: synced malformed input
    (unquoted $1,500.00) for the unquoted-delimiter detector.

Tests (94 new):
  - test_normalize.py (48): finding fields, fix registry, auto_fix scope,
    decision paths, gate idempotency, output-options helper.
  - test_encodings_corpus.py (90, 16 xfailed): parametric detection +
    decode + analyzer-no-crash sweep against the manifest.
  - test_analyze.py: encoding override + encoding_uncertain detectors.
  - test_corpus.py: pre-parse repair in the strict reader.

run_tests.py: new aliases --tool normalize, --tool encodings, --tool gate;
encodings corpus added to --fixtures category.

Docs: USER-GUIDE §3.3 covers the gate workflow, encoding override, and
output options; TECHNICAL §10.2.1-10.2.4 documents the analyzer schema,
gate API, Review page, and pre-parse repair pipeline; CLI-REFERENCE adds
the analyzer JSON schema with the new fields; README links to all of it.

Suite: 765 passed, 17 xfailed (was 458 passed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 20:35:27 +00:00
0671ef277e feat(io): route read_file through pre-parse repair by default
Previously only analyze() and direct read_csv_repaired() callers got the
byte-level repair pass (BOM strip, NUL strip, smart-double-quote fold,
unquoted-delimiter merge). The dedup CLI and any other read_file consumer
silently missed it.

read_file gains a repair=True default. CSV/TSV inputs run through
repair_bytes before pandas sees them; Excel inputs still pass through
unchanged. Chunked reads (chunk_size set) bypass repair because the pre-
parse pass loads the whole file — preserving streaming behavior on huge
files. Repair actions and unrepairable lines are logged at INFO/WARNING.

cli_text_clean opts out (repair=False): the cleaner offers fine-grained
control via --preset and per-op flags, and a byte-level smart-quote fold
under the user's "minimal" preset would violate that contract. The
cell-level cleaner does the equivalent work itself when its options ask
for it.

Tests: read_file default strips BOM and folds curly double quotes;
repair=False preserves smart quotes; chunked reads still work and skip
repair as documented.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 16:09:35 +00:00
b8a9fa1b09 feat(io): pre-parse CSV repair (BOM/NUL/smart-quotes/unquoted-delim)
Some pollution patterns block pandas before the cell-level cleaner can run.
Add a pre-parse pass on raw bytes that fixes only what breaks parsing, and
returns a structured action log the GUI/CLI can surface to the user.

repair_bytes(raw, *, encoding, delimiter, fold_quotes, strip_nul, repair_delims):
  1. Strip leading UTF-8 BOM.
  2. Strip embedded NUL bytes (the C parser truncates fields at NUL).
  3. Fold smart double quotes (curly, guillemet, double-prime) to ASCII '"'.
     Curly singles are NOT folded here; they don't conflict with CSV and the
     cell-level cleaner handles them more accurately.
  4. Per-row repair when one rogue delimiter is embedded in a field that
     looks like currency or thousands-grouped digits. Tiered scoring keeps
     "  $1,500.00  ,7" unambiguous: the strict currency regex match wins
     over the loose digit/sigil heuristic.
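
Steps 1 to 3 can be sketched on raw bytes as follows. This is a simplified illustration that assumes UTF-8 input; `repair_head` and the action labels are hypothetical, and the per-row delimiter repair of step 4 is not shown.

```python
from typing import List, Tuple

UTF8_BOM = b"\xef\xbb\xbf"
# Curly doubles, guillemets, and double prime; curly singles are left alone.
SMART_DOUBLE_QUOTES = ("\u201c", "\u201d", "\u00ab", "\u00bb", "\u2033")


def repair_head(raw: bytes) -> Tuple[bytes, List[str]]:
    """Apply BOM strip, NUL strip, and smart-double-quote fold; log each action."""
    actions = []
    if raw.startswith(UTF8_BOM):
        raw = raw[len(UTF8_BOM):]
        actions.append("strip_bom")
    if b"\x00" in raw:
        # Only safe for single-byte-compatible encodings such as UTF-8.
        raw = raw.replace(b"\x00", b"")
        actions.append("strip_nul")
    folded = 0
    for ch in SMART_DOUBLE_QUOTES:
        seq = ch.encode("utf-8")
        n = raw.count(seq)
        if n:
            raw = raw.replace(seq, b'"')
            folded += n
    if folded:
        actions.append(f"fold_quotes:{folded}")
    return raw, actions


sample = b"\xef\xbb\xbfname,\x00note\n\xe2\x80\x9cAnn\xe2\x80\x9d,it\xe2\x80\x99s\n"
print(repair_head(sample))
```

Returning the action list alongside the bytes mirrors the idea of a structured log that a GUI or CLI can surface.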

read_csv_repaired(path) -> (DataFrame, RepairResult). RepairResult exposes
.actions, .unrepairable_lines, and a summary() grouped by kind.

Out of scope for this pass: encoding repair, delimiter conversion, multi-
delimiter merges (k>1) — logged as unrepairable so callers can see what was
left alone instead of silently parsing wrong.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 15:37:49 +00:00
b871ab24fc feat: add documentation, Streamlit GUI, and full source tree
- Rewrite README.md with project overview, quick-start, and CLI summary
- Add docs/CLI-REFERENCE.md with full flag reference and 8 recipe sections
- Add docs/DEVELOPER.md with architecture, data flow, and extension guides
- Rewrite src/core/__init__.py with public API exports and module docstring
- Add Streamlit GUI (src/gui/) with file upload, advanced options, interactive
  match group review with side-by-side diff, and download buttons
- Add .gitignore, requirements.txt, all source code, tests, and sample data
- Add streamlit to requirements.txt

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-28 23:06:39 +00:00