README + USER-GUIDE describe the sidebar picker and current coverage
(home + shared chrome, per-tool bodies pending). DEVELOPER gains a
how-to for adding packs and keys with the parity-test guarantee.
TECHNICAL §10b records the in-house-JSON architecture and locks in the
no-gettext decision (also logged in DECISIONS). REQUIREMENTS reflects
the new interface surface and updated test count. COPY.md adds a
"Language claim" slot so landing/email work can pick it up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes ~17 high-value international gaps surfaced by parallel review.
Adds 93 regression tests; full project suite now 1323 / 0 / 17 (passed
/ failed / xfailed).
DATES
- Adds Portuguese, Italian, Dutch, Russian month dictionaries to the
opt-in ``month_locales`` set (now: en, fr, de, es, pt, it, nl, ru).
- Adds localized weekday recognition for those locales — "Lundi",
"Montag", "lunedì", "понедельник", etc. all strip cleanly before
format matching.
- New CJK separator normalization: Japanese ``2024年01月15日`` and
fullwidth digits ``2024/01/15`` fold to ASCII before parsing.
- New named-timezone resolution: EST/PST/JST/CET/IST/GMT/etc. map to
fixed UTC offsets via ``_NAMED_TZ_OFFSETS`` so the trailing TZ
doesn't block format matching.
- New ISO 8601 extended formats: week date (``2024-W03-1``) and
ordinal date (``2024-015``), plus RFC 2822 mail-header form
(``Mon, 15 Jan 2024 10:30:00``).
- New ``two_digit_year_cutoff`` parameter on ``standardize_date()`` —
defaults to Python's stdlib 69; lower it for birth-year columns
where most subjects were born ≤ 1999.
NAMES
- Particles set extended with Arabic patronymic markers (bin, ibn,
bint, abu, abd, al, al-, el-) and Hebrew (ben, bat, ha, ha-).
- Title set extended with German (Herr, Frau), French (M., Mme,
Mlle), Spanish (Sr., Sra., Srta., Don, Doña), Italian (Sig., Sig.ra,
Dott.), Portuguese.
- Acronym map extended with international academic credentials
(Dipl, Ing, Mag, Habil, MSc, BSc, LLB, LLM).
- New East Asian honorific suffix handler: ``Tanaka-san``,
``Lee-ssi``, ``Park-nim`` keep the suffix lowercase after the
hyphen instead of being title-cased into ``Tanaka-San``.
- Hyphenated-segment handler now keeps Arabic prefixes ``al-`` /
``el-`` lowercase per Arabic transliteration convention.
- New ``family_first`` parameter on ``standardize_name()`` and matching
``name_family_first`` field on ``StandardizeOptions`` — set
per-column for East Asian data to skip Western comma-format reversal
(``Kim, Min-jae`` stays ``Kim, …`` instead of becoming ``Min-jae Kim``).
CURRENCY
- Symbol map extended: ฿(THB), ₫(VND), ₮(MNT), ₴(UAH), ₦(NGN),
₱(PHP), ₲(PYG), ﷼(SAR), ₨(PKR), ₵(GHS) — covers SE Asia, Africa,
Eastern Europe, Latin America gaps.
- ISO 4217 code list extended from 23 to ~50: SAR, AED, QAR, KWD,
BHD, OMR, ARS, CLP, COP, EGP, IDR, MYR, PHP, THB, VND, NGN, GHS,
KES, HUF, CZK, RON, UAH, KZT, etc.
EMAIL
- New BIDI / RTL override stripping (``standardize_email``):
U+202A-U+202E and U+2066-U+2069 stripped from every email. These
are a known phishing vector — ``alice@example.com`` displays as
``alice@elpmaxe.com`` to RTL-aware renderers.
ADDRESS
- Canadian provinces: 13 codes + names → 2-letter (Ontario → ON).
- UK postcode pattern recognition (``SW1A 2AA`` shape).
- Australian states: 8 codes + names (NSW, VIC, QLD, … + full names).
- German Bundesland: 16 codes + names (Bayern → BY, etc.).
- International PO Box variants: ``Postfach`` (DE), ``Boîte postale``
(FR), ``Apartado`` (ES), ``Casella postale`` (IT), ``Caixa postal``
(PT) — all fold to canonical ``PO Box``.
- ``_INTL_STATE_CODES`` now combines US/CA/AU/DE codes; the position
check that preserves state codes regardless of input case applies
to all four jurisdictions.
- ``_is_state_code_position`` postal pattern broadened to recognize
US ZIP, AU 4-digit, CA first half, and UK outward code.
CONSTANTS
- ``src/core/_constants.py`` gains: ``CA_PROVINCE_CODES`` /
``CA_PROVINCE_NAMES``, ``AU_STATE_CODES`` / ``AU_STATE_NAMES``,
``DE_STATE_CODES`` / ``DE_STATE_NAMES``, ``POSTAL_PATTERNS``
(us/ca/uk/de/au/fr), ``INTL_PO_BOX_PATTERNS`` (per-language regex),
``INTL_STREET_SUFFIXES`` (de/fr/es/it/uk dictionaries — ready for
use when address takes a `country_hint` parameter in a future pass).
DOCS
- TECHNICAL.md §11.3 domain table updated with the new handling per
domain plus a new "International coverage" sub-section listing the
supported locales / symbols / jurisdictions.
DEFERRED (out of scope or rare)
- Alternative calendars (Japanese era, Hijri, Hebrew, Buddhist) —
corpus § 3.5 marks out of scope.
- Persian/Arabic-Indic digit conversion — rare in tabular data.
- Trailing-minus RTL currency convention.
- Punycode ↔ Unicode IDN normalization.
- Mixed-country phone column auto-detection (user can override
``default_region`` per column).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New docs/REQUIREMENTS.md catalogs every shipped capability in 17 numbered
categories — file handling, input/output encodings, delimiters, line
endings, detectors, finding schema, confidence tiers, decisions,
performance targets (1 GB), tools, gate behavior, interfaces, platforms,
deps, test coverage, privacy. Linked from README and USER-GUIDE so a
buyer / integrator can scan compliance in under a minute.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a Review & Normalize page that sits between upload and every tool
page. The analyzer now tags each finding with confidence (high/medium/low)
and a fix_action; the gate auto-applies high-confidence fixes, surfaces
medium/low ones for user review, and blocks tool pages on error-level
findings until resolved or waived.
Core (src/core/):
- analyze.py: Finding gains confidence, fix_action, pre_applied; new
detectors for encoding_uncertain, encoding_decode_failed; new top-
level encoding_override parameter.
- fixes.py: registry of fix algorithms keyed by fix_action id.
- normalize.py: auto_fix(), apply_decisions(), is_normalized(), and
the NormalizationResult / Decision dataclasses the gate consumes.
- io.py: detect_encoding tries strict UTF-8 first; repair_bytes now
transcodes UTF-16/32 to UTF-8 before NUL-strip (fixes UTF-16 corruption)
and normalizes line endings (fixes bare-CR parser crash); empty file
handled gracefully instead of EmptyDataError traceback.
GUI (src/gui/):
- pages/0_Review.py: gate page with per-finding decision controls,
encoding override picker (16 codepages + custom), and Advanced output
options (encoding, delimiter, line terminator) on the download.
- components.py: require_normalization_gate() helper.
- pages/1-9: gate guard wired on every tool page.
Test corpora:
- test-cases/encodings-corpus/: 31 encoded CSV fixtures + 9 reference
UTF-8 files + manifest, synced from Business/DataTools.
- test-cases/text-cleaner-corpus/test_data/17: synced malformed input
(unquoted $1,500.00) for the unquoted-delimiter detector.
Tests (94 new):
- test_normalize.py (48): finding fields, fix registry, auto_fix scope,
decision paths, gate idempotency, output-options helper.
- test_encodings_corpus.py (90, 16 xfailed): parametric detection +
decode + analyzer-no-crash sweep against the manifest.
- test_analyze.py: encoding override + encoding_uncertain detectors.
- test_corpus.py: pre-parse repair in the strict reader.
run_tests.py: new aliases --tool normalize, --tool encodings, --tool gate;
encodings corpus added to --fixtures category.
Docs: USER-GUIDE §3.3 covers the gate workflow, encoding override, and
output options; TECHNICAL §10.2.1-10.2.4 documents the analyzer schema,
gate API, Review page, and pre-parse repair pipeline; CLI-REFERENCE adds
the analyzer JSON schema with the new fields; README links to all of it.
Suite: 765 passed, 17 xfailed (was 458 passed).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Update README, CLI reference, and developer guide to cover delimiter
selector, inline checkboxes/dropdowns, live surviving rows preview,
multi-row survivors, and apply_review_decisions(). Remove dead link.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Rewrite README.md with project overview, quick-start, and CLI summary
- Add docs/CLI-REFERENCE.md with full flag reference and 8 recipe sections
- Add docs/DEVELOPER.md with architecture, data flow, and extension guides
- Rewrite src/core/__init__.py with public API exports and module docstring
- Add Streamlit GUI (src/gui/) with file upload, advanced options, interactive
match group review with side-by-side diff, and download buttons
- Add .gitignore, requirements.txt, all source code, tests, and sample data
- Add streamlit to requirements.txt
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>