Files
datatools-dev/test-cases/format-cleaner-corpus/FORMATS-CASES.md
Michael 4adeb5c7f3 feat(format): per-cell standardizers + 199-row buyer corpus
Adds src/core/format_standardize.py — a per-cell standardizer for dates,
phones, emails, addresses, names, currencies, booleans — wired through
StandardizeOptions / standardize_dataframe with FieldType registry.

Includes:
- Date parser handles ISO/US/EU/longform/excel-serial/unix-timestamp/
  partial-precision/quarter notation; opt-in French/German/Spanish month
  dictionaries via month_locales.
- Phone via libphonenumber with extension preservation (;ext=N), 001
  international prefix handling, error sentinels for placeholders /
  multi-number cells.
- Email lowercase/trim/mailto/angle-bracket strip with optional
  --gmail-canonical mode.
- Address USPS abbreviation expansion or compression (expand=False per
  corpus § 6.3), state-name → 2-letter conversion, multi-line collapse,
  PO Box normalization, state-code preservation regardless of input case.
- Name handler: Mc/Mac/O'/D' inner caps, hyphen segments, particle
  lowercasing (von/van/de/da), comma-format reversal, period stripping
  for titles/suffixes/initials, PhD/MD acronym preservation, conservative
  mode for mixed-case input.
- Currency: auto-detect EU vs US separators, space-thousands, Swiss
  apostrophe, accounting parens, optional ISO code preservation, error
  sentinels for percentages/ranges/word-values/ambiguous separators.
- Per-domain error_policy ("passthrough" | "sentinel") for surfacing
  malformed values as <error: reason> per corpus § 0.3.

Test corpus from Business/DataTools/test-cases-format-cleaner copied to
test-cases/format-cleaner-corpus/ — 7 fixtures plus FORMATS-CASES.md.
tests/test_format_standardize_corpus.py drives all 199 rows through the
per-cell standardizers; 0 xfailed.

Wires the GUI page (3_Format_Standardizer.py) to "Ready" status.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 02:11:24 +00:00

514 lines
32 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# FORMATS-CASES.md - `03_format_standardizer.py` Test Corpus
**Version**: 1.0
**Last updated**: April 30, 2026
**Companion to**: TEST-CASES.md (cleaning rules), QUOTE-CASES.md (parser robustness), ENCODINGS-CASES.md (I/O layer).
This corpus tests `03_format_standardizer.py`, which owns "what's there but in the wrong format." Six domains: dates, phones, emails, addresses, names, currencies. Plus a cross-domain integration fixture.
---
## 0. Scope clarifications you should read first
Three issues to surface before the per-domain sections, because they affect what tests are valid in the first place.
### 0.1 Email scope conflict with TECHNICAL.md
USER-GUIDE.md Section 2 lists 03's purpose as "dates, currencies, names, phone numbers, addresses." TECHNICAL.md Section 10.1 item 8 puts email normalization inside `01_deduplicator`'s Tier 1 spec. **Email appears in neither place as part of 03.**
This corpus tests email normalization as if it lives in 03. The reasoning: 03 is "format standardizer" and email is a format like any other. Putting it in 01 means there's no public API for the buyer to normalize emails outside of running dedup, which is a weird ergonomic for the GUI ("To clean my emails I have to run the deduplicator?"). Better factoring: 03 owns email normalization as a public operation; 01 calls into the same `core/` function for matching.
If you disagree, fixture `26_format_emails.csv` and its expected output drop out cleanly without affecting the other five domains. If you agree, update USER-GUIDE.md Section 2 and TECHNICAL.md Section 7's per-bundle technical notes.
### 0.2 Schema preservation rule (TECHNICAL.md Section 9 invariant)
03 changes cell content, never schema. Row count, column count, column order all unchanged. This rules out a few tempting designs:
- Currency normalization that splits `$1,234.56` into separate amount and currency columns — **rejected**. Output stays in one cell.
- Address normalization that splits a single-line address into structured street/city/state/zip columns — **rejected**. Output stays in one cell.
- Phone normalization that splits phone + extension into two columns — **rejected**. Extension goes inline as `;ext=123` (RFC 3966 syntax).
If you want structured output, that's a different script (a parser, not a standardizer).
### 0.3 Boundary with neighboring scripts
| If the cell is... | Owner | 03's behavior |
|---|---|---|
| Empty string | 04 (missing values) | Pass through unchanged. Don't decide if it means "missing." |
| Whitespace-only | 02 (text cleaner) | Should already be empty by the time 03 sees it. If not (CLI user skipped 02), trim defensively. |
| Statistically extreme but format-valid (date in year 1700, phone with 10 zeros) | 06 (outliers) | Format-normalize anyway. Don't flag unusual values. |
| Format-invalid (Feb 30, missing @, letters in numeric) | 03 | Emit error sentinel `<error: <reason>>`. |
| Already correctly formatted | 03 | Pass through. Idempotency required. |
---
## 1. Default configuration
Tests assume the defaults below. Per-flag deviations are called out per case.
| Setting | Default | Notes |
|---|---|---|
| `--date-format` | ISO 8601 | `YYYY-MM-DD` for dates, `YYYY-MM-DDTHH:MM:SS[+ZZ:ZZ]` for datetimes |
| `--locale` | auto-detect | Per-column. Falls back to error if column has no disambiguating value |
| `--two-digit-year-cutoff` | 69 | Python default: years 00-68 → 2000-2068, 69-99 → 1969-1999 |
| `--phone-format` | E.164 | `+<country><digits>`, extensions via `;ext=` |
| `--default-country` | US | Used for phones with no country code |
| `--gmail-canonical` | off | Strip Gmail dots and +tags. Destructive, opt-in |
| `--expand-abbrev` | off | Expand St → Street etc. USPS abbreviation is the default |
| `--name-conservative` | on | Title-case only ALL CAPS or all-lowercase input |
| `--currency-locale` | auto-detect | Per-column. Same fallback as date locale |
| `--error-policy` | sentinel | Errors written as `<error: reason>`. Alternative: raise, skip-row |
| `--columns` | all | All text columns processed; `--columns date,phone` restricts |
**Idempotency requirement**: `format(format(x)) == format(x)` for every cell. Already-clean input passes through unchanged.
---
## 2. Test corpus index
| File | Domain | Cases | Expected outputs |
|---|---|---|---|
| `24_format_dates.csv` | Dates | 45 | Single column |
| `25_format_phones.csv` | Phones | 31 | Single column |
| `26_format_emails.csv` | Emails | 31 | Two columns (default + gmail-canonical) |
| `27_format_addresses.csv` | Addresses | 31 | Two columns (default + expand-abbrev) |
| `28_format_names.csv` | Names | 34 | Single column |
| `29_format_currencies.csv` | Currencies | 27 | Single column |
| `30_format_integration.csv` | Cross-domain | 5 | Multi-column (full row) |
All input fixtures share the schema `case_id, category, description, input` (except integration, which has the full multi-column shape). Expected output files key by `case_id` for diff-by-join testing.
---
## 3. DATES (`24_format_dates.csv`)
### 3.1 Use cases by buyer persona
- **Shopify**: Order export dates joined against manual entries that used a different format. Bookkeeping reports needing consistent date format for sorting.
- **Bookkeeper**: Bank export reconciliation across multiple banks, each using its own date convention. Tax reports requiring consistent year-month grouping.
- **Freelancer**: Client data dumps where the date column is in whatever format the client's locale or software produces.
- **Marketing agency**: Campaign performance data joined across platforms (Google Ads, Facebook Ads, Mailchimp) that all use different date formats.
### 3.2 Test categories
| Category | Cases | What it tests |
|---|---|---|
| iso | FD01-FD06 | ISO 8601 baseline. Already-clean and minor variants (Z vs offset, T vs space) |
| us | FD07-FD10 | M/D/Y format with 2-digit and 4-digit years. Includes one unambiguous case (day > 12) |
| eu | FD11-FD15 | D/M/Y format with various separators. Includes one unambiguous case |
| longform | FD16-FD21 | Month-name formats (full, abbreviated, with weekday, all caps) |
| excel | FD22-FD23 | Excel serial numbers (45306 = 2024-01-15). Critical: Excel CSV exports often have date columns leak through as numbers |
| unix | FD24-FD25 | Unix timestamps in seconds and milliseconds |
| partial | FD26-FD29 | Year-month, quarter, year-only. Coarser-than-day precision |
| edge | FD30-FD40 | Two-digit year ambiguity, leap day validity, Excel 1900 leap year bug, invalid dates, dates buried in other text |
| locale | FD41-FD42 | French and German month names |
| timezone | FD43-FD44 | Named time zones, DST transitions |
| padding | FD45 | Already-clean idempotency check |
### 3.3 Critical policy decisions
**Locale ambiguity (M/D/Y vs D/M/Y)**: Per-column inspection. The cleaner scans all values in the column; if any value has day > 12, locale is unambiguously D/M/Y; if any has month > 12 (impossible in M/D/Y), locale is unambiguously D/M/Y. If nothing disambiguates, error out and require `--locale us|eu`. **Do not silently guess.** Fixture row FD13 (`15/01/2024`) is ambiguous in isolation; FD14 (`30/05/2024`) makes the column unambiguously D/M/Y; in a real column containing both, FD13 resolves to `2024-01-15`.
**Two-digit year cutoff**: Python's default of 69 (years 00-68 → 2000s, 69-99 → 1969-1999). FD30 is `1/15/69` and resolves to `1969-01-15`. This is opinionated and frequently wrong for birth-year columns. Document the flag clearly; the buyer cleaning customer DOB data needs to override.
**Excel serial dates** (FD22, FD23): Detection heuristic — column header contains "date", or all values are integers/floats in range 2556973050 (Jan 1 1970 to Jan 1 2099 in Excel serial). Outside that heuristic the cleaner can't distinguish a date serial from any other number.
**Excel 1900 leap year bug** (FD33): Excel claims 1900-02-29 exists; it doesn't. Detect and emit error. Don't silently accept and roll over to March 1.
**Localized month names** (FD41, FD42): Default cleaner ships with English month names. French/German/Spanish/etc. require a locale dictionary. Either ship one (adds size) or document the limitation. **Recommendation**: ship English + opt-in `--month-locale=fr|de|es` for the others. This corpus tests as if French and German are supported.
**Time zones** (FD43, FD44): Named zones (EST, PST) resolve to fixed offsets, NOT dynamically interpreted with DST rules. EST → -05:00 always. If buyers need DST-aware handling, that's a 04-bundle (out of scope) or an opt-in pyzoneinfo flag.
### 3.4 Edge case: dates buried in text (FD36, FD37)
`Date: 2024-01-15` and `2024-01-15 (verified)` extract to `2024-01-15`. The cleaner uses regex extraction for date-shaped substrings before parsing. **Risk**: false positives from random number sequences. Mitigation: require an unambiguous date pattern (4-digit year + valid month + valid day with explicit separator).
### 3.5 What's not tested
- Calendar systems other than Gregorian (Hijri, Hebrew, Japanese era). Out of scope.
- Recurring date strings (`every 1st of month`). Not a date.
- Date ranges (`2024-01-01 to 2024-01-15`). Out of scope; would require a different cell semantic.
- Sub-millisecond precision. Pandas/datetime tolerate but aren't tested here.
---
## 4. PHONES (`25_format_phones.csv`)
### 4.1 Use cases by buyer persona
- **Shopify**: Customer phone list normalization before Klaviyo/Mailchimp import. SMS campaigns require E.164.
- **Bookkeeper**: Vendor phone deduplication where same vendor has multiple format variants in QuickBooks vs. spreadsheets.
- **Freelancer**: Lead lists from clients in arbitrary formats.
- **Marketing agency**: Multi-platform audience reconciliation; ad platforms increasingly require E.164 for matching.
### 4.2 Test categories
| Category | Cases | What it tests |
|---|---|---|
| us | FP01-FP08 | Common US format variants — plain digits, parens-dash, dots, spaces, country code prefixes |
| ext | FP09-FP11 | Extensions in three syntactic forms (`ext`, `x`, `#`) |
| vanity | FP12-FP13 | Letter-to-digit conversion (1-800-FLOWERS) |
| intl | FP14-FP19 | UK, Germany, France, Japan, Australia |
| e164 | FP20 | Already-E.164 idempotency |
| edge | FP21-FP31 | Insufficient/excess digits, placeholders, multiple numbers per cell, NBSP, smart-quote contamination |
### 4.3 Critical policy decisions
**Default output: E.164** (`+<country><digits>`). Universal storage format. Reverses cleanly to any presentation format if the buyer wants display formatting later.
**Default country**: US, configurable via `--default-country=GB|DE|...`. For mixed-country columns, cleaner needs explicit country detection per-row, which is hard without context. Real-world advice for the buyer: split phone columns by country before normalizing.
**Vanity numbers** (FP12, FP13): Letters convert via standard phone keypad: 2=ABC, 3=DEF, ..., 9=WXYZ. `FLOWERS``3569377`. Loses some information (you can't reverse 3569377 to FLOWERS). Acceptable tradeoff for storage normalization.
**Trunk prefix dropping**: UK domestic format `020 7946 0958` (FP15) has a leading `0` that's a domestic trunk prefix, not part of the actual number. E.164 strips it: `+442079460958`. Same logic for other countries with trunk prefixes.
**Placeholders** (FP23, FP24): All-zeros `000-000-0000` and all-nines `999-999-9999` are conventional "no phone" sentinels in some CRMs. Emit error rather than silently producing a syntactically valid E.164 that's semantically meaningless. **Tradeoff**: a real number that happens to be `999-999-9999` (which doesn't exist in NANP, by the way; 999 is reserved) would error too. Acceptable.
**Multiple numbers** (FP25): Cell containing `555-123-4567 / 555-987-6543`. Don't silently pick one; emit error and tell the user to split first. Splitting is a structural change, not a format change, so it belongs upstream of 03.
**NBSP and smart-quote contamination** (FP27, FP31): Should not reach 03 if 02 ran first. Defensive cleanup is fine; emit a debug log noting the upstream pollution.
### 4.4 What's not tested
- SMS-vs-voice number distinction.
- Carrier lookup. Out of scope; would require a paid service.
- Number portability validation.
- Toll-free number recognition (888, 877, 866, 855, 844, 833) beyond accepting them as valid digits.
---
## 5. EMAILS (`26_format_emails.csv`) — see Section 0.1 for scope caveat
### 5.1 Use cases by buyer persona
- **Shopify**: Customer list cleanup before email-marketing platform import (every duplicate costs money on per-contact pricing). Pre-flight check on order export before re-engagement campaigns.
- **Bookkeeper**: Vendor email list consolidation.
- **Freelancer**: Client communication list normalization.
- **Marketing agency**: List hygiene across multiple lead sources before campaign send.
### 5.2 Test categories
| Category | Cases | What it tests |
|---|---|---|
| basic | FE01-FE04 | Plain ASCII, mixed case, whitespace |
| displayname | FE05-FE07 | RFC display-name forms `Name <email>`, with and without quotes |
| prefix | FE08-FE09 | mailto: prefix |
| gmail | FE10-FE14 | Gmail-specific dot-equivalence and +tag handling. Includes negative cases (non-Gmail domains) that must NOT be touched |
| idn | FE15-FE16 | Internationalized domain names; Unicode in local part |
| trailing | FE17-FE20 | Punctuation contamination from copy-paste contexts |
| smartquote | FE21 | Word-paste damage |
| invalid | FE22-FE26 | Missing @, double @, multiple @, internal whitespace, no TLD |
| multiple | FE27-FE28 | Multiple emails in one cell |
| edge | FE29-FE31 | Empty, whitespace-only, already-perfect |
### 5.3 Critical policy decisions
**Default behavior**: lowercase, trim, strip `mailto:`, strip wrapping `<>`, extract from `Display Name <email>` form. **Does NOT strip Gmail dots or +tags by default.** Those normalizations are destructive (`alice` and `a.l.i.c.e` aren't the same email per RFC; only Gmail's specific provider policy treats them as equivalent).
**Aggressive mode (`--gmail-canonical`)**: Strip dots and +tags for `@gmail.com` only. Preserve them for all other domains, even if those domains have similar policies (some custom Google Workspace domains, some other providers). Don't second-guess provider policy.
**FE13 and FE14 are critical negative tests**: a non-Gmail domain with dots or +tag must NOT be touched even in `--gmail-canonical` mode. Many cleaners get this wrong — they apply Gmail's policy to all domains, which corrupts data.
**IDN handling** (FE15, FE16): Don't punycode-convert by default. Buyers who need ASCII-only output for legacy systems can opt in via `--punycode`. Default is to preserve Unicode in domain and local parts.
**Display-name extraction** (FE05, FE06): Drop the display name. The cleaner extracts the email and discards `Alice Smith`. **Tradeoff**: information loss. Alternative would be to preserve display name in a separate column, but that violates schema preservation (Section 0.2). Buyers who want to keep display names should split the column upstream.
**Multiple emails per cell** (FE27, FE28): Error, don't pick one. Same rationale as multiple phones.
### 5.4 What's not tested
- Email syntax validation per full RFC 5321/5322 (which permits all sorts of legitimately weird inputs like quoted-string locals). The cleaner uses a "good enough for 99% of real data" regex, not a full RFC parser.
- Disposable-email-domain detection. Out of scope for format cleaning; that's data quality.
- DNS / MX validation. Out of scope; requires network access.
- Email-address-as-username (where domain is a hostname not an internet domain). Errors as TLD-less.
---
## 6. ADDRESSES (`27_format_addresses.csv`)
### 6.1 Use cases by buyer persona
- **Shopify**: Customer address normalization for shipping label generation; reduces failed deliveries.
- **Bookkeeper**: Vendor master record cleanup; consistent format for bookkeeping software import.
- **Freelancer**: Client address book consolidation.
- **Marketing agency**: Direct mail audience cleanup.
### 6.2 Test categories
| Category | Cases | What it tests |
|---|---|---|
| clean | FA01 | Already-USPS-formatted idempotency |
| case | FA02-FA04 | All-caps, all-lowercase, mixed-case (preserve) |
| abbrev | FA05-FA08 | Street type expansion/abbreviation, periods after abbreviations |
| directional | FA09-FA11 | North/N, NORTH/N, NE compounds |
| unit | FA12-FA14 | Apartment/Apt, # / Apt, Suite/Ste |
| state | FA15-FA16 | State name → 2-letter code |
| zip | FA17-FA18 | ZIP+4, leading-zero ZIPs (Massachusetts 02xxx) |
| multiline | FA19 | `\n`-separated address fields |
| pobox | FA20-FA22 | Post Office Box variants |
| housenum | FA23-FA25 | Letter suffix, hyphen, half-number |
| non_us | FA26-FA28 | UK, Canada, Japan (minimal handling) |
| edge | FA29-FA31 | Empty, partial, trailing comma |
### 6.3 Critical policy decisions
**US-first scope**: USPS abbreviations and state codes are the default. International addresses get whitespace + capitalization only. Document this clearly; buyers with significant non-US data should expect format drift.
**USPS abbreviations as the default** (St, Ave, Blvd) rather than spelled-out forms. Reasoning: USPS recommends abbreviations; most CRMs expect them; they save space in tabular display. The `--expand-abbrev` flag inverts this for buyers whose downstream system requires full forms.
**Multi-line collapse** (FA19): `123 Main St\nApt 4B\nNew York, NY 10001` becomes `123 Main St, Apt 4B, New York, NY 10001`. Consistent comma-separated single-line format. **Reverse direction not supported** — the cleaner doesn't take a single-line address and split into multi-line (that's structural).
**State expansion vs abbreviation** (FA15, FA16): Default is 2-letter code (`NY`). The `--expand-abbrev` flag expands to full state name. Note: this is the OPPOSITE direction from street type abbreviations. State codes are universally expected in tabular data; full state names are only preferred in some downstream systems' "pretty" formats.
**ZIP leading zeros** (FA18): If the column is already a ZIP-shaped string with leading zeros, preserve them. **Cannot restore lost leading zeros** — Excel-stripped `2101` (Massachusetts) cannot be confidently recovered to `02101` because `2101` could legitimately be `2101` (Idaho). Mention this as a known limitation; recommend the buyer fix at the source.
**Canada handling** (FA27): Canadian addresses use the same street-type conventions as US, so `St``St` works. Postal code format is preserved as-is.
**Japan / non-Western** (FA28): Field order is reversed (postal code first, then large-to-small geography). Default cleaner doesn't try to restructure; minimal handling only.
### 6.4 What's not tested
- Address verification against USPS database. Out of scope; would require a paid service or local USPS data.
- Geocoding to lat/long. Out of scope.
- Unit number parsing for buildings with non-standard nomenclatures.
- Military addresses (APO, FPO, DPO) beyond accepting them.
- Rural Route, Highway Contract, General Delivery formats.
---
## 7. NAMES (`28_format_names.csv`)
### 7.1 Use cases by buyer persona
- **Shopify**: Customer list display normalization. ALL-CAPS imports from older systems become readable.
- **Bookkeeper**: Vendor name consistency across QuickBooks and spreadsheets.
- **Freelancer**: Client list capitalization cleanup.
- **Marketing agency**: First-name personalization in email campaigns (`Hi alice` vs `Hi Alice`).
### 7.2 Test categories
| Category | Cases | What it tests |
|---|---|---|
| case | FN01-FN04 | All-caps, all-lowercase, already-correct, random-case |
| scots | FN05-FN08 | Mc and Mac prefixes |
| irish | FN09-FN11 | O' prefix |
| hyphen | FN12-FN13 | Hyphenated names |
| particle | FN14-FN17 | von, van, de, da (Germanic, Dutch, French, Italian) |
| title | FN18-FN20 | Mr, Dr, Prof |
| suffix | FN21-FN23 | Jr, III, PhD |
| comma | FN24-FN26 | "Last, First" reversal to "First Last" |
| initial | FN27-FN28 | Middle initial, multi-initial |
| nonlatin | FN29-FN31 | Korean, Japanese, Russian (preserve) |
| edge | FN32-FN34 | Single name, empty, whitespace-only |
### 7.3 Critical policy decisions
**Conservative by default**: Title-case ONLY when input is ALL CAPS or all lowercase. Mixed-case input is preserved as-is (FN04: `aLiCe SmItH``aLiCe SmItH`). Reasoning: people have idiosyncratic spellings (`danah boyd`, `bell hooks`) that the cleaner should never overwrite. If the buyer wants aggressive title-casing, that's `--name-aggressive`.
**Mc vs Mac** (FN05-FN08): Default convention is `McDonald` (cap after Mc) and `MacDonald` (cap after Mac). Some Mac-prefixed names should be `Macdonald` (cap only on Mac). Without a names dictionary, the cleaner can't distinguish. Default to capitalizing — produces `MacDonald` for ambiguous cases. Buyers with significant Scottish/Irish customer bases may need a custom override list.
**Particles** (FN14-FN17): Particles like `von`, `van`, `de`, `da` stay lowercase. This is the convention for people with surnames containing these words (`Vincent van Gogh`, `Charles de Gaulle`). **Note**: at the start of a sentence or in last-name-first contexts (`De Gaulle, Charles`), capitalization rules invert. This corpus tests the natural-order case only.
**Comma format reversal** (FN24-FN26): `Smith, John``John Smith`. **Tradeoff**: irreversibly destroys the comma-format. If the buyer's downstream system expects "Last, First" format, they need `--name-format=last-first`. Default is natural reading order.
**Titles and suffixes**:
- Title period stripping: `Mr.``Mr`. Some style guides keep the period; this corpus drops it for consistency. `--keep-title-periods` flag if buyers prefer.
- Roman numerals (`II`, `III`, `IV`) stay all-caps. They aren't names; they're numerals.
- `PhD`, `MD`, `Esq` keep their conventional case. Don't lower-case them.
**Non-Latin scripts** (FN29-FN31): Pass through unchanged. Title-casing rules don't apply to scripts without case (Korean, Japanese, Chinese, Arabic, Hebrew, etc.). Cyrillic does have case but the conservative-by-default rule applies — only ALL CAPS gets title-cased.
**Single names** (FN32): Madonna, Cher, Pelé. Pass through unchanged when input is already title-case.
### 7.4 What's not tested
- Honorific stacking (`Dr. Mr. Jane Smith` — pathological, rare, hard).
- Cultural name-order detection (East Asian family-first vs Western given-first). Without a column-level signal the cleaner can't guess.
- Nickname expansion (`Bob``Robert`). Out of scope; that's data enrichment, not standardization.
- Name part identification (which token is given, family, middle). Belongs to a parser, not a standardizer.
---
## 8. CURRENCIES (`29_format_currencies.csv`)
### 8.1 Use cases by buyer persona
- **Shopify**: Order amount normalization across multi-currency stores.
- **Bookkeeper**: Bank export reconciliation; mixed bank formats produce different currency representations.
- **Freelancer**: Invoice data normalization.
- **Marketing agency**: Campaign spend normalization across ad platforms.
### 8.2 Test categories
| Category | Cases | What it tests |
|---|---|---|
| us | FC01-FC07 | $ prefix/suffix, comma thousands, dot decimal, USD code prefix/suffix |
| eu | FC08-FC11 | € prefix, dot thousands and comma decimal, space thousands, Swiss apostrophe |
| intl | FC12-FC14 | £, ¥ (no decimal), ₹ (lakhs grouping) |
| negative | FC15-FC17 | Leading minus, accounting parens, sign after symbol |
| edge | FC18-FC25 | Zero, scientific, percentage, range, word values, empty, idempotency |
| ambig | FC26-FC27 | Locale-ambiguous separator (`1,234` could be 1234 or 1.234) |
### 8.3 Critical policy decisions
**Output format**: `<symbol_or_code><normalized_number>`. Number uses dot decimal, no thousand separators, leading minus for negative. Currency symbol or code preserved if present in input; if no currency indicator, output is just the number.
**Locale ambiguity** (FC26, FC27): `1,234` is `1234` in US English and `1.234` in German. `1.234` is `1.234` in US English and `1234` in German. Per-column inspection: any value with both `,` and `.` (like `1,234.56`) locks the locale unambiguously; otherwise the cleaner errors and demands `--currency-locale=us|eu`. **Do not silently guess.**
**Accounting parens** (FC16): `($100.00)``-$100.00`. Standard accounting convention. The leading minus is more universally readable than the parens.
**Currency symbol position**: Preserved. `$100` stays prefix-symbol; `100$` (rare but seen) stays suffix-symbol; `100 USD` keeps the suffix-code form. Reasoning: changing position is destructive and the buyer can do it themselves with a simple find-replace if they want.
**Indian lakhs grouping** (FC14): `₹1,23,456.78` flattens to `₹123456.78`. Lakhs grouping (groups of 2 after the first 3) is unusual outside India and breaks downstream tools that expect Western thousand-grouping.
**JPY no decimal** (FC13): Japanese yen conventionally has no fractional part. `¥1,234``¥1234`. The cleaner doesn't add a decimal that wasn't there.
**Scientific notation** (FC19): `1.5e6``1500000`. Expand to plain notation for spreadsheet compatibility. Loses the "this was scientific" information; acceptable tradeoff.
**Percentages** (FC20): Error. Percentage and currency are different domains. If the column is meant for percentages, that's not currency.
**Ranges** (FC21): Error. Same reasoning as multi-emails; structural split needed.
**Word values** (FC22, FC23): `Free`, `TBD`, `N/A`. Error. The buyer might want these mapped to `0` (Free) or empty (TBD/N/A), but those are domain decisions the cleaner can't make safely.
### 8.4 What's not tested
- Cross-currency conversion (USD to EUR via exchange rate). Massively out of scope.
- Cryptocurrency formats (BTC, ETH amounts with high decimal precision). Out of scope.
- Historical currency notation (pre-decimalization £.s.d). Out of scope.
- Currency code standardization (USD vs US$ vs $US). Default: pass through whatever's there.
---
## 9. INTEGRATION (`30_format_integration.csv`)
### 9.1 Purpose
Five rows, each a complete record with one or more format issues across multiple columns. Tests that running 03 across multiple columns in one pass produces consistent output and doesn't drop or scramble fields.
### 9.2 Per-row test goals
| Row | What it tests |
|---|---|
| FI01 | Standard messy-but-cleanable record. All six format types in one row. Tests that no domain's normalizer interferes with another's. |
| FI02 | International record (UK address, EUR currency, German-format date, mailto-prefixed Gmail address, comma-format Mc-name). Tests cross-domain locale handling. |
| FI03 | Errors (insufficient phone digits) and complex name (DR + JANE DOE + PHD title+name+suffix). Tests error handling and complex name parsing. |
| FI04 | All empty. Tests that empty cells pass through without errors. |
| FI05 | Already-clean record. Idempotency check — the entire row should round-trip unchanged. |
### 9.3 What this fixture catches that single-domain fixtures don't
- **Cross-column interference**: a name normalizer that reaches into the email column, or vice versa.
- **Schema drift**: a normalizer that adds, removes, or reorders columns.
- **Error-handling consistency**: when one column errors (FI03's phone), other columns in the same row still process correctly.
- **Idempotency at the row level**: FI05 must produce byte-identical output.
---
## 10. Suggested test workflow
```python
import csv
from pathlib import Path
from src.core.format_standardizer import standardize # your impl
FORMATS = Path("test_data/formats")
EXPECTED = Path("expected/formats")
def test_single_column_domain(domain):
"""Test FD/FP/FE/FA/FN/FC fixtures with single-column expected output."""
inp = FORMATS / f"{domain}.csv"
exp = EXPECTED / f"{domain}_expected.csv"
with inp.open() as f:
cases = {r["case_id"]: r for r in csv.DictReader(f)}
with exp.open() as f:
expected = {r["case_id"]: r for r in csv.DictReader(f)}
failures = []
for case_id, case in cases.items():
got = standardize(case["input"], domain=domain.split("_")[1])
want = expected[case_id]["output"]
if got != want:
failures.append((case_id, case["input"], got, want))
return failures
# Test each domain
for domain in ["24_format_dates", "25_format_phones", "28_format_names",
"29_format_currencies"]:
failures = test_single_column_domain(domain)
print(f"{domain}: {len(failures)} failures")
# Email and address have two-policy expected output
def test_two_policy(domain, policy_columns):
inp = FORMATS / f"{domain}.csv"
exp = EXPECTED / f"{domain}_expected.csv"
with inp.open() as f:
cases = {r["case_id"]: r for r in csv.DictReader(f)}
with exp.open() as f:
expected = {r["case_id"]: r for r in csv.DictReader(f)}
for policy in policy_columns:
failures = []
for case_id, case in cases.items():
got = standardize(case["input"], domain=domain.split("_")[1],
mode=policy)
want = expected[case_id][f"output_{policy}"]
if got != want:
failures.append((case_id, case["input"], got, want))
print(f"{domain} ({policy}): {len(failures)} failures")
test_two_policy("26_format_emails", ["default", "gmail_canonical"])
test_two_policy("27_format_addresses", ["default", "expand_abbrev"])
# Idempotency property test
import random
all_inputs = []
for domain in ["24_format_dates", "25_format_phones", "26_format_emails",
"27_format_addresses", "28_format_names", "29_format_currencies"]:
with (FORMATS / f"{domain}.csv").open() as f:
all_inputs.extend((domain, r["input"]) for r in csv.DictReader(f))
for domain, inp in all_inputs:
once = standardize(inp, domain=domain.split("_")[1])
twice = standardize(once, domain=domain.split("_")[1])
assert once == twice, f"non-idempotent: {domain} {inp!r} -> {once!r} -> {twice!r}"
```
---
## 11. What this corpus does NOT cover
Listed so the gaps are explicit:
1. **Performance**. All fixtures are small. Format standardization on a 500MB customer file may have memory or speed issues; benchmark separately.
2. **Cross-script integration with 02 and 04**. This corpus tests 03 in isolation. Running 02 → 03 → 04 in pipeline is a separate integration concern.
3. **GUI behavior**. Single-cell preview, per-row preview, domain auto-detection from column headers. Each is a Streamlit-layer test, not a transformation test.
4. **Custom locale dictionaries**. The fixtures assume the cleaner ships with English month names and US-default phone country. Customers who buy this product and then complain that German months aren't recognized are flagging a feature request, not a bug.
5. **URLs**. Listed in BUSINESS.md's adjacent territory but not in 03's scope. If you want URL standardization, that's a feature request.
6. **Booleans / yes-no normalization**. `Y` / `Yes` / `1` / `True``true`. Borderline 03 territory but explicitly excluded; can be added as a 7th domain if buyers ask for it.
7. **Postal codes outside US/UK/Canada**. ZIP-style validation only for US.
8. **Identifiers (SKU, SSN, EIN)**. Out of scope; too domain-specific.
---
## 12. How to extend the corpus
**Add a new test case in an existing domain**:
1. Edit the relevant fixture's row list in `generate_format_test_files.py`.
2. Add the corresponding expected output entry.
3. Re-run the generator.
4. If the new case is a category not yet listed, update the per-domain category table in this document.
**Add a new domain (e.g., URLs)**:
1. Define use cases by persona.
2. Define policy decisions and which require a flag vs. being default.
3. Build the input fixture as `31_format_<domain>.csv` and the expected output as `31_format_<domain>_expected.csv`.
4. Add a Section 13 to this document covering the domain.
5. Update the index table in Section 2.
**Add a new policy variant to an existing domain**:
1. Add a new column to the expected output file (e.g., `output_strict`).
2. Document the new policy and what triggers it (which flag) in the domain's Section 5.3 (or equivalent).
3. The two-policy test in Section 10's workflow generalizes to N-policy.