feat(format): per-cell standardizers + 199-row buyer corpus

Adds src/core/format_standardize.py — a per-cell standardizer for dates, phones, emails, addresses, names, currencies, booleans — wired through StandardizeOptions / standardize_dataframe with FieldType registry. Includes: - Date parser handles ISO/US/EU/longform/excel-serial/unix-timestamp/ partial-precision/quarter notation; opt-in French/German/Spanish month dictionaries via month_locales. - Phone via libphonenumber with extension preservation (;ext=N), 001 international prefix handling, error sentinels for placeholders / multi-number cells. - Email lowercase/trim/mailto/angle-bracket strip with optional --gmail-canonical mode. - Address USPS abbreviation expansion or compression (expand=False per corpus § 6.3), state-name → 2-letter conversion, multi-line collapse, PO Box normalization, state-code preservation regardless of input case. - Name handler: Mc/Mac/O'/D' inner caps, hyphen segments, particle lowercasing (von/van/de/da), comma-format reversal, period stripping for titles/suffixes/initials, PhD/MD acronym preservation, conservative mode for mixed-case input. - Currency: auto-detect EU vs US separators, space-thousands, Swiss apostrophe, accounting parens, optional ISO code preservation, error sentinels for percentages/ranges/word-values/ambiguous separators. - Per-domain error_policy ("passthrough" | "sentinel") for surfacing malformed values as <error: reason> per corpus § 0.3. Test corpus from Business/DataTools/test-cases-format-cleaner copied to test-cases/format-cleaner-corpus/ — 7 fixtures plus FORMATS-CASES.md. tests/test_format_standardize_corpus.py drives all 199 rows through the per-cell standardizers; 0 xfailed. Wires the GUI page (3_Format_Standardizer.py) to "Ready" status. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 02:11:24 +00:00
parent 3f007ef3d6
commit 4adeb5c7f3
14 changed files with 4364 additions and 69 deletions
--- a/test-cases/format-cleaner-corpus/24_format_dates.csv
+++ b/test-cases/format-cleaner-corpus/24_format_dates.csv
@@ -0,0 +1,46 @@
+case_id,category,description,input
+FD01,iso,ISO date plain,2024-01-15
+FD02,iso,ISO datetime no zone,2024-01-15T10:30:00
+FD03,iso,ISO datetime UTC,2024-01-15T10:30:00Z
+FD04,iso,ISO datetime offset,2024-01-15T10:30:00+05:00
+FD05,iso,ISO datetime with millis,2024-01-15T10:30:00.123Z
+FD06,iso,ISO datetime space separator,2024-01-15 10:30:00
+FD07,us,US slash 4-digit year,01/15/2024
+FD08,us,US slash 2-digit year,1/15/24
+FD09,us,US slash no leading zero,1/5/2024
+FD10,us,US slash unambiguous (day > 12),5/30/2024
+FD11,eu,EU dot 4-digit year,15.01.2024
+FD12,eu,EU dot 2-digit year,15.01.24
+FD13,eu,EU slash 4-digit year,15/01/2024
+FD14,eu,EU slash unambiguous (day > 12),30/05/2024
+FD15,eu,EU dash format,15-01-2024
+FD16,longform,Month name long,"January 15, 2024"
+FD17,longform,Month name short,"Jan 15, 2024"
+FD18,longform,Day-month-year long,15 January 2024
+FD19,longform,Day-month-year short,15 Jan 2024
+FD20,longform,With weekday,"Monday, January 15, 2024"
+FD21,longform,All caps month,JAN 15 2024
+FD22,excel,Excel serial date,45306
+FD23,excel,Excel serial with fractional time,45306.4375
+FD24,unix,Unix timestamp seconds,1705320000
+FD25,unix,Unix timestamp milliseconds,1705320000000
+FD26,partial,Year-month only ISO,2024-01
+FD27,partial,Year-month text,January 2024
+FD28,partial,Quarter notation,Q1 2024
+FD29,partial,Year only,2024
+FD30,edge,Two-digit year ambiguity (1969 vs 2069),1/15/69
+FD31,edge,Leap day valid,2024-02-29
+FD32,edge,Leap day invalid (not a leap year),2023-02-29
+FD33,edge,Excel 1900 leap year bug,1900-02-29
+FD34,edge,Invalid month,2024-13-15
+FD35,edge,Invalid day,2024-04-31
+FD36,edge,Date with extraneous text,Date: 2024-01-15
+FD37,edge,Date in parens annotation,2024-01-15 (verified)
+FD38,edge,Empty,
+FD39,edge,Whitespace-only,   
+FD40,edge,Garbage,not a date
+FD41,locale,French month name,15 janvier 2024
+FD42,locale,German month name,15. Januar 2024
+FD43,timezone,Datetime with named tz,2024-01-15 10:30:00 EST
+FD44,timezone,Datetime with offset and DST ambiguity,2024-03-10 02:30:00-05:00
+FD45,padding,Already-clean: pass through,2024-01-15
--- a/test-cases/format-cleaner-corpus/25_format_phones.csv
+++ b/test-cases/format-cleaner-corpus/25_format_phones.csv
@@ -0,0 +1,32 @@
+case_id,category,description,input
+FP01,us,Plain digits 10,5551234567
+FP02,us,Standard formatting,(555) 123-4567
+FP03,us,Dashes,555-123-4567
+FP04,us,Dots,555.123.4567
+FP05,us,Spaces,555 123 4567
+FP06,us,With country code +1,+1 555 123 4567
+FP07,us,With country code 1- prefix,1-555-123-4567
+FP08,us,With 001 prefix,001 555 123 4567
+FP09,ext,Extension ext keyword,555-123-4567 ext 123
+FP10,ext,Extension x abbreviation,555-123-4567 x123
+FP11,ext,Extension hash,555-123-4567 #123
+FP12,vanity,Vanity number 1-800-FLOWERS,1-800-FLOWERS
+FP13,vanity,Mixed letters and digits,555-CALL-NOW
+FP14,intl,UK with +44,+44 20 7946 0958
+FP15,intl,UK domestic,020 7946 0958
+FP16,intl,Germany with +49,+49 30 12345678
+FP17,intl,France with +33,+33 1 23 45 67 89
+FP18,intl,Japan with +81,+81-3-1234-5678
+FP19,intl,Australia with +61,+61 2 1234 5678
+FP20,e164,Already E.164 format,+15551234567
+FP21,edge,Too few digits (local-only),555-1234
+FP22,edge,Too many digits,1-555-123-4567-extra-99
+FP23,edge,All-zeros placeholder,000-000-0000
+FP24,edge,All-nines placeholder,999-999-9999
+FP25,edge,Multiple numbers in cell,555-123-4567 / 555-987-6543
+FP26,edge,Mismatched parens,555-(123)-4567
+FP27,edge,NBSP in number,555 123 4567
+FP28,edge,Very spaced,5 5 5 1 2 3 4 5 6 7
+FP29,edge,Empty,
+FP30,edge,Non-phone string,TBD
+FP31,edge,Smart-apostrophe contamination,555’s 123-4567
--- a/test-cases/format-cleaner-corpus/26_format_emails.csv
+++ b/test-cases/format-cleaner-corpus/26_format_emails.csv
@@ -0,0 +1,32 @@
+case_id,category,description,input
+FE01,basic,Plain ASCII,alice@example.com
+FE02,basic,Mixed case,Alice@Example.COM
+FE03,basic,All caps,ALICE@EXAMPLE.COM
+FE04,basic,Whitespace padding,  alice@example.com  
+FE05,displayname,Display name no quotes,Alice Smith <alice@example.com>
+FE06,displayname,Display name with quotes,"""Alice Smith"" <alice@example.com>"
+FE07,displayname,Wrapped in angle brackets only,<alice@example.com>
+FE08,prefix,mailto: prefix,mailto:alice@example.com
+FE09,prefix,MAILTO: caps,MAILTO:Alice@Example.com
+FE10,gmail,Gmail with dots,a.l.i.c.e@gmail.com
+FE11,gmail,Gmail with +tag,alice+newsletter@gmail.com
+FE12,gmail,Gmail with both,a.l.i.c.e+work@gmail.com
+FE13,gmail,Non-Gmail with dots (don't touch),a.l.i.c.e@example.com
+FE14,gmail,Non-Gmail with +tag (don't touch),alice+newsletter@example.com
+FE15,idn,Unicode in domain,alice@münchen.de
+FE16,idn,Unicode in local,アリス@example.jp
+FE17,trailing,Trailing comma,"alice@example.com,"
+FE18,trailing,Trailing period,alice@example.com.
+FE19,trailing,Trailing closing paren,alice@example.com)
+FE20,trailing,Trailing semicolon,alice@example.com;
+FE21,smartquote,Wrapped in curly quotes,“alice@example.com”
+FE22,invalid,Missing @,aliceexample.com
+FE23,invalid,Double @,alice@@example.com
+FE24,invalid,Multiple @,alice@example@com
+FE25,invalid,Spaces inside,alice @ example.com
+FE26,invalid,TLD-less local network,alice@localhost
+FE27,multiple,Two comma-separated,"alice@example.com, bob@example.com"
+FE28,multiple,Two semicolon-separated,alice@example.com; bob@example.com
+FE29,edge,Empty,
+FE30,edge,Whitespace-only,   
+FE31,edge,Already perfect,alice@example.com
--- a/test-cases/format-cleaner-corpus/27_format_addresses.csv
+++ b/test-cases/format-cleaner-corpus/27_format_addresses.csv
@@ -0,0 +1,34 @@
+case_id,category,description,input
+FA01,clean,Already USPS-formatted,"123 Main St, New York, NY 10001"
+FA02,case,All caps,"123 MAIN STREET, NEW YORK, NY 10001"
+FA03,case,All lowercase,"123 main street, new york, ny 10001"
+FA04,case,Mixed case (preserve),"123 Main Street, New York, NY 10001"
+FA05,abbrev,Street spelled out,"123 Main Street, New York, NY 10001"
+FA06,abbrev,Avenue spelled out,"456 Park Avenue, New York, NY 10001"
+FA07,abbrev,Boulevard spelled out,"789 Sunset Boulevard, Los Angeles, CA 90028"
+FA08,abbrev,St with period,"123 Main St., New York, NY 10001"
+FA09,directional,North spelled out,"123 North Main St, City, ST 12345"
+FA10,directional,NORTH all caps,"123 NORTH Main St, City, ST 12345"
+FA11,directional,NE compound,"123 NE Main St, City, ST 12345"
+FA12,unit,Apartment spelled out,"123 Main St, Apartment 4B, City, ST 12345"
+FA13,unit,Hash sign,"123 Main St, # 4B, City, ST 12345"
+FA14,unit,Suite spelled out,"123 Main St, Suite 200, City, ST 12345"
+FA15,state,State spelled out,"123 Main St, New York, New York 10001"
+FA16,state,State all caps spelled out,"123 Main St, New York, NEW YORK 10001"
+FA17,zip,ZIP+4,"123 Main St, New York, NY 10001-1234"
+FA18,zip,Leading-zero ZIP (MA),"123 Main St, Boston, MA 02101"
+FA19,multiline,Multi-line address,"123 Main St
+Apt 4B
+New York, NY 10001"
+FA20,pobox,PO Box with periods,"P.O. Box 123, City, ST 12345"
+FA21,pobox,PO Box without periods,"PO Box 123, City, ST 12345"
+FA22,pobox,Post Office Box spelled out,"Post Office Box 123, City, ST 12345"
+FA23,housenum,Letter suffix,"123A Main St, City, ST 12345"
+FA24,housenum,Hyphen number,"123-1 Main St, City, ST 12345"
+FA25,housenum,Half number,"123 1/2 Main St, City, ST 12345"
+FA26,non_us,UK postcode address,"10 Downing Street, London, SW1A 2AA"
+FA27,non_us,Canada postal code,"1 Yonge St, Toronto, ON M5E 1W7"
+FA28,non_us,Japan reverse-order,"100-0001, Tokyo, Chiyoda, Marunouchi 1-1"
+FA29,edge,Empty,
+FA30,edge,Just a city,New York
+FA31,edge,Trailing comma,"123 Main St, New York, NY 10001,"
--- a/test-cases/format-cleaner-corpus/28_format_names.csv
+++ b/test-cases/format-cleaner-corpus/28_format_names.csv
@@ -0,0 +1,35 @@
+case_id,category,description,input
+FN01,case,All caps,ALICE SMITH
+FN02,case,All lowercase,alice smith
+FN03,case,Already title case (preserve),Alice Smith
+FN04,case,Random case (preserve),aLiCe SmItH
+FN05,scots,McDonald lowercase,mcdonald
+FN06,scots,MCDONALD all caps,MCDONALD
+FN07,scots,MacDonald,macdonald
+FN08,scots,McTaggart already correct,McTaggart
+FN09,irish,O'Connor lowercase,o'connor
+FN10,irish,O'CONNOR all caps,O'CONNOR
+FN11,irish,O'Brien preserve,O'Brien
+FN12,hyphen,Mary-Jane lowercase,mary-jane smith
+FN13,hyphen,Smith-Jones,smith-jones
+FN14,particle,von Trapp,von trapp
+FN15,particle,Vincent van Gogh,vincent van gogh
+FN16,particle,Charles de Gaulle,charles de gaulle
+FN17,particle,Leonardo da Vinci,leonardo da vinci
+FN18,title,Mr period,Mr. John Smith
+FN19,title,DR caps,DR JANE DOE
+FN20,title,Prof preserve,Prof Alice Williams
+FN21,suffix,Jr period,John Smith Jr.
+FN22,suffix,III roman numeral,John Smith III
+FN23,suffix,PhD,Jane Doe PhD
+FN24,comma,"Last, First","Smith, John"
+FN25,comma,"LAST, FIRST","SMITH, JOHN"
+FN26,comma,"Last, First Middle","Smith, John Andrew"
+FN27,initial,Middle initial,John A. Smith
+FN28,initial,Multi-initial author,j.k. rowling
+FN29,nonlatin,Korean,김철수
+FN30,nonlatin,Japanese,田中太郎
+FN31,nonlatin,Russian,Иван Иванов
+FN32,edge,Single name,Madonna
+FN33,edge,Empty,
+FN34,edge,Whitespace-only,   
--- a/test-cases/format-cleaner-corpus/29_format_currencies.csv
+++ b/test-cases/format-cleaner-corpus/29_format_currencies.csv
@@ -0,0 +1,28 @@
+case_id,category,description,input
+FC01,us,Standard US dollar,"$1,234.56"
+FC02,us,US no comma,$1234.56
+FC03,us,US space after symbol,"$ 1,234.56"
+FC04,us,US no symbol,"1,234.56"
+FC05,us,US with code suffix,"1,234.56 USD"
+FC06,us,US with code prefix,"USD 1,234.56"
+FC07,us,US trailing symbol,1234.56$
+FC08,eu,Euro standard,"€1.234,56"
+FC09,eu,Euro space thousand,"€1 234,56"
+FC10,eu,Euro code suffix,"1.234,56 EUR"
+FC11,eu,Swiss apostrophe thousand,1'234.56
+FC12,intl,GBP,"£1,234.56"
+FC13,intl,JPY no decimal,"¥1,234"
+FC14,intl,Indian rupees lakhs,"₹1,23,456.78"
+FC15,negative,Leading minus,-$100.00
+FC16,negative,Accounting parens,($100.00)
+FC17,negative,Sign after symbol,$-100.00
+FC18,edge,Zero,$0.00
+FC19,edge,Scientific notation,1.5e6
+FC20,edge,Percentage,15.5%
+FC21,edge,Range (not normalizable),$50-$100
+FC22,edge,Word value,Free
+FC23,edge,TBD placeholder,TBD
+FC24,edge,Empty,
+FC25,edge,Already clean,1234.56
+FC26,ambig,"1,234 - could be US 1234 or EU 1.234","1,234"
+FC27,ambig,1.234 - could be US 1.234 or EU 1234,1.234
--- a/test-cases/format-cleaner-corpus/30_format_integration.csv
+++ b/test-cases/format-cleaner-corpus/30_format_integration.csv
@@ -0,0 +1,6 @@
+case_id,name,email,phone,date,amount,address
+FI01,ALICE SMITH,Alice@Example.COM,(555) 123-4567,1/15/24,"$1,234.56","123 main street, new york, ny 10001"
+FI02,"mcdonald, john",mailto:John@gmail.com,+44 20 7946 0958,15.01.2024,"€1.234,56","10 DOWNING STREET, LONDON, SW1A 2AA"
+FI03,DR JANE DOE PHD,"""Jane Doe"" <jane@example.com>",555-1234,"Jan 15, 2024",($100.00),"456 Park Avenue, Apt 12, New York, NEW YORK 10001"
+FI04,,,,,,
+FI05,Already Clean,alice@example.com,+15551234567,2024-01-15,1234.56,"123 Main St, New York, NY 10001"
--- a/test-cases/format-cleaner-corpus/FORMATS-CASES.md
+++ b/test-cases/format-cleaner-corpus/FORMATS-CASES.md
@@ -0,0 +1,513 @@
+# FORMATS-CASES.md - `03_format_standardizer.py` Test Corpus
+
+**Version**: 1.0
+**Last updated**: April 30, 2026
+**Companion to**: TEST-CASES.md (cleaning rules), QUOTE-CASES.md (parser robustness), ENCODINGS-CASES.md (I/O layer).
+
+This corpus tests `03_format_standardizer.py`, which owns "what's there but in the wrong format." Six domains: dates, phones, emails, addresses, names, currencies. Plus a cross-domain integration fixture.
+
+---
+
+## 0. Scope clarifications you should read first
+
+Three issues to surface before the per-domain sections, because they affect what tests are valid in the first place.
+
+### 0.1 Email scope conflict with TECHNICAL.md
+
+USER-GUIDE.md Section 2 lists 03's purpose as "dates, currencies, names, phone numbers, addresses." TECHNICAL.md Section 10.1 item 8 puts email normalization inside `01_deduplicator`'s Tier 1 spec. **Email appears in neither place as part of 03.**
+
+This corpus tests email normalization as if it lives in 03. The reasoning: 03 is "format standardizer" and email is a format like any other. Putting it in 01 means there's no public API for the buyer to normalize emails outside of running dedup, which is a weird ergonomic for the GUI ("To clean my emails I have to run the deduplicator?"). Better factoring: 03 owns email normalization as a public operation; 01 calls into the same `core/` function for matching.
+
+If you disagree, fixture `26_format_emails.csv` and its expected output drop out cleanly without affecting the other five domains. If you agree, update USER-GUIDE.md Section 2 and TECHNICAL.md Section 7's per-bundle technical notes.
+
+### 0.2 Schema preservation rule (TECHNICAL.md Section 9 invariant)
+
+03 changes cell content, never schema. Row count, column count, column order all unchanged. This rules out a few tempting designs:
+
+- Currency normalization that splits `$1,234.56` into separate amount and currency columns — **rejected**. Output stays in one cell.
+- Address normalization that splits a single-line address into structured street/city/state/zip columns — **rejected**. Output stays in one cell.
+- Phone normalization that splits phone + extension into two columns — **rejected**. Extension goes inline as `;ext=123` (RFC 3966 syntax).
+
+If you want structured output, that's a different script (a parser, not a standardizer).
+
+### 0.3 Boundary with neighboring scripts
+
+| If the cell is... | Owner | 03's behavior |
+|---|---|---|
+| Empty string | 04 (missing values) | Pass through unchanged. Don't decide if it means "missing." |
+| Whitespace-only | 02 (text cleaner) | Should already be empty by the time 03 sees it. If not (CLI user skipped 02), trim defensively. |
+| Statistically extreme but format-valid (date in year 1700, phone with 10 zeros) | 06 (outliers) | Format-normalize anyway. Don't flag unusual values. |
+| Format-invalid (Feb 30, missing @, letters in numeric) | 03 | Emit error sentinel `<error: <reason>>`. |
+| Already correctly formatted | 03 | Pass through. Idempotency required. |
+
+---
+
+## 1. Default configuration
+
+Tests assume the defaults below. Per-flag deviations are called out per case.
+
+| Setting | Default | Notes |
+|---|---|---|
+| `--date-format` | ISO 8601 | `YYYY-MM-DD` for dates, `YYYY-MM-DDTHH:MM:SS[+ZZ:ZZ]` for datetimes |
+| `--locale` | auto-detect | Per-column. Falls back to error if column has no disambiguating value |
+| `--two-digit-year-cutoff` | 69 | Python default: years 00-68 → 2000-2068, 69-99 → 1969-1999 |
+| `--phone-format` | E.164 | `+<country><digits>`, extensions via `;ext=` |
+| `--default-country` | US | Used for phones with no country code |
+| `--gmail-canonical` | off | Strip Gmail dots and +tags. Destructive, opt-in |
+| `--expand-abbrev` | off | Expand St → Street etc. USPS abbreviation is the default |
+| `--name-conservative` | on | Title-case only ALL CAPS or all-lowercase input |
+| `--currency-locale` | auto-detect | Per-column. Same fallback as date locale |
+| `--error-policy` | sentinel | Errors written as `<error: reason>`. Alternative: raise, skip-row |
+| `--columns` | all | All text columns processed; `--columns date,phone` restricts |
+
+**Idempotency requirement**: `format(format(x)) == format(x)` for every cell. Already-clean input passes through unchanged.
+
+---
+
+## 2. Test corpus index
+
+| File | Domain | Cases | Expected outputs |
+|---|---|---|---|
+| `24_format_dates.csv` | Dates | 45 | Single column |
+| `25_format_phones.csv` | Phones | 31 | Single column |
+| `26_format_emails.csv` | Emails | 31 | Two columns (default + gmail-canonical) |
+| `27_format_addresses.csv` | Addresses | 31 | Two columns (default + expand-abbrev) |
+| `28_format_names.csv` | Names | 34 | Single column |
+| `29_format_currencies.csv` | Currencies | 27 | Single column |
+| `30_format_integration.csv` | Cross-domain | 5 | Multi-column (full row) |
+
+All input fixtures share the schema `case_id, category, description, input` (except integration, which has the full multi-column shape). Expected output files key by `case_id` for diff-by-join testing.
+
+---
+
+## 3. DATES (`24_format_dates.csv`)
+
+### 3.1 Use cases by buyer persona
+
+- **Shopify**: Order export dates joined against manual entries that used a different format. Bookkeeping reports needing consistent date format for sorting.
+- **Bookkeeper**: Bank export reconciliation across multiple banks, each using its own date convention. Tax reports requiring consistent year-month grouping.
+- **Freelancer**: Client data dumps where the date column is in whatever format the client's locale or software produces.
+- **Marketing agency**: Campaign performance data joined across platforms (Google Ads, Facebook Ads, Mailchimp) that all use different date formats.
+
+### 3.2 Test categories
+
+| Category | Cases | What it tests |
+|---|---|---|
+| iso | FD01-FD06 | ISO 8601 baseline. Already-clean and minor variants (Z vs offset, T vs space) |
+| us | FD07-FD10 | M/D/Y format with 2-digit and 4-digit years. Includes one unambiguous case (day > 12) |
+| eu | FD11-FD15 | D/M/Y format with various separators. Includes one unambiguous case |
+| longform | FD16-FD21 | Month-name formats (full, abbreviated, with weekday, all caps) |
+| excel | FD22-FD23 | Excel serial numbers (45306 = 2024-01-15). Critical: Excel CSV exports often have date columns leak through as numbers |
+| unix | FD24-FD25 | Unix timestamps in seconds and milliseconds |
+| partial | FD26-FD29 | Year-month, quarter, year-only. Coarser-than-day precision |
+| edge | FD30-FD40 | Two-digit year ambiguity, leap day validity, Excel 1900 leap year bug, invalid dates, dates buried in other text |
+| locale | FD41-FD42 | French and German month names |
+| timezone | FD43-FD44 | Named time zones, DST transitions |
+| padding | FD45 | Already-clean idempotency check |
+
+### 3.3 Critical policy decisions
+
+**Locale ambiguity (M/D/Y vs D/M/Y)**: Per-column inspection. The cleaner scans all values in the column; if any value has day > 12, locale is unambiguously D/M/Y; if any has month > 12 (impossible in M/D/Y), locale is unambiguously D/M/Y. If nothing disambiguates, error out and require `--locale us|eu`. **Do not silently guess.** Fixture row FD13 (`15/01/2024`) is ambiguous in isolation; FD14 (`30/05/2024`) makes the column unambiguously D/M/Y; in a real column containing both, FD13 resolves to `2024-01-15`.
+
+**Two-digit year cutoff**: Python's default of 69 (years 00-68 → 2000s, 69-99 → 1969-1999). FD30 is `1/15/69` and resolves to `1969-01-15`. This is opinionated and frequently wrong for birth-year columns. Document the flag clearly; the buyer cleaning customer DOB data needs to override.
+
+**Excel serial dates** (FD22, FD23): Detection heuristic — column header contains "date", or all values are integers/floats in range 25569–73050 (Jan 1 1970 to Jan 1 2099 in Excel serial). Outside that heuristic the cleaner can't distinguish a date serial from any other number.
+
+**Excel 1900 leap year bug** (FD33): Excel claims 1900-02-29 exists; it doesn't. Detect and emit error. Don't silently accept and roll over to March 1.
+
+**Localized month names** (FD41, FD42): Default cleaner ships with English month names. French/German/Spanish/etc. require a locale dictionary. Either ship one (adds size) or document the limitation. **Recommendation**: ship English + opt-in `--month-locale=fr|de|es` for the others. This corpus tests as if French and German are supported.
+
+**Time zones** (FD43, FD44): Named zones (EST, PST) resolve to fixed offsets, NOT dynamically interpreted with DST rules. EST → -05:00 always. If buyers need DST-aware handling, that's a 04-bundle (out of scope) or an opt-in pyzoneinfo flag.
+
+### 3.4 Edge case: dates buried in text (FD36, FD37)
+
+`Date: 2024-01-15` and `2024-01-15 (verified)` extract to `2024-01-15`. The cleaner uses regex extraction for date-shaped substrings before parsing. **Risk**: false positives from random number sequences. Mitigation: require an unambiguous date pattern (4-digit year + valid month + valid day with explicit separator).
+
+### 3.5 What's not tested
+
+- Calendar systems other than Gregorian (Hijri, Hebrew, Japanese era). Out of scope.
+- Recurring date strings (`every 1st of month`). Not a date.
+- Date ranges (`2024-01-01 to 2024-01-15`). Out of scope; would require a different cell semantic.
+- Sub-millisecond precision. Pandas/datetime tolerate but aren't tested here.
+
+---
+
+## 4. PHONES (`25_format_phones.csv`)
+
+### 4.1 Use cases by buyer persona
+
+- **Shopify**: Customer phone list normalization before Klaviyo/Mailchimp import. SMS campaigns require E.164.
+- **Bookkeeper**: Vendor phone deduplication where same vendor has multiple format variants in QuickBooks vs. spreadsheets.
+- **Freelancer**: Lead lists from clients in arbitrary formats.
+- **Marketing agency**: Multi-platform audience reconciliation; ad platforms increasingly require E.164 for matching.
+
+### 4.2 Test categories
+
+| Category | Cases | What it tests |
+|---|---|---|
+| us | FP01-FP08 | Common US format variants — plain digits, parens-dash, dots, spaces, country code prefixes |
+| ext | FP09-FP11 | Extensions in three syntactic forms (`ext`, `x`, `#`) |
+| vanity | FP12-FP13 | Letter-to-digit conversion (1-800-FLOWERS) |
+| intl | FP14-FP19 | UK, Germany, France, Japan, Australia |
+| e164 | FP20 | Already-E.164 idempotency |
+| edge | FP21-FP31 | Insufficient/excess digits, placeholders, multiple numbers per cell, NBSP, smart-quote contamination |
+
+### 4.3 Critical policy decisions
+
+**Default output: E.164** (`+<country><digits>`). Universal storage format. Reverses cleanly to any presentation format if the buyer wants display formatting later.
+
+**Default country**: US, configurable via `--default-country=GB|DE|...`. For mixed-country columns, cleaner needs explicit country detection per-row, which is hard without context. Real-world advice for the buyer: split phone columns by country before normalizing.
+
+**Vanity numbers** (FP12, FP13): Letters convert via standard phone keypad: 2=ABC, 3=DEF, ..., 9=WXYZ. `FLOWERS` → `3569377`. Loses some information (you can't reverse 3569377 to FLOWERS). Acceptable tradeoff for storage normalization.
+
+**Trunk prefix dropping**: UK domestic format `020 7946 0958` (FP15) has a leading `0` that's a domestic trunk prefix, not part of the actual number. E.164 strips it: `+442079460958`. Same logic for other countries with trunk prefixes.
+
+**Placeholders** (FP23, FP24): All-zeros `000-000-0000` and all-nines `999-999-9999` are conventional "no phone" sentinels in some CRMs. Emit error rather than silently producing a syntactically valid E.164 that's semantically meaningless. **Tradeoff**: a real number that happens to be `999-999-9999` (which doesn't exist in NANP, by the way; 999 is reserved) would error too. Acceptable.
+
+**Multiple numbers** (FP25): Cell containing `555-123-4567 / 555-987-6543`. Don't silently pick one; emit error and tell the user to split first. Splitting is a structural change, not a format change, so it belongs upstream of 03.
+
+**NBSP and smart-quote contamination** (FP27, FP31): Should not reach 03 if 02 ran first. Defensive cleanup is fine; emit a debug log noting the upstream pollution.
+
+### 4.4 What's not tested
+
+- SMS-vs-voice number distinction.
+- Carrier lookup. Out of scope; would require a paid service.
+- Number portability validation.
+- Toll-free number recognition (888, 877, 866, 855, 844, 833) beyond accepting them as valid digits.
+
+---
+
+## 5. EMAILS (`26_format_emails.csv`) — see Section 0.1 for scope caveat
+
+### 5.1 Use cases by buyer persona
+
+- **Shopify**: Customer list cleanup before email-marketing platform import (every duplicate costs money on per-contact pricing). Pre-flight check on order export before re-engagement campaigns.
+- **Bookkeeper**: Vendor email list consolidation.
+- **Freelancer**: Client communication list normalization.
+- **Marketing agency**: List hygiene across multiple lead sources before campaign send.
+
+### 5.2 Test categories
+
+| Category | Cases | What it tests |
+|---|---|---|
+| basic | FE01-FE04 | Plain ASCII, mixed case, whitespace |
+| displayname | FE05-FE07 | RFC display-name forms `Name <email>`, with and without quotes |
+| prefix | FE08-FE09 | mailto: prefix |
+| gmail | FE10-FE14 | Gmail-specific dot-equivalence and +tag handling. Includes negative cases (non-Gmail domains) that must NOT be touched |
+| idn | FE15-FE16 | Internationalized domain names; Unicode in local part |
+| trailing | FE17-FE20 | Punctuation contamination from copy-paste contexts |
+| smartquote | FE21 | Word-paste damage |
+| invalid | FE22-FE26 | Missing @, double @, multiple @, internal whitespace, no TLD |
+| multiple | FE27-FE28 | Multiple emails in one cell |
+| edge | FE29-FE31 | Empty, whitespace-only, already-perfect |
+
+### 5.3 Critical policy decisions
+
+**Default behavior**: lowercase, trim, strip `mailto:`, strip wrapping `<>`, extract from `Display Name <email>` form. **Does NOT strip Gmail dots or +tags by default.** Those normalizations are destructive (`alice` and `a.l.i.c.e` aren't the same email per RFC; only Gmail's specific provider policy treats them as equivalent).
+
+**Aggressive mode (`--gmail-canonical`)**: Strip dots and +tags for `@gmail.com` only. Preserve them for all other domains, even if those domains have similar policies (some custom Google Workspace domains, some other providers). Don't second-guess provider policy.
+
+**FE13 and FE14 are critical negative tests**: a non-Gmail domain with dots or +tag must NOT be touched even in `--gmail-canonical` mode. Many cleaners get this wrong — they apply Gmail's policy to all domains, which corrupts data.
+
+**IDN handling** (FE15, FE16): Don't punycode-convert by default. Buyers who need ASCII-only output for legacy systems can opt in via `--punycode`. Default is to preserve Unicode in domain and local parts.
+
+**Display-name extraction** (FE05, FE06): Drop the display name. The cleaner extracts the email and discards `Alice Smith`. **Tradeoff**: information loss. Alternative would be to preserve display name in a separate column, but that violates schema preservation (Section 0.2). Buyers who want to keep display names should split the column upstream.
+
+**Multiple emails per cell** (FE27, FE28): Error, don't pick one. Same rationale as multiple phones.
+
+### 5.4 What's not tested
+
+- Email syntax validation per full RFC 5321/5322 (which permits all sorts of legitimately weird inputs like quoted-string locals). The cleaner uses a "good enough for 99% of real data" regex, not a full RFC parser.
+- Disposable-email-domain detection. Out of scope for format cleaning; that's data quality.
+- DNS / MX validation. Out of scope; requires network access.
+- Email-address-as-username (where domain is a hostname not an internet domain). Errors as TLD-less.
+
+---
+
+## 6. ADDRESSES (`27_format_addresses.csv`)
+
+### 6.1 Use cases by buyer persona
+
+- **Shopify**: Customer address normalization for shipping label generation; reduces failed deliveries.
+- **Bookkeeper**: Vendor master record cleanup; consistent format for bookkeeping software import.
+- **Freelancer**: Client address book consolidation.
+- **Marketing agency**: Direct mail audience cleanup.
+
+### 6.2 Test categories
+
+| Category | Cases | What it tests |
+|---|---|---|
+| clean | FA01 | Already-USPS-formatted idempotency |
+| case | FA02-FA04 | All-caps, all-lowercase, mixed-case (preserve) |
+| abbrev | FA05-FA08 | Street type expansion/abbreviation, periods after abbreviations |
+| directional | FA09-FA11 | North/N, NORTH/N, NE compounds |
+| unit | FA12-FA14 | Apartment/Apt, # / Apt, Suite/Ste |
+| state | FA15-FA16 | State name → 2-letter code |
+| zip | FA17-FA18 | ZIP+4, leading-zero ZIPs (Massachusetts 02xxx) |
+| multiline | FA19 | `\n`-separated address fields |
+| pobox | FA20-FA22 | Post Office Box variants |
+| housenum | FA23-FA25 | Letter suffix, hyphen, half-number |
+| non_us | FA26-FA28 | UK, Canada, Japan (minimal handling) |
+| edge | FA29-FA31 | Empty, partial, trailing comma |
+
+### 6.3 Critical policy decisions
+
+**US-first scope**: USPS abbreviations and state codes are the default. International addresses get whitespace + capitalization only. Document this clearly; buyers with significant non-US data should expect format drift.
+
+**USPS abbreviations as the default** (St, Ave, Blvd) rather than spelled-out forms. Reasoning: USPS recommends abbreviations; most CRMs expect them; they save space in tabular display. The `--expand-abbrev` flag inverts this for buyers whose downstream system requires full forms.
+
+**Multi-line collapse** (FA19): `123 Main St\nApt 4B\nNew York, NY 10001` becomes `123 Main St, Apt 4B, New York, NY 10001`. Consistent comma-separated single-line format. **Reverse direction not supported** — the cleaner doesn't take a single-line address and split into multi-line (that's structural).
+
+**State expansion vs abbreviation** (FA15, FA16): Default is 2-letter code (`NY`). The `--expand-abbrev` flag expands to full state name. Note: this is the OPPOSITE direction from street type abbreviations. State codes are universally expected in tabular data; full state names are only preferred in some downstream systems' "pretty" formats.
+
+**ZIP leading zeros** (FA18): If the column is already a ZIP-shaped string with leading zeros, preserve them. **Cannot restore lost leading zeros** — Excel-stripped `2101` (Massachusetts) cannot be confidently recovered to `02101` because `2101` could legitimately be `2101` (Idaho). Mention this as a known limitation; recommend the buyer fix at the source.
+
+**Canada handling** (FA27): Canadian addresses use the same street-type conventions as US, so `St` → `St` works. Postal code format is preserved as-is.
+
+**Japan / non-Western** (FA28): Field order is reversed (postal code first, then large-to-small geography). Default cleaner doesn't try to restructure; minimal handling only.
+
+### 6.4 What's not tested
+
+- Address verification against USPS database. Out of scope; would require a paid service or local USPS data.
+- Geocoding to lat/long. Out of scope.
+- Unit number parsing for buildings with non-standard nomenclatures.
+- Military addresses (APO, FPO, DPO) beyond accepting them.
+- Rural Route, Highway Contract, General Delivery formats.
+
+---
+
+## 7. NAMES (`28_format_names.csv`)
+
+### 7.1 Use cases by buyer persona
+
+- **Shopify**: Customer list display normalization. ALL-CAPS imports from older systems become readable.
+- **Bookkeeper**: Vendor name consistency across QuickBooks and spreadsheets.
+- **Freelancer**: Client list capitalization cleanup.
+- **Marketing agency**: First-name personalization in email campaigns (`Hi alice` vs `Hi Alice`).
+
+### 7.2 Test categories
+
+| Category | Cases | What it tests |
+|---|---|---|
+| case | FN01-FN04 | All-caps, all-lowercase, already-correct, random-case |
+| scots | FN05-FN08 | Mc and Mac prefixes |
+| irish | FN09-FN11 | O' prefix |
+| hyphen | FN12-FN13 | Hyphenated names |
+| particle | FN14-FN17 | von, van, de, da (Germanic, Dutch, French, Italian) |
+| title | FN18-FN20 | Mr, Dr, Prof |
+| suffix | FN21-FN23 | Jr, III, PhD |
+| comma | FN24-FN26 | "Last, First" reversal to "First Last" |
+| initial | FN27-FN28 | Middle initial, multi-initial |
+| nonlatin | FN29-FN31 | Korean, Japanese, Russian (preserve) |
+| edge | FN32-FN34 | Single name, empty, whitespace-only |
+
+### 7.3 Critical policy decisions
+
+**Conservative by default**: Title-case ONLY when input is ALL CAPS or all lowercase. Mixed-case input is preserved as-is (FN04: `aLiCe SmItH` → `aLiCe SmItH`). Reasoning: people have idiosyncratic spellings (`danah boyd`, `bell hooks`) that the cleaner should never overwrite. If the buyer wants aggressive title-casing, that's `--name-aggressive`.
+
+**Mc vs Mac** (FN05-FN08): Default convention is `McDonald` (cap after Mc) and `MacDonald` (cap after Mac). Some Mac-prefixed names should be `Macdonald` (cap only on Mac). Without a names dictionary, the cleaner can't distinguish. Default to capitalizing — produces `MacDonald` for ambiguous cases. Buyers with significant Scottish/Irish customer bases may need a custom override list.
+
+**Particles** (FN14-FN17): Particles like `von`, `van`, `de`, `da` stay lowercase. This is the convention for people with surnames containing these words (`Vincent van Gogh`, `Charles de Gaulle`). **Note**: at the start of a sentence or in last-name-first contexts (`De Gaulle, Charles`), capitalization rules invert. This corpus tests the natural-order case only.
+
+**Comma format reversal** (FN24-FN26): `Smith, John` → `John Smith`. **Tradeoff**: irreversibly destroys the comma-format. If the buyer's downstream system expects "Last, First" format, they need `--name-format=last-first`. Default is natural reading order.
+
+**Titles and suffixes**:
+- Title period stripping: `Mr.` → `Mr`. Some style guides keep the period; this corpus drops it for consistency. `--keep-title-periods` flag if buyers prefer.
+- Roman numerals (`II`, `III`, `IV`) stay all-caps. They aren't names; they're numerals.
+- `PhD`, `MD`, `Esq` keep their conventional case. Don't lower-case them.
+
+**Non-Latin scripts** (FN29-FN31): Pass through unchanged. Title-casing rules don't apply to scripts without case (Korean, Japanese, Chinese, Arabic, Hebrew, etc.). Cyrillic does have case but the conservative-by-default rule applies — only ALL CAPS gets title-cased.
+
+**Single names** (FN32): Madonna, Cher, Pelé. Pass through unchanged when input is already title-case.
+
+### 7.4 What's not tested
+
+- Honorific stacking (`Dr. Mr. Jane Smith` — pathological, rare, hard).
+- Cultural name-order detection (East Asian family-first vs Western given-first). Without a column-level signal the cleaner can't guess.
+- Nickname expansion (`Bob` → `Robert`). Out of scope; that's data enrichment, not standardization.
+- Name part identification (which token is given, family, middle). Belongs to a parser, not a standardizer.
+
+---
+
+## 8. CURRENCIES (`29_format_currencies.csv`)
+
+### 8.1 Use cases by buyer persona
+
+- **Shopify**: Order amount normalization across multi-currency stores.
+- **Bookkeeper**: Bank export reconciliation; mixed bank formats produce different currency representations.
+- **Freelancer**: Invoice data normalization.
+- **Marketing agency**: Campaign spend normalization across ad platforms.
+
+### 8.2 Test categories
+
+| Category | Cases | What it tests |
+|---|---|---|
+| us | FC01-FC07 | $ prefix/suffix, comma thousands, dot decimal, USD code prefix/suffix |
+| eu | FC08-FC11 | € prefix, dot thousands and comma decimal, space thousands, Swiss apostrophe |
+| intl | FC12-FC14 | £, ¥ (no decimal), ₹ (lakhs grouping) |
+| negative | FC15-FC17 | Leading minus, accounting parens, sign after symbol |
+| edge | FC18-FC25 | Zero, scientific, percentage, range, word values, empty, idempotency |
+| ambig | FC26-FC27 | Locale-ambiguous separator (`1,234` could be 1234 or 1.234) |
+
+### 8.3 Critical policy decisions
+
+**Output format**: `<symbol_or_code><normalized_number>`. Number uses dot decimal, no thousand separators, leading minus for negative. Currency symbol or code preserved if present in input; if no currency indicator, output is just the number.
+
+**Locale ambiguity** (FC26, FC27): `1,234` is `1234` in US English and `1.234` in German. `1.234` is `1.234` in US English and `1234` in German. Per-column inspection: any value with both `,` and `.` (like `1,234.56`) locks the locale unambiguously; otherwise the cleaner errors and demands `--currency-locale=us|eu`. **Do not silently guess.**
+
+**Accounting parens** (FC16): `($100.00)` → `-$100.00`. Standard accounting convention. The leading minus is more universally readable than the parens.
+
+**Currency symbol position**: Preserved. `$100` stays prefix-symbol; `100$` (rare but seen) stays suffix-symbol; `100 USD` keeps the suffix-code form. Reasoning: changing position is destructive and the buyer can do it themselves with a simple find-replace if they want.
+
+**Indian lakhs grouping** (FC14): `₹1,23,456.78` flattens to `₹123456.78`. Lakhs grouping (groups of 2 after the first 3) is unusual outside India and breaks downstream tools that expect Western thousand-grouping.
+
+**JPY no decimal** (FC13): Japanese yen conventionally has no fractional part. `¥1,234` → `¥1234`. The cleaner doesn't add a decimal that wasn't there.
+
+**Scientific notation** (FC19): `1.5e6` → `1500000`. Expand to plain notation for spreadsheet compatibility. Loses the "this was scientific" information; acceptable tradeoff.
+
+**Percentages** (FC20): Error. Percentage and currency are different domains. If the column is meant for percentages, that's not currency.
+
+**Ranges** (FC21): Error. Same reasoning as multi-emails; structural split needed.
+
+**Word values** (FC22, FC23): `Free`, `TBD`, `N/A`. Error. The buyer might want these mapped to `0` (Free) or empty (TBD/N/A), but those are domain decisions the cleaner can't make safely.
+
+### 8.4 What's not tested
+
+- Cross-currency conversion (USD to EUR via exchange rate). Massively out of scope.
+- Cryptocurrency formats (BTC, ETH amounts with high decimal precision). Out of scope.
+- Historical currency notation (pre-decimalization £.s.d). Out of scope.
+- Currency code standardization (USD vs US$ vs $US). Default: pass through whatever's there.
+
+---
+
+## 9. INTEGRATION (`30_format_integration.csv`)
+
+### 9.1 Purpose
+
+Five rows, each a complete record with one or more format issues across multiple columns. Tests that running 03 across multiple columns in one pass produces consistent output and doesn't drop or scramble fields.
+
+### 9.2 Per-row test goals
+
+| Row | What it tests |
+|---|---|
+| FI01 | Standard messy-but-cleanable record. All six format types in one row. Tests that no domain's normalizer interferes with another's. |
+| FI02 | International record (UK address, EUR currency, German-format date, mailto-prefixed Gmail address, comma-format Mc-name). Tests cross-domain locale handling. |
+| FI03 | Errors (insufficient phone digits) and complex name (DR + JANE DOE + PHD title+name+suffix). Tests error handling and complex name parsing. |
+| FI04 | All empty. Tests that empty cells pass through without errors. |
+| FI05 | Already-clean record. Idempotency check — the entire row should round-trip unchanged. |
+
+### 9.3 What this fixture catches that single-domain fixtures don't
+
+- **Cross-column interference**: a name normalizer that reaches into the email column, or vice versa.
+- **Schema drift**: a normalizer that adds, removes, or reorders columns.
+- **Error-handling consistency**: when one column errors (FI03's phone), other columns in the same row still process correctly.
+- **Idempotency at the row level**: FI05 must produce byte-identical output.
+
+---
+
+## 10. Suggested test workflow
+
+```python
+import csv
+from pathlib import Path
+from src.core.format_standardizer import standardize  # your impl
+
+FORMATS = Path("test_data/formats")
+EXPECTED = Path("expected/formats")
+
+def test_single_column_domain(domain):
+    """Test FD/FP/FE/FA/FN/FC fixtures with single-column expected output."""
+    inp = FORMATS / f"{domain}.csv"
+    exp = EXPECTED / f"{domain}_expected.csv"
+
+    with inp.open() as f:
+        cases = {r["case_id"]: r for r in csv.DictReader(f)}
+    with exp.open() as f:
+        expected = {r["case_id"]: r for r in csv.DictReader(f)}
+
+    failures = []
+    for case_id, case in cases.items():
+        got = standardize(case["input"], domain=domain.split("_")[1])
+        want = expected[case_id]["output"]
+        if got != want:
+            failures.append((case_id, case["input"], got, want))
+    return failures
+
+# Test each domain
+for domain in ["24_format_dates", "25_format_phones", "28_format_names",
+               "29_format_currencies"]:
+    failures = test_single_column_domain(domain)
+    print(f"{domain}: {len(failures)} failures")
+
+# Email and address have two-policy expected output
+def test_two_policy(domain, policy_columns):
+    inp = FORMATS / f"{domain}.csv"
+    exp = EXPECTED / f"{domain}_expected.csv"
+    with inp.open() as f:
+        cases = {r["case_id"]: r for r in csv.DictReader(f)}
+    with exp.open() as f:
+        expected = {r["case_id"]: r for r in csv.DictReader(f)}
+
+    for policy in policy_columns:
+        failures = []
+        for case_id, case in cases.items():
+            got = standardize(case["input"], domain=domain.split("_")[1],
+                              mode=policy)
+            want = expected[case_id][f"output_{policy}"]
+            if got != want:
+                failures.append((case_id, case["input"], got, want))
+        print(f"{domain} ({policy}): {len(failures)} failures")
+
+test_two_policy("26_format_emails", ["default", "gmail_canonical"])
+test_two_policy("27_format_addresses", ["default", "expand_abbrev"])
+
+# Idempotency property test
+import random
+all_inputs = []
+for domain in ["24_format_dates", "25_format_phones", "26_format_emails",
+               "27_format_addresses", "28_format_names", "29_format_currencies"]:
+    with (FORMATS / f"{domain}.csv").open() as f:
+        all_inputs.extend((domain, r["input"]) for r in csv.DictReader(f))
+
+for domain, inp in all_inputs:
+    once = standardize(inp, domain=domain.split("_")[1])
+    twice = standardize(once, domain=domain.split("_")[1])
+    assert once == twice, f"non-idempotent: {domain} {inp!r} -> {once!r} -> {twice!r}"
+```
+
+---
+
+## 11. What this corpus does NOT cover
+
+Listed so the gaps are explicit:
+
+1. **Performance**. All fixtures are small. Format standardization on a 500MB customer file may have memory or speed issues; benchmark separately.
+2. **Cross-script integration with 02 and 04**. This corpus tests 03 in isolation. Running 02 → 03 → 04 in pipeline is a separate integration concern.
+3. **GUI behavior**. Single-cell preview, per-row preview, domain auto-detection from column headers. Each is a Streamlit-layer test, not a transformation test.
+4. **Custom locale dictionaries**. The fixtures assume the cleaner ships with English month names and US-default phone country. Customers who buy this product and then complain that German months aren't recognized are flagging a feature request, not a bug.
+5. **URLs**. Listed in BUSINESS.md's adjacent territory but not in 03's scope. If you want URL standardization, that's a feature request.
+6. **Booleans / yes-no normalization**. `Y` / `Yes` / `1` / `True` → `true`. Borderline 03 territory but explicitly excluded; can be added as a 7th domain if buyers ask for it.
+7. **Postal codes outside US/UK/Canada**. ZIP-style validation only for US.
+8. **Identifiers (SKU, SSN, EIN)**. Out of scope; too domain-specific.
+
+---
+
+## 12. How to extend the corpus
+
+**Add a new test case in an existing domain**:
+1. Edit the relevant fixture's row list in `generate_format_test_files.py`.
+2. Add the corresponding expected output entry.
+3. Re-run the generator.
+4. If the new case is a category not yet listed, update the per-domain category table in this document.
+
+**Add a new domain (e.g., URLs)**:
+1. Define use cases by persona.
+2. Define policy decisions and which require a flag vs. being default.
+3. Build the input fixture as `31_format_<domain>.csv` and the expected output as `31_format_<domain>_expected.csv`.
+4. Add a Section 13 to this document covering the domain.
+5. Update the index table in Section 2.
+
+**Add a new policy variant to an existing domain**:
+1. Add a new column to the expected output file (e.g., `output_strict`).
+2. Document the new policy and what triggers it (which flag) in the domain's Section 5.3 (or equivalent).
+3. The two-policy test in Section 10's workflow generalizes to N-policy.