Michael c349a90e18 test: add text-cleaner corpus and close gaps surfaced by it

The 21-fixture corpus (test-cases/text-cleaner-corpus/) exercises the cleaner
end-to-end against the spec in TEST-CASES.md. Closing the failing cases drove
five small cleaner fixes plus two fixture-generation fixes:

- _SMART_CHARS: add prime, double prime, guillemets (case 03)
- _ZERO_WIDTH: add soft hyphen U+00AD (case 05)
- clean_dataframe: clean column headers via the same pipeline (cases 16/19/20),
  with a clean_headers toggle on CleanOptions
- smart_title_case: title-case full-shout strings ("ALICE SMITH" -> "Alice
  Smith") while still preserving embedded acronyms; preserve uppercase after
  apostrophe in names ("O'CONNOR" -> "O'Connor", "o'neil" -> "O'neil")
- test_corpus.py reader: pre-strip NUL bytes (C parser truncates at NUL,
  python engine is too strict about embedded literal "), per spec case 06
- generate_test_data.py: properly CSV-escape literal-quote cells in case 03
  expected; quote the rogue-comma price field in case 17 input

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-29 15:37:35 +00:00

TEST-CASES.md - `02_text_cleaner.py` Test Corpus

Version: 1.0
Last updated: April 29, 2026
Companion to: TECHNICAL.md Section 9 (script boundaries) and the per-script functional spec template introduced in TECHNICAL.md Section 10.1.

Purpose of this document

Defines the complete set of behaviors 02_text_cleaner.py is expected to exhibit, with one test fixture per behavior. Used as:

The build target when porting the (currently skeleton) script to working state.
The pytest input set once the script ships.
The acceptance criteria for the GUI port (every fixture must produce its expected output through both CLI and Streamlit GUI).

Each test case has an input file in test_data/ and (where exact-diff comparison applies) an expected-output file in expected/.

1. Scope boundary (what 02 owns vs what it doesn't)

This is the load-bearing decision. Every contested case routes back to it.

02 owns: character-level hygiene only.

Whitespace normalization (outer trim + internal collapse for text columns).
Unicode normalization (NFC by default, NFKC opt-in).
Smart-punctuation ASCII-fication (curly quotes, em/en dash, ellipsis, primes).
Invisible / zero-width character stripping.
Control character stripping (with explicit allowlist for tab/newline inside quoted cells).
BOM detection on input, never written on output.
Line-ending normalization at the file level AND inside multi-line cells.
Optional case operations (per-column, opt-in only).

02 does NOT own:

| Concern | Owned by |
| --- | --- |
| Detecting and replacing nulls / sentinel codes | `04_missing_value_handler` |
| Reformatting dates, currencies, phones, names, addresses | `03_format_standardizer` |
| Outlier detection or domain-rule violations | `06_outlier_detector` |
| Renaming or reordering columns | `05_column_mapper_enforcer` |
| Deduplication (even though dedup normalizes internally) | `01_deduplicator` |
| File encoding detection on read | The shared I/O layer in `src/core/io.py` |

Invariant 02 must preserve: after running 02, the schema (column count, column order, row count) is unchanged. 02 changes cell content, never structure. The one nuance: a cell containing only whitespace becomes an empty string, but the cell still exists and the row is not dropped.

2. Default configuration assumed by these tests

Tests assume the default config below. Any test that exercises a non-default flag explicitly says so in its description.

| Setting | Default | Notes |
| --- | --- | --- |
| `--trim` | on | Strip leading/trailing whitespace including Unicode whitespace (NBSP, NNBSP, ideographic space, etc.) |
| `--collapse-internal` | on (text columns only) | Collapse runs of internal whitespace to a single ASCII space, ONLY in cells that don't parse as numeric, date, or phone-shaped |
| `--unicode-form` | NFC | NFKC available as opt-in; folds ligatures and fullwidth |
| `--smart-quotes` | on | Curly to straight, em/en dash to hyphen, ellipsis to `...`, primes to `'`/`"` |
| `--strip-zero-width` | on | ZWSP, ZWJ, ZWNJ, LRM, RLM, soft hyphen, word joiner |
| `--strip-controls` | on | Strip C0 (except `\t\n\r` inside quoted cells) and DEL |
| `--strip-bom` | on | BOM removed on read; never written on output |
| `--line-endings` | LF | File-level AND embedded-cell line endings normalized to LF |
| `--case` | none | Case operations are opt-in per column |
| `--fix-mojibake` | off | Logged as warning by default; opt-in repair via ftfy |
| `--columns` | all | All text columns processed; `--columns name,email` restricts |

Idempotency requirement: for any input X, clean(clean(X)) == clean(X). This is a property test, not a fixture-comparison test. Every fixture below should be run through the cleaner twice and produce identical output both times.

3. Test case index

| # | File | Category | What it tests | Diff-testable |
| --- | --- | --- | --- | --- |
| 01 | `01_whitespace_basic.csv` | Whitespace | ASCII space + tab, leading/trailing/internal | Yes |
| 02 | `02_whitespace_unicode.csv` | Whitespace | NBSP, narrow NBSP, ideographic, em/thin space | Yes |
| 03 | `03_smart_punctuation.csv` | Punctuation | Curly quotes, em/en dash, ellipsis, primes | Yes |
| 04 | `04_unicode_forms.csv` | Unicode | NFC vs NFD, ligatures, fullwidth, presentation forms | Yes |
| 05 | `05_zero_width_invisible.csv` | Invisible | ZWSP, ZWJ, ZWNJ, LRM, RLM, soft hyphen | Yes |
| 06 | `06_control_characters.csv` | Control | NUL, BEL, BS, VT, FF, ESC, DEL | Yes |
| 07 | `07_bom_utf8.csv` | Encoding | UTF-8 BOM at file start | Yes (byte-exact) |
| 08 | `08_line_endings_crlf.csv` | Line endings | All CRLF (Windows) | Yes (byte-exact) |
| 09 | `09_line_endings_cr.csv` | Line endings | All CR (classic Mac) | Yes (byte-exact) |
| 10 | `10_line_endings_mixed.csv` | Line endings | CRLF + LF + CR mixed in one file | Yes (byte-exact) |
| 11 | `11_embedded_newlines.csv` | Line endings | Newlines inside quoted cells (preserve, normalize) | Yes |
| 12 | `12_case_variations.csv` | Case | Mixed case across name/email/product columns | 3 outputs (default + 2 modes) |
| 13 | `13_non_latin_scripts.csv` | Preservation | Chinese, Japanese, Arabic, Russian, emoji | Yes |
| 14 | `14_mojibake.csv` | Encoding | Double-encoded UTF-8 (warn-by-default; fix opt-in) | 2 outputs (default + fixed) |
| 15 | `15_whitespace_only_cells.csv` | Boundary (vs 04) | Cells containing only whitespace become empty | Yes |
| 16 | `16_dirty_headers.csv` | Headers | Headers themselves have whitespace, BOM, smart quotes | Yes |
| 17 | `17_preserve_intended.csv` | Negative | Things 02 must NOT touch | Yes |
| 18 | `18_empty_file.csv` | Edge | Zero-byte file | Yes |
| 19 | `19_headers_only.csv` | Edge | Headers but no data rows | Yes |
| 20 | `20_kitchen_sink.csv` | Integration | Everything combined in one file | Yes |
| 21 | `21_excel_pollution.xlsx` | Excel-specific | Multi-sheet, Alt+Enter cells, force-text, copy-paste pollution | No (manual) |

4. Per-test details

01 - Whitespace basic

File: test_data/01_whitespace_basic.csv -> expected/01_whitespace_basic.csv

Tests the core whitespace contract on ASCII space and tab characters. Every kind of placement: leading-only, trailing-only, both, internal-multiple, tab-padded, multiple internal multi-space runs in one cell, all of the above combined.

Expected behavior:

Leading and trailing whitespace stripped from every cell.
Internal runs of whitespace collapsed to a single ASCII space.
Tabs treated as whitespace by both rules.
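The rules above can be sketched at the cell level as follows. This is a minimal illustration, not the cleaner's actual code: `clean_whitespace` is a hypothetical name, and the sketch ignores the numeric/date/phone exemption described in case 17.

```python
import re

# Hypothetical helper, for illustration only.
# In a str pattern, \s matches tab and other whitespace as well as space.
# Note it also matches LF/CR, so a real implementation that must keep
# multi-line cells intact (case 11) would need a narrower pattern.
_WS_RUN = re.compile(r"\s+")


def clean_whitespace(cell: str) -> str:
    """Trim outer whitespace, then collapse internal runs to one ASCII space."""
    return _WS_RUN.sub(" ", cell.strip())
```

Running the function on its own output returns the same string, which is the cell-level form of the idempotency requirement.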

Why it matters: This is the highest-frequency real-world pollution. Trailing-space pollution alone is what the v1.5 audit identified as the gap that motivated creating script 02 in the first place (DECISIONS.md v1.5 entry).

02 - Whitespace, Unicode

File: test_data/02_whitespace_unicode.csv -> expected/02_whitespace_unicode.csv

The whitespace pretenders. Python's str.strip() with no argument actually does strip these in 3.x, but a lot of cleaners written by people who were burned in 2.x explicitly pass ' \t\n' and miss them. Excel and Word produce these constantly when you copy from a styled document.

Characters covered: NBSP (U+00A0), narrow NBSP (U+202F), ideographic space (U+3000), em space (U+2003), thin space (U+2009).

Expected behavior: treated identically to ASCII space - trimmed at edges, collapsed internally.
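The difference between an explicit ASCII strip set and the argument-less form is easy to demonstrate with the standard library alone. The sample string is made up for illustration.

```python
# The five characters this fixture covers.
UNICODE_SPACES = "\u00a0\u202f\u3000\u2003\u2009"

# NBSP + em space before, narrow NBSP + ideographic space inside, thin space after.
padded = "\u00a0\u2003Alice\u202f\u3000Smith\u2009"

# An explicit ASCII-only strip set leaves every Unicode space in place.
ascii_only = padded.strip(" \t\n")

# Argument-less split() uses Unicode whitespace semantics, so joining the
# pieces with one ASCII space both trims the edges and collapses the middle.
cleaned = " ".join(padded.split())
```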

Why it matters: "It looks fine but the join doesn't match" debugging sessions almost always end here. NBSP-padded keys are the silent killer.

03 - Smart punctuation

File: test_data/03_smart_punctuation.csv -> expected/03_smart_punctuation.csv

Curly quotes, dashes, ellipsis, primes - the autocorrect-as-you-type damage from Word/Excel. ASCII-fy where round-trip-safe.

| Input | Output | Notes |
| --- | --- | --- |
| `\u201c` `\u201d` (curly double) | `"` | |
| `\u2018` `\u2019` (curly single) | `'` | Includes apostrophe |
| `\u2014` (em-dash) | `-` | |
| `\u2013` (en-dash) | `-` | |
| `\u2026` (ellipsis) | `...` | |
| `\u2032` (prime) | `'` | |
| `\u2033` (double prime) | `"` | |
| `\u00ab` `\u00bb` (guillemets) | `"` | |
| `\u00d7` (multiplication sign) | preserved | Not safely round-trip-able to ASCII; `x` would be wrong |
| `\u00b1` (plus-minus) | preserved | Same reasoning |

Why it matters: smart-quote pollution breaks regex, breaks downstream parsers, and breaks string equality joins. The two preservation cases (multiplication, plus-minus) are deliberate - they have no faithful ASCII equivalent and forcing one is destructive.
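One straightforward way to express the mapping is a `str.translate` table. The sketch below mirrors the table for this case; `SMART_MAP` and `asciify_punctuation` are hypothetical names, not the cleaner's real identifiers.

```python
# Hypothetical mapping, for illustration only. Characters that are absent
# from the table (multiplication sign, plus-minus) pass through unchanged.
SMART_MAP = str.maketrans({
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2018": "'", "\u2019": "'",   # curly single quotes / apostrophe
    "\u2014": "-", "\u2013": "-",   # em dash, en dash
    "\u2026": "...",                # ellipsis expands to three characters
    "\u2032": "'", "\u2033": '"',   # prime, double prime
    "\u00ab": '"', "\u00bb": '"',   # guillemets
})


def asciify_punctuation(cell: str) -> str:
    """Replace smart punctuation with its ASCII equivalent."""
    return cell.translate(SMART_MAP)
```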

04 - Unicode normalization forms

File: test_data/04_unicode_forms.csv -> expected/04_unicode_forms.csv

café can be encoded two ways:

NFC: caf\u00e9 (one code point, e-acute as a unit)
NFD: cafe\u0301 (two code points, plain e + combining accent)

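These render identically. They compare unequal. They have different lengths. macOS has historically favored decomposed forms (HFS+ stores filenames in a variant of NFD), so text that originates on a Mac can arrive decomposed, which means a CSV exported from a Mac and joined against a CSV from Excel can silently fail.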

Default normalization: NFC (most compact, what Excel emits, what most Western databases expect).

Cases covered:

Pre-composed (NFC) e-acute and i-diaeresis.
Decomposed (NFD) versions of the same.
The \uFB03 ffi ligature - preserved under NFC (NFKC would fold it to ffi).
Fullwidth Latin letters (\uFF21\uFF22\uFF23 = ＡＢＣ) - preserved under NFC.
Roman numeral nine character (\u2168) - preserved under NFC.

After cleaning, rows 1 and 2 must produce identical bytes (NFC and NFD both normalized to NFC). Same for rows 3 and 4.
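The mismatch and its fix can be reproduced with the standard library's `unicodedata` module. The variable names are illustrative.

```python
import unicodedata

nfc = "caf\u00e9"    # precomposed e-acute
nfd = "cafe\u0301"   # plain e + combining acute accent

equal_before = nfc == nfd          # the two spellings compare unequal
lengths = (len(nfc), len(nfd))     # and have different lengths

# After NFC normalization both spellings are the same string.
equal_after = unicodedata.normalize("NFC", nfc) == unicodedata.normalize("NFC", nfd)

# NFC leaves compatibility characters alone.
ligature = unicodedata.normalize("NFC", "\ufb03")
fullwidth = unicodedata.normalize("NFC", "\uff21\uff22\uff23")
```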

Why it matters: Mac-vs-Windows data joins. Catches "they look the same but won't match" bugs.

Opt-in --unicode-form=NFKC test: not provided as a fixture but should exist as a unit test. Under NFKC, ligature folds to ffi, fullwidth folds to ASCII ABC, roman numeral folds to IX. NFKC is destructive for some legitimate text (mathematical notation, some CJK content) so it stays opt-in.
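A unit test for the opt-in form could start from something like the sketch below, which exercises `unicodedata.normalize` directly rather than the cleaner. `fold_nfkc` is a hypothetical name.

```python
import unicodedata


def fold_nfkc(text: str) -> str:
    """Apply NFKC compatibility folding (illustrative wrapper)."""
    return unicodedata.normalize("NFKC", text)


folded_ligature = fold_nfkc("\ufb03")            # ffi ligature
folded_fullwidth = fold_nfkc("\uff21\uff22\uff23")  # fullwidth ABC
folded_roman = fold_nfkc("\u2168")               # roman numeral nine

# The destructive side: a superscript two loses its meaning as an exponent.
folded_square = fold_nfkc("x\u00b2")
```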

05 - Zero-width and invisible characters

File: test_data/05_zero_width_invisible.csv -> expected/05_zero_width_invisible.csv

These bytes show up from rich-text copy/paste, from RTL text, from accidentally-included U+FEFF in the middle of a cell (yes, this happens), and from some web-form pastes.

Characters covered: U+200B (ZWSP), U+200C (ZWNJ), U+200D (ZWJ), U+200E (LRM), U+200F (RLM), U+00AD (soft hyphen), U+2060 (word joiner).

Expected behavior: all stripped unconditionally. None of these has a legitimate role in tabular data cells, even when there's a domain reason for them in prose (typesetting Arabic, hyphenation hints in long-form text). For a CSV, they're noise.
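Unconditional stripping maps naturally onto a deletion table. This is a sketch under the assumption that the set is exactly the seven characters listed for this case; `ZERO_WIDTH` and `strip_zero_width` are hypothetical names.

```python
# ZWSP, ZWNJ, ZWJ, LRM, RLM, soft hyphen, word joiner.
ZERO_WIDTH = "\u200b\u200c\u200d\u200e\u200f\u00ad\u2060"

# The third argument of str.maketrans lists characters to delete.
_DELETE_ZERO_WIDTH = str.maketrans("", "", ZERO_WIDTH)


def strip_zero_width(cell: str) -> str:
    """Remove every zero-width / invisible character from the cell."""
    return cell.translate(_DELETE_ZERO_WIDTH)
```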

Why it matters: these are the truly invisible polluters. You can stare at the cell forever and not see them. They break joins, they bloat string lengths, they hash differently. The first time a buyer hits a zero-width-space in a customer name, this test is what saves them.

06 - Control characters

File: test_data/06_control_characters.csv -> expected/06_control_characters.csv

The C0 controls (U+0000 through U+001F) plus DEL (U+007F). Test cases: NUL, BEL, BS, VT, FF, ESC, DEL, and a multi-control combination.

Expected behavior: all stripped from cell content.

The exception: tab (U+0009), LF (U+000A), and CR (U+000D) are NOT stripped from inside quoted cells. Tab might be intentional formatting; LF/CR are handled by line-ending normalization (case 11). Outside of quoted cells, tab is whitespace and gets normalized like space.
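A character class that excludes the three allowed controls is one way to sketch this rule. The names are hypothetical, and the sketch does not distinguish quoted from unquoted cells; it simply leaves tab, LF, and CR for the whitespace and line-ending steps.

```python
import re

# C0 controls except tab (\x09), LF (\x0a) and CR (\x0d), plus DEL (\x7f).
_CONTROLS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def strip_controls(cell: str) -> str:
    """Delete control characters, keeping tab, LF and CR."""
    return _CONTROLS.sub("", cell)
```

Note that removing ESC from an ANSI sequence leaves its printable tail behind (for example `[0m`); whether to remove whole escape sequences is a separate policy question that this sketch does not settle.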

Why it matters: real-world exports from broken systems, half-corrupted database dumps, copy-paste from terminals (including ANSI escape sequences starting with ESC), and binary data accidentally exported as text all leave these in cells. A NUL byte mid-string breaks C-string-based parsers; a BEL makes terminals beep when you cat the file; ESC sequences corrupt logs.

07 - UTF-8 BOM

File: test_data/07_bom_utf8.csv -> expected/07_bom_utf8.csv (byte-exact comparison)

File starts with the three-byte sequence EF BB BF. Excel writes UTF-8 with BOM by default. Pandas read_csv usually handles this, but depending on the pandas version, the parser engine, and how the file is opened, the BOM can survive as part of the first column header name unless you pass encoding='utf-8-sig'. Result: a mystery column called \ufeffid that breaks every df["id"] lookup downstream.

Expected behavior:

BOM stripped on read.
First column header is the clean string id, not \ufeffid.
Output file is written WITHOUT a BOM.

Diff target: byte-for-byte equality with expected/07_bom_utf8.csv. The expected file must NOT have the BOM.
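The read and write halves of this contract can be shown with Python's built-in codecs. The byte string is a made-up stand-in for the fixture.

```python
import codecs

raw = codecs.BOM_UTF8 + b"id,name\n1,Alice\n"

# Decoding as plain utf-8 keeps the BOM as a leading U+FEFF character.
with_bom = raw.decode("utf-8")

# The utf-8-sig codec consumes a leading BOM if one is present.
without_bom = raw.decode("utf-8-sig")

# Encoding with plain utf-8 never writes a BOM.
out_bytes = without_bom.encode("utf-8")
```

The `utf-8-sig` codec also accepts input that has no BOM, so it is safe to use on every read.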

Why it matters: Excel-origin data is the dominant input for the target buyer. Getting BOM handling wrong silently breaks the rest of the pipeline.

08, 09, 10 - Line endings: CRLF, CR-only, mixed

Files: 08_line_endings_crlf.csv, 09_line_endings_cr.csv, 10_line_endings_mixed.csv

08: every line ends with CRLF (\r\n). Standard Windows.
09: every line ends with CR (\r) only. Classic Mac. Rare but seen.
10: same file contains all three: CRLF, LF, CR, CRLF, LF.

Expected behavior on output: all lines end with LF (\n). Byte-exact match to the expected files.
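At the byte level the normalization is two replacements, and their order matters. `normalize_line_endings` is a hypothetical name.

```python
def normalize_line_endings(data: bytes) -> bytes:
    """Convert CRLF and lone CR to LF.

    CRLF is replaced first; doing CR first would turn every CRLF
    into two LFs and introduce blank lines.
    """
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
```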

Why LF as the default output: it's what Linux uses, what every modern code editor handles, what Git stores by default, and what Streamlit / pandas write by default. CRLF is an option for buyers who specifically need Windows-style output, but the default minimizes round-trip surprises.

Why it matters: mixed line endings cause "ghost rows" in some parsers, blank lines in some editors, and silent data loss in any tool that splits on one specific newline pattern. Case 10 is the disaster scenario - multi-source concat - and is the most important of the three.

11 - Embedded newlines inside quoted cells

File: test_data/11_embedded_newlines.csv -> expected/11_embedded_newlines.csv

The trap. File-level line-ending normalization must NOT collapse intentional newlines inside multi-line cells (addresses, notes columns). But the embedded line endings should still be normalized to LF for consistency.

Expected behavior:

File-level line endings: LF.
Embedded CRLF inside a quoted cell: normalized to LF.
Embedded CR inside a quoted cell: normalized to LF.
Cell stays multi-line; the newline character count inside the cell is preserved.

Why it matters: an address column with 123 Main St\r\nApt 4B\r\nNew York is the canonical legitimate multi-line cell. A naive text.replace('\r\n', '\n') works correctly. A naive text.split('\n') to "remove blank lines" destroys the address. The cleaner must understand CSV quoting.
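A quoting-aware approach parses first and normalizes per cell. The sketch below uses the standard `csv` module with `newline=""` so that line breaks inside quoted cells reach the parser untouched; the sample data and `normalize_cell` are illustrative.

```python
import csv
import io

raw = 'id,address\r\n1,"123 Main St\r\nApt 4B\rNew York"\r\n'

# newline="" disables newline translation, so the csv module can tell
# record separators from line breaks inside quoted cells.
rows = list(csv.reader(io.StringIO(raw, newline="")))


def normalize_cell(cell: str) -> str:
    """Normalize embedded CRLF and CR to LF without removing line breaks."""
    return cell.replace("\r\n", "\n").replace("\r", "\n")


cleaned = [[normalize_cell(c) for c in row] for row in rows]

buf = io.StringIO(newline="")
csv.writer(buf, lineterminator="\n").writerows(cleaned)
output = buf.getvalue()
```

The address still spans three lines after cleaning; only the line-break characters changed.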

12 - Case operations (opt-in)

Files: input 12_case_variations.csv; three expected outputs:

expected/12_case_variations__default.csv (no flag - identity)
expected/12_case_variations__email_lower.csv (--case email=lower)
expected/12_case_variations__name_title.csv (--case name=title)

Default behavior is preserve case. Case operations are opt-in per column because:

Lowercasing emails is almost always right (the domain part is case-insensitive; RFC 5321 allows the local part to be case-sensitive, but in practice mail systems almost always treat it as case-insensitive).
Title-casing names is almost always right (ALICE SMITH -> Alice Smith), but must handle apostrophes correctly (O'CONNOR -> O'Connor, not O'connor).
Lowercasing product codes is almost always WRONG (SKU-A1B2 is a code, not prose).

So the tool offers per-column case ops, never a global one. The expected outputs cover the two most common configurations.

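Tricky case to verify: row 4 name DAN O'CONNOR. Under --case name=title this must become Dan O'Connor, not Dan O'connor. Python's string.capwords() gets this wrong: it capitalizes only after whitespace and yields Dan O'connor. str.title() happens to get this row right, but it capitalizes after every non-letter, so it mangles contractions (don't -> Don'T). Implementations should use a regex or custom rule that respects apostrophes inside words.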
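It is worth checking the standard-library options against this row directly, alongside one regex-based alternative. `title_name` is a hypothetical helper, not the cleaner's implementation; it assumes smart apostrophes have already been converted to ASCII, and it does not try to preserve embedded acronyms.

```python
import re
import string

name = "DAN O'CONNOR"

via_title = name.title()               # capitalizes after every non-letter
via_capwords = string.capwords(name)   # capitalizes only after whitespace

# A word is a run of letters, optionally continued across apostrophes.
_WORD = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")


def title_name(text: str) -> str:
    """Title-case words; after an apostrophe keep a capital only if the source had one."""
    def fix(match: re.Match) -> str:
        head, *rest = match.group(0).split("'")
        parts = [head.capitalize()]
        for piece in rest:
            parts.append(piece.capitalize() if piece[:1].isupper() else piece.lower())
        return "'".join(parts)

    return _WORD.sub(fix, text)
```

Under this rule an all-caps contraction such as DON'T would still come out as Don'T, so a real implementation may want an explicit exception list.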

Why it matters: dedup quality (case 01 in the deduplicator) depends on consistent case in the comparison columns. Buyers running 02 before 01 expect this to "just work" for the email column.

13 - Non-Latin scripts and emoji (preservation negative test)

File: test_data/13_non_latin_scripts.csv -> expected/13_non_latin_scripts.csv

Negative test: cleaning must not damage characters outside the Latin/punctuation block. Trim and NFC still apply (row 1 has leading and trailing space, which gets trimmed).

Coverage: Chinese (Beijing), Japanese (katakana test), Arabic RTL, Cyrillic Russian, supplementary-plane emoji (party popper U+1F389, rocket U+1F680), accent + emoji combo (café ☕).

Expected behavior: only whitespace and NFC normalization apply. All script-significant characters preserved exactly.

Why it matters: the cleaner must be safe on international buyer data. Stripping "weird-looking" characters because they're outside ASCII is a textbook bug. Most emoji live in the supplementary planes (above U+FFFF), where each character takes four bytes in UTF-8 and a surrogate pair in UTF-16, and naive byte-level filters often mangle them.

14 - Mojibake

Files: input 14_mojibake.csv; two expected outputs:

expected/14_mojibake__default.csv (no flag - bytes preserved, warning logged)
expected/14_mojibake__fixed.csv (--fix-mojibake - heuristic repair)

Mojibake is the result of UTF-8 bytes being interpreted as cp1252 or Latin-1 and re-saved as UTF-8. Classic patterns:

café becomes cafÃ©
München becomes MÃ¼nchen
naïve becomes naÃ¯ve
The smart apostrophe (U+2019) in don’t becomes donâ€™t

Default behavior: warn, do NOT auto-fix. Reasoning: mojibake repair is heuristic, and the heuristic can false-positive on legitimate strings that happen to contain Ã followed by another Latin-1 character. The right call for a tool sold to non-experts is to flag the suspicious pattern in the log and let the user opt in.

With --fix-mojibake (uses ftfy or equivalent): repair attempted. The expected output shows fully repaired text including the smart-apostrophe-via-cp1252 case, which ftfy specifically handles.
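How the corruption arises, and why reversing it is only a heuristic, can be shown without any third-party library. The sketch below is not ftfy and is far less careful than ftfy; `naive_repair` is a hypothetical name used only to illustrate the round trip.

```python
original = "caf\u00e9"

# UTF-8 bytes misread as cp1252: the two bytes of e-acute become two characters.
mojibake = original.encode("utf-8").decode("cp1252")


def naive_repair(text: str) -> str:
    """Undo one round of UTF-8-read-as-cp1252, or return the text unchanged.

    If the text cannot be encoded as cp1252, or the resulting bytes are not
    valid UTF-8, the input is assumed to be fine and is returned as-is.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text
```

The false-positive risk is visible here: a legitimate two-character string consisting of Ã followed by © would be "repaired" into a single é, which is why the default is to warn rather than fix.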

Why it matters: mojibake is silent corruption. The customer doesn't know it happened until a name shows up wrong on a printed invoice. Flagging it is the responsible default.

15 - Whitespace-only cells (the 02-vs-04 boundary)

File: test_data/15_whitespace_only_cells.csv -> expected/15_whitespace_only_cells.csv

Per TECHNICAL.md Section 9.3: 02 trims whitespace first, leaving an empty string. Script 04 then detects empty strings as disguised null. So 02's job in this file is to convert " ", "\t\t", "\u00A0\u00A0", and mixed-whitespace cells all into "".

What 02 does NOT do here:

Does not decide whether the cell is "missing." That's 04's call.
Does not write NaN or N/A or any other sentinel. Just produces empty string.
Does not drop the row. Schema is invariant.

Expected behavior: every whitespace-only cell becomes empty. Row count unchanged. Headers untouched.

Why it matters: this is the single most-relitigated boundary in the bundle. Documenting it via fixture prevents drift.

16 - Dirty headers

File: test_data/16_dirty_headers.csv -> expected/16_dirty_headers.csv

Headers themselves are subject to all the same pollution as data cells. A header "\u00a0Email\u00a0" (NBSP-padded) breaks df["Email"] lookups because the actual column name has NBSP padding. A smart-quoted header "\u201cEmail\u201d" is even worse, because the quote characters are part of the column name.

Expected behavior: headers cleaned by the same rules as data. Note that the smart-quoted header "Email" (with surrounding quotes) becomes a header value containing literal ASCII double quotes, which then requires CSV-quoting in the output. The expected file is written with proper CSV escaping.
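The escaping requirement is what the standard `csv` writer does by default: a field containing the quote character is wrapped in quotes and the inner quotes are doubled. The header values here are illustrative.

```python
import csv
import io

# After cleaning, the first header contains literal ASCII double quotes.
header = ['"Email"', "name"]

buf = io.StringIO(newline="")
csv.writer(buf, lineterminator="\n").writerow(header)
line = buf.getvalue()

# Reading the line back recovers the original header values.
round_trip = next(csv.reader(io.StringIO(line, newline="")))
```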

Why it matters: broken column names break every downstream join, every selectbox in the GUI, and every CLI flag that takes a column name. Cleaning headers is non-negotiable.

17 - Preserve-intended (negative tests)

File: test_data/17_preserve_intended.csv -> expected/17_preserve_intended.csv

The negative-test file. Things 02 must NOT touch because they belong to other scripts:

| Cell content | What 02 does | What 02 does NOT do |
| --- | --- | --- |
| `100` | Trims to `100` | Doesn't reformat as `$100.00` (that's 03) |
| `1 234` | Preserves as `1 234` | Doesn't collapse internal space (looks numeric, European thousand-sep) |
| `$1,500.00` | Trims outer whitespace | Doesn't reformat currency (that's 03) |
| `2024-01-15` | Trims outer whitespace | Doesn't reformat date (that's 03) |
| `(555) 123-4567` | Trims outer whitespace | Doesn't reformat phone (that's 03); does not collapse internal space |
| `+1 555 123 4567` | Trims outer whitespace | Same; phone-shaped, leave internal spacing alone |
| `N/A` | Trims to `N/A` | Doesn't replace with empty or NaN (that's 04) |
| `nan` | Trims to `nan` | Doesn't replace with empty or NaN (that's 04) |

The internal-whitespace heuristic: if a cell parses as numeric, looks like a date, or matches a phone-shape regex (digits + common separators), do NOT collapse internal whitespace. Only collapse in cells classified as free text. This requires a per-cell check; document it in the implementation.
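A per-cell check along these lines might look like the sketch below. Every pattern, the date-format list, the seven-digit threshold, and the function names are assumptions made for illustration; the real classification rules are for the implementation to define.

```python
import re
from datetime import datetime

# Assumed shapes, for illustration only.
_NUMERIC = re.compile(r"^[+-]?\$?\d[\d\s,.]*$")
_PHONE = re.compile(r"^\+?[\d\s().-]+$")
_DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d/%m/%Y")


def _looks_like_date(cell: str) -> bool:
    for fmt in _DATE_FORMATS:
        try:
            datetime.strptime(cell, fmt)
            return True
        except ValueError:
            continue
    return False


def is_free_text(cell: str) -> bool:
    """Return True when internal whitespace may be collapsed."""
    cell = cell.strip()
    if _NUMERIC.match(cell) or _looks_like_date(cell):
        return False
    if _PHONE.match(cell) and sum(ch.isdigit() for ch in cell) >= 7:
        return False
    return True


def collapse_if_text(cell: str) -> str:
    """Always trim; collapse internal whitespace only in free-text cells."""
    cell = cell.strip()
    return re.sub(r"\s+", " ", cell) if is_free_text(cell) else cell
```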

Why it matters: scope discipline. If 02 starts reformatting dates because "while we're trimming whitespace anyway", it stops being 02 and starts being a worse 03. The DECISIONS.md Section 4a rule (functional scope) cuts the other way too: 02 must not reach into other scripts' territory.

18 - Empty file

File: test_data/18_empty_file.csv (zero bytes) -> expected/18_empty_file.csv (zero bytes)

Expected behavior: graceful no-op. Either produces an empty output file with a logged warning, or emits a clean error message naming the problem ("Input file is empty"). What it MUST NOT do: crash with pandas.errors.EmptyDataError traceback in the GUI.
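The first of the two acceptable behaviors (empty output plus a logged warning) could be guarded with a size check before any parser runs. The function name, logger name, and the `cleaner` callback are hypothetical.

```python
import logging
from pathlib import Path
from typing import Callable

log = logging.getLogger("text_cleaner")  # hypothetical logger name


def clean_or_passthrough(inp: Path, out: Path,
                         cleaner: Callable[[Path, Path], None]) -> bool:
    """Run `cleaner` unless the input is zero bytes.

    For an empty input, write an empty output, log a warning that names
    the problem, and return False without invoking the cleaner.
    """
    if inp.stat().st_size == 0:
        log.warning("Input file is empty: %s", inp)
        out.write_bytes(b"")
        return False
    cleaner(inp, out)
    return True
```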

Why it matters: error UX standard from DECISIONS.md Section 4b - errors that name the problem and the fix, not stack traces.

19 - Headers only (no data rows)

File: test_data/19_headers_only.csv -> expected/19_headers_only.csv

Just headers, no data. Headers themselves are dirty (whitespace + NBSP + ZWSP).

Expected behavior: headers cleaned, output is clean headers + no data rows. No crash, no warning required (it's a legitimate state).

Why it matters: template files often look like this. The buyer might be cleaning a template before populating it. Don't punish them for it.

20 - Kitchen sink (integration)

File: test_data/20_kitchen_sink.csv -> expected/20_kitchen_sink.csv

The integration test. Combines:

UTF-8 BOM at file start.
CRLF line endings throughout.
Headers with leading/trailing space, NBSP, smart quotes, ZWSP.
Data cells with NBSP, internal multi-space, smart quotes, em-dash, ellipsis, primes (foot/inch markers).
A whitespace-only cell that should become empty.
Multiplication sign (preserved).

Expected output: every transformation applied correctly, schema unchanged, file written as UTF-8 (no BOM) with LF line endings.

Why it matters: this is the one fixture that catches transformation-order bugs. If smart-quote replacement runs before whitespace trim, you get different output than the other order. Picking and locking the order is part of the implementation; the fixture verifies it.

Recommended transformation pipeline order (informative, not normative):

Decode bytes -> strip BOM at file level.
Normalize file-level line endings -> LF.
Parse CSV (with proper quoting for embedded newlines).
Per cell, in order:
a. Unicode NFC normalize.
b. Strip zero-width and control characters.
c. Strip BOM if it appears mid-cell.
d. Smart-quote ASCII-fy.
e. Normalize embedded line endings to LF.
f. Whitespace trim (outer).
g. Internal whitespace collapse (text columns only - check after trim).
h. Per-column case op (if configured).
Headers go through the same per-cell pipeline.
Write as UTF-8, LF line endings, no BOM.
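The per-cell steps a through g can be sketched as a single function. This is an illustration of the ordering only, not the shipped implementation: all names are hypothetical, the `collapse` flag stands in for the text-column check, step h (case ops) is omitted, and keeping LF out of the collapse pattern is an assumption made so that multi-line cells survive.

```python
import re
import unicodedata

# Zero-width characters plus a mid-cell BOM (U+FEFF).
_INVISIBLE = str.maketrans("", "", "\u200b\u200c\u200d\u200e\u200f\u00ad\u2060\ufeff")
# C0 controls except tab, LF, CR; plus DEL.
_CONTROLS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")
_SMART = str.maketrans({
    "\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'",
    "\u2014": "-", "\u2013": "-", "\u2026": "...",
    "\u2032": "'", "\u2033": '"', "\u00ab": '"', "\u00bb": '"',
})
# Any whitespace except LF, so embedded line breaks are not collapsed.
_INLINE_WS = re.compile(r"[^\S\n]+")


def clean_cell(cell: str, collapse: bool = True) -> str:
    cell = unicodedata.normalize("NFC", cell)               # a
    cell = _CONTROLS.sub("", cell.translate(_INVISIBLE))    # b, c
    cell = cell.translate(_SMART)                           # d
    cell = cell.replace("\r\n", "\n").replace("\r", "\n")   # e
    cell = cell.strip()                                     # f
    if collapse:
        cell = _INLINE_WS.sub(" ", cell)                    # g
    return cell
```

Applying `clean_cell` to its own output returns the same string for the inputs tried here, which is a quick cell-level check of the idempotency property.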

21 - Excel pollution (multi-sheet XLSX)

File: test_data/21_excel_pollution.xlsx (no expected file - manual / programmatic verification per sheet)

Four sheets, each isolating an Excel-specific concern:

Sheet Customers - dirty headers (NBSP, smart quotes, ZWSP) and dirty data cells (NBSP padding, tab padding, smart apostrophe in O'Connor, em-dash). One whitespace-only name cell to verify the 02/04 boundary applies on XLSX too.

Sheet Notes - multi-line cells from Alt+Enter (LF inside cell), plus a cell with mixed CRLF inside (from someone pasting Windows text into Excel). Cells have wrap_text formatting set so the line breaks render in Excel. After cleaning, all in-cell line breaks should be LF.

Sheet International - non-Latin scripts and emoji with surrounding whitespace. Verifies the preservation contract from case 13 holds for XLSX.

Sheet ForceText - leading-zero IDs (e.g., 0001234). These must not be stripped of leading zeros (that's not 02's job - it doesn't change semantic content). Row 3 has a leaked apostrophe ('9999999) from a force-text cell - this is a judgment call but the default is to preserve it; trying to detect "leaked apostrophe" is too error-prone.

Why it matters: XLSX has pollution patterns that don't appear in CSV (Alt+Enter cells, force-text apostrophes, sheet structure). The XLSX reader path needs the same cleaning logic as the CSV reader path; this fixture verifies that.

5. What this corpus does NOT cover

Listed so the gap is explicit, not hidden:

Encoding detection (cp1252 input, Latin-1 input, UTF-16). That's the I/O layer's job, not 02's transformation logic. Once the reader produces a Python str, 02 operates the same regardless of source encoding. Add I/O-layer fixtures separately when that layer is built.
Performance / large files. No multi-GB fixture is included because it bloats the repo. Add a benchmark (not a unit test) targeting a 500MB CSV; verify processing completes without OOM via chunked reads.
Streamlit UI behavior. The fixtures verify cleaning correctness; verifying the GUI shows the right preview, applies the right defaults, and renders cleaning in the diff view is a separate test layer (probably manual, possibly Playwright).
Concurrency / file-locking (e.g., user has the input file open in Excel). Expected to fail with a clean error, not corrupt data. Add a manual test, not a fixture.
CLI argument parsing for the various flags. Each flag should have a Typer-level test, separate from the fixtures here.

6. How to use this corpus

As a build target

Each fixture is one piece of the spec. Implement the cleaner against fixture 01, run, diff, fix, repeat. Move to 02. By the time fixture 20 passes, the script is done.

As pytest fixtures

```python
from pathlib import Path

import pytest

from src.core.text_cleaner import clean_csv

CORPUS = Path("tests/corpus")  # wherever this folder lands

# Fixtures with exactly one expected output under the default config.
# Cases 12 and 14 have multiple expected files; parametrize them separately
# with the relevant flags.
DEFAULT_CASES = [
    "01_whitespace_basic",
    "02_whitespace_unicode",
    "03_smart_punctuation",
    "04_unicode_forms",
    "05_zero_width_invisible",
    "06_control_characters",
    "07_bom_utf8",
    "08_line_endings_crlf",
    "09_line_endings_cr",
    "10_line_endings_mixed",
    "11_embedded_newlines",
    "13_non_latin_scripts",
    "15_whitespace_only_cells",
    "16_dirty_headers",
    "17_preserve_intended",
    "18_empty_file",
    "19_headers_only",
    "20_kitchen_sink",
]


@pytest.mark.parametrize("name", DEFAULT_CASES)
def test_default_config(name, tmp_path):
    """Output must match the expected file byte for byte."""
    inp = CORPUS / "test_data" / f"{name}.csv"
    expected = (CORPUS / "expected" / f"{name}.csv").read_bytes()
    out = tmp_path / "out.csv"
    clean_csv(inp, out)  # default config
    assert out.read_bytes() == expected


# Idempotency property test - applies to every fixture in the same list:
@pytest.mark.parametrize("name", DEFAULT_CASES)
def test_idempotent(name, tmp_path):
    """Cleaning already-cleaned output must not change it."""
    inp = CORPUS / "test_data" / f"{name}.csv"
    out1 = tmp_path / "out1.csv"
    out2 = tmp_path / "out2.csv"
    clean_csv(inp, out1)
    clean_csv(out1, out2)
    assert out1.read_bytes() == out2.read_bytes()
```

Regenerating fixtures

If a default policy changes (e.g., switch the default Unicode form from NFC to NFKC, which would be a meaningful policy decision), the fixtures in expected/ need regenerating. Edit generate_test_data.py and re-run. Document the policy change in DECISIONS.md before doing this.
