diff --git a/DECISIONS.md b/DECISIONS.md
new file mode 100644
index 0000000..07a4fdd
--- /dev/null
+++ b/DECISIONS.md
@@ -0,0 +1,33 @@
+# Product & architecture decisions
+
+A running log of decisions that aren't obvious from the code and would
+otherwise be re-litigated. Newest first.
+
+## 2026-06-08 — PDF to CSV and Reconcile stay in the bundle, under a "Finance" group
+
+**Decision:** `10_pdf_extractor` (PDF to CSV) and `11_reconciler` (Reconcile
+Two Files) remain part of the DataTools suite. In the sidebar they are
+segregated into their own **Finance** section, distinct from the
+file-cleaning tools.
+
+**Context / why this needed deciding:**
+- Both tools sit outside the documented 9-script cleaning architecture
+ (TECHNICAL.md / USER-GUIDE.md stop at the orchestrator).
+- They occupy the "reconciliation / manual data-entry" territory the
+ product's honest-positioning note explicitly placed outside a
+ file-cleaning tool's scope.
+- A journey-level UX review flagged that every extra tool in the main
+ sidebar raises the "which tool do I need?" load for a non-technical
+ buyer, so tools serving a different job should live in a clearly
+ different place.
+
+**Resolution:** Keep them in-bundle (they're built, useful, and ship
+today) but group them under "Finance" so the cleaning flow stays
+uncluttered. Revisit only if a separate finance-focused product emerges.
+
+**Implications:**
+- `tools_registry.py`: Reconcile + PDF to CSV carry a `finance` section.
+- Sidebar order: Start here → Data Cleaners → Transformations →
+ Automations → Finance → Coming soon.
+- This is the source-of-truth realization of the `layout-review/`
+ mockups (see `layout-review/shell.js`).
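
Reviewer sketch of the registry change described under **Implications**. The diff does not show `tools_registry.py`, so the `Tool` dataclass, the `section` field name, the section keys, and `sidebar_groups` are all hypothetical; placing Find Duplicates under Data Cleaners is also an assumption. Only the two Finance tools and the sidebar order come from the decision text.

```python
from dataclasses import dataclass

# Sidebar order as stated in DECISIONS.md; the key spellings are illustrative.
SECTION_ORDER = [
    "start_here",
    "data_cleaners",
    "transformations",
    "automations",
    "finance",
    "coming_soon",
]


@dataclass(frozen=True)
class Tool:
    slug: str     # script name, e.g. "11_reconciler"
    label: str    # name shown in the sidebar
    section: str  # one of SECTION_ORDER


# Hypothetical registry entries, deliberately listed out of sidebar order.
REGISTRY = [
    Tool("11_reconciler", "Reconcile Two Files", "finance"),
    Tool("01_deduplicator", "Find Duplicates", "data_cleaners"),  # section assumed
    Tool("10_pdf_extractor", "PDF to CSV", "finance"),
]


def sidebar_groups(tools):
    """Group tool labels by section, in sidebar order, skipping empty sections."""
    groups = {section: [] for section in SECTION_ORDER}
    for tool in tools:
        groups[tool.section].append(tool.label)
    return [(section, labels) for section, labels in groups.items() if labels]
```

Keeping the order in one list means the Finance group stays below the cleaning flow regardless of how registry entries are ordered.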
diff --git a/layout-review/01_deduplicator.html b/layout-review/01_deduplicator.html
index b00c79f..032439c 100644
--- a/layout-review/01_deduplicator.html
+++ b/layout-review/01_deduplicator.html
@@ -19,29 +19,30 @@
Find Duplicates
-
+
+
+
+ Runs 100% locally
+
+
+
Find rows that repeat, then keep one and remove the extras.
-
-
-
-
- upload_file Drag and drop file here
- Up to 1.5 GB · CSV, TSV, XLSX, XLS · encoding & delimiter auto-detected
-
-
-
-
-
- customers_export.csv
- 2.1 MB
-
+
+
+ description
+ Using customers_export.csv from the upload screen.
+
-
+
Comma (,)
@@ -67,32 +68,33 @@
-
+
+
+
+
85
+
Higher means rows must look more alike to count as a duplicate.
+
+
the most-complete row
+
Which row survives in each group of duplicates.
+
+
+
- Options
+ Advanced options
-
- Advanced Options
-
-
-
-
-
Leave empty to auto-detect
-
-
email ✕
-
-
name ✕
-
-
-
jaro_winkler
-
-
85
-
most-complete
-
-
-
check Merge mode — fill missing fields in the surviving row
+
Leave these empty to auto-detect which columns to compare. Otherwise, list the columns that must match exactly and the ones that only need to match approximately — together these are the columns used to find duplicates.
+
+
+
+
email ✕
+
+
name ✕
-
+
+
jaro_winkler
+
+
+
check Merge mode — fill missing fields in the surviving row
@@ -109,8 +111,9 @@
Match groups
147
Rows kept
18,130
+
Preview of an auto-resolved run: each group keeps its auto-picked survivor. Review the groups below to override any pending picks before the final download.
-
+
@@ -123,6 +126,7 @@
+
Differing columns are highlighted. The survivor row is kept; uncheck a row to split it out of the group.
@@ -140,7 +144,6 @@
-
Differing columns highlighted. The survivor row is kept; uncheck rows to split the group.
@@ -163,8 +166,8 @@
-
Decisions: 1 merged, 1 pending
-
+
Decisions: 1 merged, 1 pending · Pending groups keep their auto-picked survivor unless you review them.
+
@@ -178,6 +181,8 @@
+
arrow_forward Duplicates handled — your file is cleaned. Review the result or Back to Start here →
+ minimal: trim and collapse whitespace only — no character substitutions.
+ excel-hygiene: trim, collapse whitespace, fold smart quotes, strip invisible chars, normalize line endings, and normalize accented characters.
+ paranoid: everything in excel-hygiene plus strip control characters, strip BOM, and normalize accented and look-alike characters (lossy).
+
Make dates, phones, currency, and names look the same throughout.
@@ -76,18 +85,23 @@
Format options
-
+
- US (default) — ISO 8601 dates · E.164 phones · USD
- European — DMY input · INTL phones · EUR comma decimal
+ US (default) — ISO 8601 dates · international-format phones (+1…) · USD
+ European — DMY input · INTL phones · EUR comma decimal base
+ UK — DD/MM/YYYY · GB phones · Yes/No booleans
+ ISO Strict — ISO 8601 · bare-number currency · true/false
+ Legacy US — MM/DD/YYYY · National phones · Yes/No
- Custom — keep current settings
+ Custom — based on European, 2 controls changed modified
-
Pick a published standard or regional convention as the baseline. Every option below is still individually overridable.
+
+ rule
+ Individual controls win over the preset. You started from European, then changed Ambiguous input order and Decimal separator below — so the preset is now Custom. The controls' current values are what actually run.
+
+
Pick a published standard or regional convention as the baseline. Every option below is still individually overridable; overriding any one switches the preset to Custom.
@@ -97,15 +111,16 @@
Dates
YYYY-MM-DD (ISO)
-
+
MDY (US) DMY (EU)
+
Winning value: MDY. Overrides the European base (DMY) — 01/02/2024 reads as 2024-01-02.
Phones
-
E.164 (+15551234567)
+
Standard international format (+15551234567)
US
@@ -117,11 +132,12 @@
Currency
-
+
dot (1,234.56) comma (1.234,56)
+
Winning value: dot. Overrides the European base (comma) — $1,234.5 reads as 1234.50.
2
Preserve original precision (don't round)
@@ -154,9 +170,30 @@
info
- 47 cell(s) in typed columns didn't match a recognizable shape and were left as-is. Check the changes audit below to find them, or re-classify the column to (skip).
+ 47 cell(s) in typed columns didn't match a recognizable shape and were left as-is. See Unparseable cells below to review them, or re-classify the column to (skip). (They aren't in the changes audit — nothing was changed.)
+
+
+ Unparseable cells (47)
+
+
Cells in typed columns that didn't match a recognizable shape and were left unchanged.
+
+
+
row
column
field_type
value (left as-is)
+
+
318
signup_date
date
soon
+
902
phone
phone
ext. 4471
+
1,544
amount
currency
TBD
+
2,087
active
boolean
maybe
+
3,610
signup_date
date
00/00/0000
+
+
+
+
… and 42 more.
+
+
+
Changes by column
@@ -194,6 +231,7 @@
Standardized preview (first 10 rows)
+
Showing 5 of 6 columns — notes is set to (skip), so it's omitted here.
Find blank cells (even hidden ones) and fill them in or remove them.
-
-
Tip: files imported on the Home screen are picked up here automatically.
-
-
-
- upload_file Drag and drop file here
- Up to 1.5 GB · CSV, TSV, XLSX, XLS
-
-
-
-
-
- survey_responses.csv
- 684 KB
-
+
+
+ description
+ Using survey_responses.csv from the upload screen.
+
@@ -63,39 +62,44 @@
-
-
+
+
Missingness profile
+
+
Rows
2,150
+
Cells missing
1,043
+
% cells missing
8.1%
+
Complete rows
1,388
+
+
+
+
column
dtype
missing
missing_pct
disguised
has_missing
+
+
respondent_id
object
0
0.0%
0
False
+
age
float64
187
8.7%
61
True
+
region
object
142
6.6%
142
True
+
income
float64
329
15.3%
118
True
+
satisfaction
float64
95
4.4%
40
True
+
comments
object
290
13.5%
290
True
+
+
+
+
+
+
+
+ Options
-
Missingness profile
-
-
Rows
2,150
-
Cells missing
1,043
-
% cells missing
8.1%
-
Complete rows
1,388
-
-
-
-
-
column
dtype
missing
missing_pct
disguised
has_missing
-
-
respondent_id
object
0
0.0%
0
False
-
age
float64
187
8.7%
61
True
-
region
object
142
6.6%
142
True
-
income
float64
329
15.3%
118
True
-
satisfaction
float64
95
4.4%
40
True
-
comments
object
290
13.5%
290
True
-
-
-
-
-
-
Strategy
+
+ layers
+ Resolution order: per-column override → global strategy → preset. The most specific setting wins; layers it overrides are dimmed.
+
-
+
info Overridden by Global strategy → median (set under Advanced options). Presets apply only when global is “(use preset)”.
+
detect-only (standardize sentinels to NaN, no fill or drop) safe-fill (numeric → median, categorical → mode) drop-incomplete (drop any row with missing)
@@ -112,16 +116,16 @@
Detection
check Standardize disguised nulls to NaN
-
+
N/A, n/a, NA, NULL, null, None, -, --, ?, #N/A
-
Matched case-insensitively after stripping whitespace.
+
Text that really means “empty.” Matched case-insensitively after stripping whitespace.
Strategy override
-
(use preset)
+
median
drop_row / drop_col use the thresholds below. mean / median / interpolate are numeric only — non-numeric columns fall back to the categorical strategy.
@@ -135,11 +139,11 @@
-
1.00
+
1.00
-
1.00
+
1.00
@@ -164,13 +168,13 @@
Set a different strategy for specific columns. Leave any row blank to use the global strategy.
-
Column
Override
+
Column
Override
Resolves to
-
age
median
-
region
mode
-
income
-
satisfaction
-
comments
constant
+
age
(global)
median · global
+
region
(global)
mode · global → categorical fallback
+
income
(global)
median · global
+
satisfaction
(global)
median · global
+
comments
constant
constant · this column
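
Reviewer sketch of the resolution order the mockup describes (per-column override → global strategy → preset), reproducing the "Resolves to" column for three of the rows. The function name, its parameters, and the shape of the preset dict are hypothetical. It also assumes that the "categorical strategy" used for the fallback is the preset's categorical default, and that the numeric-only fallback applies only at the global layer; the mockup does not settle either point.

```python
# Strategies the option help text marks as numeric only.
NUMERIC_ONLY = {"mean", "median", "interpolate"}


def resolve_strategy(column, is_numeric, per_column, global_strategy, preset):
    """Return (strategy, source) for one column; the most specific layer wins.

    per_column: dict of column -> override; missing or empty means "(global)".
    global_strategy: a strategy name, or None to model "(use preset)".
    preset: dict with "numeric" and "categorical" defaults (assumed shape).
    """
    override = per_column.get(column)
    if override:
        return override, "this column"
    if global_strategy is not None:
        if global_strategy in NUMERIC_ONLY and not is_numeric:
            # Numeric-only strategy on a non-numeric column: fall back.
            return preset["categorical"], "global → categorical fallback"
        return global_strategy, "global"
    return preset["numeric" if is_numeric else "categorical"], "preset"


# Values from the mockup: safe-fill preset, global override "median",
# and a per-column override on comments.
safe_fill = {"numeric": "median", "categorical": "mode"}
overrides = {"comments": "constant"}
for name, numeric in [("age", True), ("region", False), ("comments", False)]:
    print(name, resolve_strategy(name, numeric, overrides, "median", safe_fill))
```

Returning the source alongside the strategy is what lets the table show which layer won for each column.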
@@ -198,28 +202,14 @@
Missingness — before vs. after
-
column
before_missing
before_pct
after_missing
after_pct
+
column
before_missing
before_pct
after_missing
after_pct
strategy
-
respondent_id
0
0.0
0
0.0
-
age
187
8.7
0
0.0
-
region
142
6.6
0
0.0
-
income
329
15.3
0
0.0
-
satisfaction
95
4.4
0
0.0
-
comments
290
13.5
0
0.0
-
-
-
-
-
Strategy applied per column
-
-
-
column
strategy
-
-
age
median
-
region
mode
-
income
median
-
satisfaction
median
-
comments
constant
+
respondent_id
0
0.0
0
0.0
—
+
age
187
8.7
0
0.0
median
+
region
142
6.6
0
0.0
mode
+
income
329
15.3
0
0.0
median
+
satisfaction
95
4.4
0
0.0
median
+
comments
290
13.5
0
0.0
constant
@@ -262,6 +252,8 @@
+
arrow_forward Missing values handled. Next, most files need: Find Duplicates →
Rename columns, change their order, and set each one as text, number, or date.
-
-
You can also import a file on the home screen and pick it up here.
-
-
-
- upload_file Drag and drop file here
- Up to 1.5 GB · CSV, TSV, XLSX, XLS · encoding & delimiter auto-detected
-
-
-
-
-
- crm_contacts_raw.csv
- 684 KB
-
+
+
+ description
+ Using crm_contacts_raw.csv from the upload screen.
+
@@ -75,9 +74,9 @@
Build interactively (start from current columns) Import schema JSON
- Skip (rename / coerce only — no schema)
+ Skip (rename / convert types only — no schema)
-
An interactive build is fastest for one-off cleanup. Import a JSON when you have a fixed contract (a CRM import format, db schema). Skip when you only want to rename or coerce specific columns.
+
An interactive build is fastest for one-off cleanup. Import a JSON when you have a fixed contract (a CRM import format, db schema). Skip when you only want to rename or convert the type of specific columns.
Edit the table to define your target schema. Add rows for fields the input doesn't have yet (with a default), or remove rows for columns you want to drop.
Pick a target for each source column. Notes stays unmapped — with the lenient preset it is kept as-is. source is added from the schema default.
+
Pick a target for each source column. Notes stays unmapped — with the keep-extras strategy it is kept as-is. source is added from the schema default.
+
+
+
+
+
+
Strategy
+
+
+
+ rename-only (just rename, leave types alone, keep extras)
+ lenient-schema (rename + convert types + reorder, keep extras)
+ strict-schema (rename + convert types + reorder, drop extras) base
+ Custom — based on strict-schema, 1 control changed modified
+
+
+ rule
+ Individual Advanced controls win over the preset. You started from strict-schema, then changed Unmapped source columns to keep below — so the preset is now Custom. The controls' current values are what actually run.
+
+
Pick a strategy as the baseline. Every Advanced toggle below is still individually overridable; overriding any one switches the preset to Custom.
+
+
+
+
+ Advanced options
+
+
+
+
+
+
keep
+
Winning value: keep. Overrides the strict-schema base (drop) — so Notes survives into the output.
+
+
check Convert each column to the right type
+
check Reorder to schema order
+
+
+
check Auto-infer mapping (fuzzy match)
+
+
+
0.80
+
+
check Enforce required fields
+
+
+
+
@@ -176,20 +186,6 @@
info Added (with defaults): source
warning Some cells could not be coerced and were left as NaN: amount_spent (3)
visibility
- Static layout preview of Find Unusual Values — a Coming Soon tool. The page is a stub/teaser: an "under development" notice, a list of planned features, and disabled placeholder controls (only the file uploader is live). All pages →
+ Static layout preview of Find Unusual Values — a Coming Soon tool. The page is a stub: a "coming soon" notice, a plain-English list of what the tool will do, and a single "Notify me" action. All pages →
@@ -25,62 +25,26 @@
-
+
info
- This tool is under development.
+ This tool is coming soon.
-
-
Features:
+
+
What it will do:
-
Z-score detection (configurable threshold)
-
IQR (interquartile range) detection
-
MAD (median absolute deviation) detection
-
Domain-rule violations (e.g., age < 0, price > $1M)
-
Visual outlier highlighting in data preview
-
Handling: flag only, remove, cap/winsorize to bounds
+
Find values that are unusually high or low for a column
+
Spot values that break the rules you set (out of range, wrong type)
+
Choose how sensitive the check is
+
Flag unusual rows by adding a column, without changing your data
+
Cap extreme values at a limit you choose
+
See a summary of how many values were flagged
-
-
-
-
- upload_file Drag and drop file here
- CSV, TSV, XLSX, XLS · Import a file to preview. Processing is not yet available.
-
visibility
- Static layout preview of Combine Files — a Coming-Soon tool. The page is a stub: an "under development" notice, a planned-features list, a working multi-file uploader, and disabled placeholder options. All pages →
+ Static layout preview of Combine Files — a Coming Soon tool. The page is a stub: a "coming soon" notice, a plain-English list of what the tool will do, and a single "Notify me" action. All pages →
@@ -23,56 +23,28 @@
Combine several CSV or Excel files into one — even if columns differ.
-
+
+
+
info
- This tool is under development.
+ This tool is coming soon.
-
-
Features:
-
-
Import multiple CSV/Excel files at once
-
Automatic schema alignment (matching columns by name)
-
Append mode: stack files vertically (union)
-
Join mode: merge files on shared key columns
-
Handle mismatched columns (fill missing with nulls or drop)
-
Source file tracking column
+
+
What it will do:
+
+
Import several CSV or Excel files at once
+
Line up columns automatically by matching their names
+
Stack files on top of each other into one long file
+
Merge files side by side using shared key columns
+
Handle columns that don't match (fill the gaps with blanks or drop them)
+
Add a column showing which file each row came from
-
-
-
-
- upload_file Drag and drop files here
- CSV, TSV, XLSX, XLS · multiple files allowed
-
-
-
-
Import multiple files to preview. Processing is not yet available.
visibility
- Static layout preview of Quality Check, a Coming-Soon tool. The page is a stub: an "under development" notice, a feature list, a working file uploader, and disabled placeholder controls. All pages →
+ Static layout preview of Quality Check — a Coming Soon tool. The page is a stub: a "coming soon" notice, a plain-English list of what the tool will do, and a single "Notify me" action. All pages →
@@ -25,64 +25,26 @@
-
+
info
- This tool is under development.
+ This tool is coming soon.
Run several tools in a row — save the steps once, reuse them anytime.
@@ -67,69 +76,192 @@
Options
-
+
- Use the recommended default (text-clean → format → missing → dedup)
- Build interactively
+ Use the recommended default (Clean Text → Standardize → Fix Missing → Find Duplicates) · modified
+ Build interactively
+ Import a saved pipeline JSON
+
+ edit
+ You started from the recommended default and edited a step, so the mode switched to Build interactively. The steps below are now yours to change — pick recommended default again to discard your edits and restore the suggested order.
+
+
- Edit the table to add, remove, reorder (drag the row index), enable, or configure each step.
+ Add, remove, reorder (drag the row index), enable, or configure each step.
+ Open a step's Configure panel to set its options in plain language.
Tool order is recommended, not enforced — violations surface as warnings below the table.
+
+
-
+ Recommended tool order — why each step belongs where it does
text_clean before format_standardize — format parsers (phone / currency / date) fail on smart-quote-contaminated or NBSP-padded input — clean text first
+ info
+ 141 phone values didn't match any known pattern and were left unchanged. The step still completed — review them in the output preview if needed.
+
Compare two lists of transactions (e.g. bank vs. ledger) and flag what doesn't match.
@@ -30,18 +39,11 @@
Left (e.g. bank feed)
-
-
- upload_file Drag and drop file here
- CSV, TSV, XLSX, XLS
-
- Browse files
-
-
-
- bank_feed_may.csv
- 214 KB
+
+ description
+ Using bank_feed_may.csv from the upload screen.
+ Use a different file
bank_feed_may.csv — 1,204 rows, 4 columns
Preview left (e.g. bank feed)
@@ -63,18 +65,11 @@
Right (e.g. ledger)
-
-
- upload_file Drag and drop file here
- CSV, TSV, XLSX, XLS
-
- Browse files
-
-
-
- ledger_may.xlsx
- 96 KB
+
+ description
+ Using ledger_may.xlsx from the upload screen.
+ Use a different file
ledger_may.xlsx — 1,198 rows, 5 columns
Preview right (e.g. ledger)
@@ -105,7 +100,7 @@
Left columns
posted_date
description
-
amount
+
amount
ref ✕
@@ -114,9 +109,10 @@
Right columns
txn_date
memo
-
value
+
value
-
invoice_no ✕
+
invoice_no ✕
+
check_circle 1 reference each side — counts match
@@ -132,7 +128,7 @@
1
Allow N calendar days of drift between posting dates.
-
Invert right amount sign
+
Use when one side records debits as positive and the other as negative.
@@ -150,56 +146,34 @@
Results
-
Matched
1,173
Review
9
Unmatched left
22
Unmatched right
16
+
Matched
1,173
Coverage: 97.4% of the larger side
-
+
- Matched (1,173)
- Review (9)
+ Review (9) · Unmatched left (22) · Unmatched right (16)
+ Matched (1,173)
-
-
Preview of first 25 of 1,173 rows — download the CSV below for the full set.
+
+
Pairs flagged because the algorithm couldn't pick a single best match (e.g. multiple equally-good candidates). Use the left/right indices to disambiguate manually.
-
-
left_posted_date
left_description
left_amount
-
right_txn_date
right_memo
right_value
amount_diff
-
+
left_idx
left_amount
right_idx
right_value
candidates
-
2026-05-01
ACME SUPPLIES
-1240.00
2026-05-01
Acme Supplies Inc
-1240.00
0.00
-
2026-05-02
PAYROLL RUN
-8800.00
2026-05-02
Monthly payroll
-8800.00
0.00
-
2026-05-03
CLIENT GLOBEX
5200.00
2026-05-03
Globex retainer
5200.00
0.00
-
2026-05-04
UTILITY CO
-318.42
2026-05-04
City Utilities
-318.40
0.02
-
2026-05-06
OFFICE DEPOT
-89.15
2026-05-07
Office supplies
-89.15
0.00
+
118
-450.00
121, 209
-450.00
2 equal
+
203
1000.00
198, 244
1000.00
2 equal
-
- Review (9) — ambiguous candidates
-
-
Pairs flagged because the algorithm couldn't pick a single best match (e.g. multiple equally-good candidates). Use the left/right indices to disambiguate manually.
-
-
-
left_idx
left_amount
right_idx
right_value
candidates
-
-
118
-450.00
121, 209
-450.00
2 equal
-
203
1000.00
198, 244
1000.00
2 equal
-
-
-
-
-
-
Unmatched left (22) — only in bank_feed_may.csv
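
Reviewer sketch of how a row could end up in Matched, Review, or Unmatched, based only on what the mockup states: a date-drift window of N calendar days, and Review meaning more than one equally good candidate. The reconciler's real algorithm is not part of this diff. The function names, the row shape, and the `amount_tol` parameter are hypothetical; the 0.02 default is an assumption taken from the UTILITY CO preview row, which is shown as matched with an `amount_diff` of 0.02. A real matcher would also need to stop one right-hand row from being claimed twice, which this sketch ignores.

```python
from datetime import date
from decimal import Decimal


def find_candidates(left, right_rows, drift_days=1, amount_tol=Decimal("0.02")):
    """Indices of right rows within the date-drift window and amount tolerance."""
    return [
        i
        for i, row in enumerate(right_rows)
        if abs((row["date"] - left["date"]).days) <= drift_days
        and abs(row["amount"] - left["amount"]) <= amount_tol
    ]


def classify(left, right_rows, **options):
    """Return (bucket, candidate_indices) for one left-hand row."""
    candidates = find_candidates(left, right_rows, **options)
    if not candidates:
        return "unmatched", candidates
    if len(candidates) == 1:
        return "matched", candidates
    return "review", candidates  # several equally good candidates


# Hypothetical ledger rows; the first two mirror preview rows in the mockup.
ledger = [
    {"date": date(2026, 5, 7), "amount": Decimal("-89.15")},
    {"date": date(2026, 5, 4), "amount": Decimal("-318.40")},
    {"date": date(2026, 5, 10), "amount": Decimal("-450.00")},
    {"date": date(2026, 5, 10), "amount": Decimal("-450.00")},
]
bank_row = {"date": date(2026, 5, 6), "amount": Decimal("-89.15")}
print(classify(bank_row, ledger))
```

`Decimal` is used rather than `float` so that a 0.02 difference compares exactly against the tolerance.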
@@ -232,14 +206,37 @@
+
+ Matched (1,173) — cleanly reconciled
+
+
Preview of first 25 of 1,173 rows — download the CSV below for the full set.
+
+
+
+
left_posted_date
left_description
left_amount
+
right_txn_date
right_memo
right_value
amount_diff
+
+
+
2026-05-01
ACME SUPPLIES
-1240.00
2026-05-01
Acme Supplies Inc
-1240.00
0.00
+
2026-05-02
PAYROLL RUN
-8800.00
2026-05-02
Monthly payroll
-8800.00
0.00
+
2026-05-03
CLIENT GLOBEX
5200.00
2026-05-03
Globex retainer
5200.00
0.00
+
2026-05-04
UTILITY CO
-318.42
2026-05-04
City Utilities
-318.40
0.02
+
2026-05-06
OFFICE DEPOT
-89.15
2026-05-07
Office supplies
-89.15
0.00
+
+
+
+
+
+
-
+
- Matched CSV · Review CSV · Unmatched left · Unmatched right
+ Matched CSV