+ visibility
+ Static layout preview of Clean Text, shown with a file imported and a completed run (results metrics, changes-by-column, before/after examples, cleaned preview, downloads). All pages →
+
+
+
+
+
+
Clean Text
+
+
+
Trim extra spaces and strip out odd characters.
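A minimal sketch of what this kind of cleaning can look like for a single cell. The exact character set the tool strips is not shown in this preview, so the set below (NBSP, zero-width characters, smart quotes) and the name `clean_text` are assumptions for illustration only.

```python
import re

# Assumed "odd characters": zero-width space/joiners, word joiner, BOM.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))
# Curly quotes mapped to their straight equivalents.
SMART_QUOTES = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})


def clean_text(value: str) -> str:
    """Normalize one cell: drop zero-width chars, straighten quotes, trim spaces."""
    value = value.translate(ZERO_WIDTH)      # delete zero-width characters
    value = value.translate(SMART_QUOTES)    # curly quotes -> straight quotes
    value = value.replace("\u00a0", " ")     # NBSP -> regular space
    value = re.sub(r"[ \t]+", " ", value)    # collapse runs of spaces and tabs
    return value.strip()
```

For example, `clean_text("\u00a0 jane\u200b  DOE ")` yields `"jane DOE"`.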
+
+
+
+
+
+
+
+ upload_file Drag and drop file here
+ Up to 1.5 GB · CSV, TSV, XLSX, XLS · encoding auto-detected
+
+ visibility
+ Static layout preview of Standardize Formats, shown with a file imported from the upload screen and a completed run (results + changes audit + standardized preview). All pages →
+
+
+
+
+
+
Standardize Formats
+
+
+
Make dates, phones, currency, and names look the same throughout.
+
+
+
+
+
+ description
+ Using customers_export.csv from the upload screen.
+
+
+
+
+
+ Preview: customers_export.csv
+
+
18,442 rows, 6 columns
+
+
+
full_name
phone
amount
signup_date
active
+
+
0
jane DOE
(512) 555-0190
$1,234.5
01/04/2024
Y
+
1
bob smith
720.555.7781
$99
2024-2-11
yes
+
2
ALICIA REYES
+1 415 555 2233
$45,000
Mar 3, 2024
n
+
3
m. okafor
2125550148
$7.999
2024/04/22
true
+
+
+
+
+
+
+
+
+
+
+ Options
+
+
+
Column types
+
Assign each column to a field type. Auto-detected suggestions are pre-filled; pick (skip) to leave a column untouched.
+
+
+
+
Name
+
Phone
+
Currency
+
+
+
Date
+
Boolean
+
(skip)
+
+
+
+
Format options
+
+
+
+
+
+ US (default) — ISO 8601 dates · E.164 phones · USD
+ European — DMY input · INTL phones · EUR comma decimal
+ UK — DD/MM/YYYY · GB phones · Yes/No booleans
+ ISO Strict — ISO 8601 · bare-number currency · true/false
+ Legacy US — MM/DD/YYYY · National phones · Yes/No
+ Custom — keep current settings
+
+
Pick a published standard or regional convention as the baseline. Every option below is still individually overridable.
+
+
+
+
+
+
+
Dates
+
YYYY-MM-DD (ISO)
+
+
+
+ MDY (US)
+ DMY (EU)
+
+
+
+
Phones
+
E.164 (+15551234567)
+
+
+
US
+
Region assumed when the input has no country code (for example US, GB, or DE).
+ info
+ 47 cell(s) in typed columns didn't match a recognizable shape and were left as-is. Check the changes audit below to find them, or re-classify the column to (skip).
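As a rough illustration of the E.164 option with a US fallback region, the sketch below normalizes the phone values from the preview and leaves anything without a recognizable shape untouched. The function name `to_e164` and the length checks are assumptions; the real tool presumably relies on a full phone-number library and supports regions other than US.

```python
import re


def to_e164(raw: str, default_region: str = "US") -> str:
    """Return raw as +<country><number>, or raw unchanged if unrecognized."""
    text = raw.strip()
    digits = re.sub(r"\D", "", text)  # keep digits only
    if text.startswith("+"):
        # Country code already present: E.164 allows at most 15 digits.
        return "+" + digits if 8 <= len(digits) <= 15 else raw
    if default_region == "US":
        if len(digits) == 10:
            return "+1" + digits      # national number, add country code
        if len(digits) == 11 and digits.startswith("1"):
            return "+" + digits       # leading 1 already included
    return raw  # no recognizable shape: leave the cell as-is
```

Cells that fall through to the last line are the kind the notice above reports as left as-is.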
+
+ visibility
+ Static layout preview of Fix Missing Values, shown with a file imported and a completed run (per-column missingness profile + before/after results). All pages →
+
+
+
+
+
+
Fix Missing Values
+
+
+
Find blank cells (even hidden ones) and fill them in or remove them.
+
+
+
+
+
Tip: files imported on the Home screen are picked up here automatically.
+
+
+
+ upload_file Drag and drop file here
+ Up to 1.5 GB · CSV, TSV, XLSX, XLS
+
+
+
+
+
+ survey_responses.csv
+ 684 KB
+
+
+
+
+
+ Preview: survey_responses.csv
+
+
2,150 rows, 6 columns
+
+
+
respondent_id
age
region
income
satisfaction
comments
+
+
0
R-1001
34
West
52000
4
great service
+
1
R-1002
N/A
East
3
?
+
2
R-1003
41
-
61000
NULL
none
+
3
R-1004
29
South
N/A
5
quick
+
+
+
+
+
+
+
+
+
+
+ Options
+
+
+
Missingness profile
+
+
Rows
2,150
+
Cells missing
1,043
+
% cells missing
8.1%
+
Complete rows
1,388
+
+
+
+
+
column | dtype | missing | missing_pct | disguised | has_missing
respondent_id | object | 0 | 0.0% | 0 | False
age | float64 | 187 | 8.7% | 61 | True
region | object | 142 | 6.6% | 142 | True
income | float64 | 329 | 15.3% | 118 | True
satisfaction | float64 | 95 | 4.4% | 40 | True
comments | object | 290 | 13.5% | 290 | True
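The profile figures can be reproduced with a few counts. The sketch below is standard-library Python over a list of row dicts in which missing cells are already `None`; the function name and return shape are illustrative assumptions, not the app's actual code (which, judging by the dtype column, likely works on a pandas DataFrame).

```python
def missingness_profile(rows, columns):
    """Count missing (None) cells per column and overall."""
    n_rows = len(rows)
    per_column = []
    for col in columns:
        missing = sum(1 for row in rows if row.get(col) is None)
        pct = round(100 * missing / n_rows, 1) if n_rows else 0.0
        per_column.append(
            {"column": col, "missing": missing,
             "missing_pct": pct, "has_missing": missing > 0}
        )
    cells_missing = sum(c["missing"] for c in per_column)
    total_cells = n_rows * len(columns)
    complete_rows = sum(
        1 for row in rows if all(row.get(c) is not None for c in columns)
    )
    return {
        "rows": n_rows,
        "cells_missing": cells_missing,
        "pct_cells_missing": round(100 * cells_missing / total_cells, 1) if total_cells else 0.0,
        "complete_rows": complete_rows,
        "columns": per_column,
    }
```

The same arithmetic matches the preview: 1,043 missing cells out of 2,150 × 6 = 12,900 cells is 8.1%.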
+
+
+
+
+
+
+
Strategy
+
+
+
+ detect-only (standardize sentinels to NaN, no fill or drop)
+ safe-fill (numeric → median, categorical → mode)
+ drop-incomplete (drop any row with missing)
+
+
detect-only: replace 'N/A', '-', 'NULL', etc. with real NaN, then stop. safe-fill: standardize the same way, then fill numeric columns with the median and all other columns with the mode. drop-incomplete: standardize the same way, then drop every row that has any missing cell.
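The three presets can be sketched as one function. This is a hedged illustration in plain Python: it assumes sentinels have already been standardized to `None` (standing in for NaN), and the name `apply_strategy` and the list-of-dicts data shape are hypothetical.

```python
from statistics import median, mode


def apply_strategy(rows, columns, strategy="detect-only"):
    """Apply one preset to rows whose missing cells are already None."""
    if strategy == "detect-only":
        return [dict(row) for row in rows]  # nothing beyond standardization
    if strategy == "drop-incomplete":
        return [dict(row) for row in rows
                if all(row.get(col) is not None for col in columns)]
    if strategy != "safe-fill":
        raise ValueError(f"unknown strategy: {strategy}")

    # safe-fill: learn one fill value per column from the non-missing cells.
    fills = {}
    for col in columns:
        present = [row[col] for row in rows if row.get(col) is not None]
        if not present:
            continue  # fully empty column: nothing to learn from
        numeric = all(isinstance(v, (int, float)) and not isinstance(v, bool)
                      for v in present)
        fills[col] = median(present) if numeric else mode(present)

    filled = []
    for row in rows:
        out = dict(row)
        for col in columns:
            if out.get(col) is None:
                out[col] = fills.get(col)
        filled.append(out)
    return filled
```

Median and mode are used here because they are the fills the preset description names; how the real tool breaks ties for the mode is not shown in this preview.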
+
+
+
+
+ Advanced options
+
+
+
+
Detection
+
check Standardize disguised nulls to NaN
+
+
+
N/A, n/a, NA, NULL, null, None, -, --, ?, #N/A
+
Matched case-insensitively after stripping whitespace.
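A small sketch of the matching rule just described: strip whitespace, compare case-insensitively against the sentinel list, and replace hits with a real missing value. `None` stands in for NaN here, and treating an empty string as missing is an added assumption; `standardize_nulls` is a hypothetical name.

```python
# Sentinel list copied from the option above.
SENTINELS = ("N/A", "n/a", "NA", "NULL", "null", "None", "-", "--", "?", "#N/A")
_SENTINEL_KEYS = {s.strip().lower() for s in SENTINELS}


def standardize_nulls(cell):
    """Return None for blank or disguised-null strings, else the cell unchanged."""
    if cell is None:
        return None
    if isinstance(cell, str):
        key = cell.strip().lower()  # strip first, then compare case-insensitively
        if key == "" or key in _SENTINEL_KEYS:  # empty string: assumption
            return None
    return cell
```

Only exact matches after stripping are replaced, so a value such as `"N/A."` is left alone.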
+
+
+
+
Strategy override
+
+
+
(use preset)
+
drop_row / drop_col use the thresholds below. mean / median / interpolate are numeric only — non-numeric columns fall back to the categorical strategy.
+
+
+
+
mode
+
+
+
+
+
Drop thresholds
+
+
+
+
1.00
+
+
+
+
1.00
+
+
+
+
Scope
+
+
+
+ respondent_id ✕
+ age ✕
+ region ✕
+ income ✕
+ satisfaction ✕
+ comments ✕
+
+
+
+
+
Choose columns
+
+
+
Per-column strategy overrides (optional)
+
Set a different strategy for specific columns. Leave any row blank to use the global strategy.
+ visibility
+ Static layout preview of Map Columns, shown with a file imported, an interactive target schema + mapping configured, and a completed run (results + mapped preview). All pages →
+
+
+
+
+
+
Map Columns
+
+
+
Rename columns, change their order, and set each one as text, number, or date.
+
+
+
+
+
You can also import a file on the home screen and pick it up here.
+
+
+
+ upload_file Drag and drop file here
+ Up to 1.5 GB · CSV, TSV, XLSX, XLS · encoding & delimiter auto-detected
+
+
+
+
+
+ crm_contacts_raw.csv
+ 684 KB
+
+
+
+
+
+ Preview: crm_contacts_raw.csv
+
+
4,210 rows, 6 columns
+
+
+
Full Name
EmailAddr
Phone #
Signup
Amount Spent
Notes
+
+
0
Jane Doe
jane@acme.io
512-555-0190
01/04/2024
$1,204.50
VIP
+
1
Bob Smith
bob@globex.com
720-555-7781
02/11/2024
$88.00
+
2
Carla Reyes
carla@initech.net
415-555-3322
03/02/2024
$612.10
renewal
+
3
Dev Patel
dev@umbrella.co
206-555-9043
03/19/2024
$0.00
+
+
+
+
+
+
+
+
+
+
+ Options
+
+
+
+
Target schema
+
+
+
+ Build interactively (start from current columns)
+ Import schema JSON
+ Skip (rename / coerce only — no schema)
+
+
An interactive build is fastest for one-off cleanup. Import a JSON when you have a fixed contract (a CRM import format, db schema). Skip when you only want to rename or coerce specific columns.
+
+
+
Edit the table to define your target schema. Add rows for fields the input doesn't have yet (with a default), or remove rows for columns you want to drop.
+
+
+
+
+
Target name
Type
Required
Default (for added cols)
Aliases (comma-sep, helps fuzzy-match)
+
+
full_name
string
✗
Full Name, name
+
email
string
✓
EmailAddr, email_address
+
phone
string
✗
Phone #, tel
+
signup_date
date
✗
Signup
+
amount_spent
float
✗
0.0
Amount Spent
+
source
string
✗
crm-import
+
add add row
+
+
+
+
6 target fields · 1 added field (source) not present in the input.
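To make the schema table concrete, here is a hedged sketch of applying a few of its rows to one input record. It matches aliases by a normalized exact comparison rather than the fuzzy matching the column header mentions, coerces only the `float` type, and leaves dates untouched; the dict layout and the name `map_row` are assumptions.

```python
# Subset of the target schema above, in a hypothetical dict layout.
SCHEMA = [
    {"name": "full_name", "type": "string", "aliases": ["Full Name", "name"]},
    {"name": "email", "type": "string", "aliases": ["EmailAddr", "email_address"]},
    {"name": "amount_spent", "type": "float", "default": 0.0, "aliases": ["Amount Spent"]},
    {"name": "source", "type": "string", "default": "crm-import", "aliases": []},
]


def _norm(name):
    """Lowercase and keep only letters and digits, so 'Full Name' ~ 'full_name'."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def map_row(row, schema):
    """Rename, default, and coerce one input row; unmapped columns are dropped."""
    lookup = {_norm(key): key for key in row}
    out = {}
    for field in schema:
        candidates = [field["name"], *field.get("aliases", [])]
        source_key = next(
            (lookup[_norm(c)] for c in candidates if _norm(c) in lookup), None
        )
        value = row[source_key] if source_key is not None else field.get("default")
        if field["type"] == "float" and isinstance(value, str):
            cleaned = value.replace("$", "").replace(",", "").strip()
            value = float(cleaned) if cleaned else field.get("default")
        out[field["name"]] = value
    return out
```

Applied to the first preview row, this renames `Full Name` and `EmailAddr`, turns `$1,204.50` into `1204.5`, drops `Notes`, and adds `source` with its default.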
+ visibility
+ Static layout preview of Find Unusual Values, a Coming-Soon tool. The page is a stub/teaser: an "under development" notice, a list of planned features, and disabled placeholder controls (only the file uploader is live). All pages →
+
+
+
+
+
+
Find Unusual Values
+
+
+
Spot values that look wrong — way too high, too low, or breaking your rules.
+
+
+
+
+
+ info
+ This tool is under development.
+
+
+
+
Features:
+
+
Z-score detection (configurable threshold)
+
IQR (interquartile range) detection
+
MAD (median absolute deviation) detection
+
Domain-rule violations (e.g., age < 0, price > $1M)
+
Visual outlier highlighting in data preview
+
Handling: flag only, remove, cap/winsorize to bounds
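Since the tool is not implemented yet, the following is only a sketch of how two of the listed detectors are commonly defined, using the standard library. The default thresholds (1.5 × IQR, modified z-score of 3.5) are conventional choices, not values taken from the app, and the function names are hypothetical. Z-score detection is omitted for brevity.

```python
from statistics import median, quantiles


def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


def mad_outliers(values, threshold=3.5):
    """Flag values whose modified z-score (based on the MAD) exceeds threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread around the median; nothing can be scored
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]
```

Both functions only flag values; removing or capping them would be a separate step, in line with the handling options listed above.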
+
+
+
+
+
+
+
+
+ upload_file Drag and drop file here
+ CSV, TSV, XLSX, XLS · Import a file to preview. Processing is not yet available.
+
+ visibility
+ Static layout preview of Combine Files — a Coming-Soon tool. The page is a stub: an "under development" notice, a planned-features list, a working multi-file uploader, and disabled placeholder options. All pages →
+
+
+
+
+
+
Combine Files
+
+
+
Combine several CSV or Excel files into one — even if columns differ.
+
+
+
+ info
+ This tool is under development.
+
+
+
+
Features:
+
+
Import multiple CSV/Excel files at once
+
Automatic schema alignment (matching columns by name)
+
Append mode: stack files vertically (union)
+
Join mode: merge files on shared key columns
+
Handle mismatched columns (fill missing with nulls or drop)
+
Source file tracking column
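Because processing is not available yet, the sketch below only illustrates what append mode with schema alignment and a source-tracking column could look like. It takes the union of column names in first-seen order, fills absent columns with `None`, and adds a `source_file` column; the column name and the function `append_tables` are assumptions.

```python
def append_tables(tables):
    """Stack tables vertically.

    tables: mapping of file name -> list of row dicts.
    Returns (columns, rows), with missing columns filled with None.
    """
    columns = []
    for rows in tables.values():
        for row in rows:
            for col in row:
                if col not in columns:
                    columns.append(col)  # union of columns, first-seen order

    combined = []
    for name, rows in tables.items():
        for row in rows:
            out = {col: row.get(col) for col in columns}  # None where absent
            out["source_file"] = name                     # tracking column
            combined.append(out)
    return columns + ["source_file"], combined
```

Join mode, which merges on shared key columns, would need a different approach and is not covered here.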
+
+
+
+
+
+
+
+
+ upload_file Drag and drop files here
+ CSV, TSV, XLSX, XLS · multiple files allowed
+
+
+
+
Import multiple files to preview. Processing is not yet available.
+ visibility
+ Static layout preview of Quality Check, a Coming-Soon tool. The page is a stub: an "under development" notice, a feature list, a working file uploader, and disabled placeholder controls. All pages →
+
+
+
+
+
+
Quality Check
+
+
+
Check your file against rules you set, and export a PDF or Excel report.
+ visibility
+ Static layout preview of Automated Workflows (Pipeline Runner), shown with a file imported, a four-step pipeline configured, and a completed run (results + per-step summary). All pages →
+
+
+
+
+
+
Automated Workflows
+
+
+
Run several tools in a row — save the steps once, reuse them anytime.
+
+
+
+
+
+
+
+ upload_file Drag and drop file here
+ Up to 1.5 GB · CSV, TSV, XLSX, XLS · encoding & delimiter auto-detected
+
+
+
+
+
+ customers_export.csv
+ 2.1 MB
+
+
+
+
+
+ Preview: customers_export.csv
+
+
18,442 rows, 6 columns
+
+
+
name
email
city
phone
signup_date
+
+
0
Jane Doe
jane@acme.io
Austin
512-555-0190
2024-01-04
+
1
jane doe
JANE@ACME.IO
austin
(512) 555-0190
01/04/2024
+
2
Bob Smith
bob@globex.com
Denver
720.555.7781
2024-02-11
+
3
R. Smith
bob@globex.com
—
720-555-7781
Feb 11 2024
+
+
+
+
+
+
+
+
+
+
+ Options
+
+
+
+
+
+
+ Use the recommended default (text-clean → format → missing → dedup)
+ Build interactively
+ Import a saved pipeline JSON
+
+
+
+
+ Edit the table to add, remove, reorder (drag the row index), enable, or configure each step.
+ Tool order is recommended, not enforced — violations surface as warnings below the table.
+
+
+
+
+
+
+ Recommended tool order — why each step belongs where it does
+
+
text_clean before format_standardize: format parsers (phone / currency / date) fail on input contaminated with smart quotes or padded with NBSPs, so clean text first
+
text_clean before missing: sentinel detection misses cells padded with NBSP / zero-width characters, so clean text first
+
text_clean before dedup: fuzzy matching treats NBSP-padded values as different, so clean text first
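Since the order is recommended rather than enforced, a check like the following could produce the warnings. It is a minimal sketch: the tool identifiers come from the rules above, while the function name, rule representation, and warning wording are assumptions.

```python
# Ordering rules from the list above, as (earlier_tool, later_tool) pairs.
RULES = [
    ("text_clean", "format_standardize"),
    ("text_clean", "missing"),
    ("text_clean", "dedup"),
]


def order_warnings(steps):
    """Return one warning per violated rule; never raises or reorders."""
    position = {tool: index for index, tool in enumerate(steps)}
    warnings = []
    for before, after in RULES:
        if before in position and after in position:
            if position[before] > position[after]:
                warnings.append(f"{before} should run before {after}")
    return warnings
```

A rule is only checked when both tools are present, so a pipeline without `text_clean` produces no warnings in this sketch.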
+
+
+
+
diff --git a/layout-review/10_pdf_extractor.html b/layout-review/10_pdf_extractor.html
new file mode 100644
index 0000000..8eb11fc
--- /dev/null
+++ b/layout-review/10_pdf_extractor.html
@@ -0,0 +1,189 @@
+
+
+
+
+
+Layout review — PDF to CSV
+
+
+
+
+
+
+
+ visibility
+ Static layout preview of PDF to CSV, shown with two bank-statement PDFs imported and a completed scan (candidate transactions in the editable preview table). All pages →
+
+
+
+
+
+
PDF to CSV
+
+
+
Pull transactions out of bank-statement PDFs into a clean CSV file.
+
+
+
+
+
+ Scan options
+
+
+
+ check
+ Treat (4.50) as negative
+
+
+ check
+ Use OCR for scanned pages
+
+
+
OCR status: ready (bundled Tesseract). Most modern bank PDFs are text-based and don't need OCR — only enable for image-based scans.
+
+
+
+
YYYY-MM-DD (2026-01-13)
+
+
+
+
+
Leave blank for automatic (statement period → filename year → this override).
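The "Treat (4.50) as negative" scan option refers to the accounting convention of writing negative amounts in parentheses. A minimal sketch of such a parser follows; the function name and the set of stripped symbols are assumptions, and `Decimal` is used so that cents are not subject to float rounding.

```python
import re
from decimal import Decimal


def parse_amount(text, parens_negative=True):
    """Parse an amount string; optionally treat '(4.50)' as -4.50."""
    raw = text.strip()
    wrapped = raw.startswith("(") and raw.endswith(")")
    if wrapped:
        raw = raw[1:-1]                    # drop the surrounding parentheses
    cleaned = re.sub(r"[,$\s]", "", raw)   # remove thousands separators, $, spaces
    value = Decimal(cleaned)               # raises decimal.InvalidOperation on junk
    if wrapped and parens_negative:
        value = -abs(value)
    return value
```

With the option unchecked (`parens_negative=False`), `"(4.50)"` parses as a positive 4.50.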
+
+
+
+
+
+
+
+
Files
+ 2 files · 318.4 KB total
+
+
+
+
+
+
+
+ statement-jan-2026.pdf
+ 171.2 KB
+
+
+
+
+ statement-feb-2026.pdf
+ 147.2 KB
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Warnings (1)
+
+
+ warning
+ [statement-feb-2026.pdf] 2 lines matched a date but no amount — skipped (likely a wrapped description). Check the source if a transaction looks missing.
+
+
+
+
+
+
47 candidate transaction(s) from 2 file(s)
+
Uncheck rows to exclude. Edit any cell to fix a value the scanner got wrong. The raw column shows the original PDF text for that row.
+ visibility
+ Static layout preview of Reconcile Two Files, shown with both files imported, key columns mapped, and a completed reconciliation (matched / review / unmatched results). All pages →
+
+
+
+
+
+
Reconcile Two Files
+
+
+
Compare two lists of transactions (e.g. bank vs. ledger) and flag what doesn't match.
+
+
+
+
+
+
+
+
Left (e.g. bank feed)
+
+
+ upload_file Drag and drop file here
+ CSV, TSV, XLSX, XLS
+
+
+
+
+
+ bank_feed_may.csv
+ 214 KB
+
+
bank_feed_may.csv — 1,204 rows, 4 columns
+
+ Preview left (e.g. bank feed)
+
+
+
+
posted_date | description | amount | ref
2026-05-01 | ACME SUPPLIES | -1240.00 | CHK1041
2026-05-02 | PAYROLL RUN | -8800.00 | ACH5520
2026-05-03 | CLIENT GLOBEX | 5200.00 | DEP0090
2026-05-04 | UTILITY CO | -318.42 | CHK1042
+
+
+
+
+
+
+
+
+
Right (e.g. ledger)
+
+
+ upload_file Drag and drop file here
+ CSV, TSV, XLSX, XLS
+
+
+
+
+
+ ledger_may.xlsx
+ 96 KB
+
+
ledger_may.xlsx — 1,198 rows, 5 columns
+
+ Preview right (e.g. ledger)
+
+
+
+
txn_date | memo | value | invoice_no | account
2026-05-01 | Acme Supplies Inc | -1240.00 | INV-1041 | 5000
2026-05-02 | Monthly payroll | -8800.00 | INV-5520 | 6000
2026-05-03 | Globex retainer | 5200.00 | INV-0090 | 4000
2026-05-04 | City Utilities | -318.40 | INV-1042 | 6100
+
+
+
+
+
+
+
+
+
+
+
+
Match settings
+
+
+
+
Left columns
+
posted_date
+
description
+
amount
+
+
ref ✕
+
+
+
+
Right columns
+
txn_date
+
memo
+
value
+
+
invoice_no ✕
+
+
+
+
+
+ Tolerances & options
+
+
+
+
0.0200
+
Absolute tolerance on amount (e.g. 0.01 to absorb cent rounding).
+
+
1
+
Allow N calendar days of drift between posting dates.
+
+
Invert right amount sign
+
Use when one side records debits as positive and the other as negative.
+
+
+
80
+
When both sides have a description column set, accept matches with this minimum fuzzy similarity even if amount/date are merely within tolerance. Lower = more permissive.
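The amount, date, and sign settings can be read as one pairwise check. The sketch below uses the values shown (0.02 amount tolerance, 1 day of drift) and `Decimal` amounts, because a float comparison of -318.42 and -318.40 can land just above 0.02. The function name and row layout are assumptions, and the description-similarity threshold is left out since its scoring method is not shown in this preview.

```python
from datetime import date
from decimal import Decimal


def within_tolerance(left, right, amount_tol=Decimal("0.02"),
                     max_days=1, invert_right=False):
    """True if two rows agree on amount and date within the given tolerances.

    left / right: dicts with a 'date' (datetime.date) and an 'amount' (Decimal).
    """
    right_amount = -right["amount"] if invert_right else right["amount"]
    amount_diff = abs(left["amount"] - right_amount)
    day_diff = abs((left["date"] - right["date"]).days)
    return amount_diff <= amount_tol and day_diff <= max_days
```

Under these settings the UTILITY CO row (0.02 apart) and the OFFICE DEPOT row (one day apart) both pass, which is consistent with their appearing in the matched preview.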
+
+
+
+
+
+
+
+
+
+
+
Results
+
+
Matched
1,173
+
Review
9
+
Unmatched left
22
+
Unmatched right
16
+
+
Coverage: 97.4% of the larger side
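The coverage figure follows from the counts shown: matched pairs divided by the row count of the larger file. A short worked check, with a hypothetical helper name:

```python
def coverage_pct(matched, left_rows, right_rows):
    """Matched pairs as a percentage of the larger side, to one decimal."""
    larger = max(left_rows, right_rows)
    return round(100 * matched / larger, 1) if larger else 0.0


# Counts from the preview: 1,173 matched, 1,204 left rows, 1,198 right rows.
print(coverage_pct(1173, 1204, 1198))  # 97.4

# The result buckets in the preview also add up to each file's row count.
assert 1173 + 9 + 22 == 1204   # matched + review + unmatched left
assert 1173 + 9 + 16 == 1198   # matched + review + unmatched right
```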
+
+
+
+ Matched (1,173)
+ Review (9)
+ Unmatched left (22)
+ Unmatched right (16)
+
+
+
+
Preview of first 25 of 1,173 rows — download the CSV below for the full set.
+
+
+
+
left_posted_date | left_description | left_amount | right_txn_date | right_memo | right_value | amount_diff
2026-05-01 | ACME SUPPLIES | -1240.00 | 2026-05-01 | Acme Supplies Inc | -1240.00 | 0.00
2026-05-02 | PAYROLL RUN | -8800.00 | 2026-05-02 | Monthly payroll | -8800.00 | 0.00
2026-05-03 | CLIENT GLOBEX | 5200.00 | 2026-05-03 | Globex retainer | 5200.00 | 0.00
2026-05-04 | UTILITY CO | -318.42 | 2026-05-04 | City Utilities | -318.40 | 0.02
2026-05-06 | OFFICE DEPOT | -89.15 | 2026-05-07 | Office supplies | -89.15 | 0.00
+
+
+
+
+
+
+ Review (9) — ambiguous candidates
+
+
Pairs flagged because the algorithm couldn't pick a single best match (e.g. multiple equally good candidates). Use the left/right indices to disambiguate manually.
+
+
+
left_idx | left_amount | right_idx | right_value | candidates
118 | -450.00 | 121, 209 | -450.00 | 2 equal
203 | 1000.00 | 198, 244 | 1000.00 | 2 equal
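One way such ambiguous rows can arise is sketched below. It is a deliberately simplified illustration that matches on exact amount only and does not stop two left rows from claiming the same right row; the real algorithm is not shown in this preview, and `classify` is a hypothetical name.

```python
from collections import defaultdict
from decimal import Decimal


def classify(left, right):
    """Split left rows into matched, review, and unmatched.

    left / right: dict of row index -> amount.
    Returns (matched, review, unmatched_left).
    """
    by_amount = defaultdict(list)
    for right_idx, amount in right.items():
        by_amount[amount].append(right_idx)

    matched, review, unmatched = {}, {}, []
    for left_idx, amount in left.items():
        candidates = by_amount.get(amount, [])
        if len(candidates) == 1:
            matched[left_idx] = candidates[0]   # single best candidate
        elif candidates:
            review[left_idx] = candidates       # several equally good candidates
        else:
            unmatched.append(left_idx)          # nothing to pair with
    return matched, review, unmatched
```

With left row 118 at -450.00 and right rows 121 and 209 both at -450.00, row 118 lands in `review` with two candidates, as in the first table row above.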
+
+
+
+
+
+
+
+ Unmatched left (22) — only in bank_feed_may.csv
+
+
Showing all 22 rows.
+
+
+
posted_date | description | amount | ref
2026-05-09 | BANK FEE | -12.00 | FEE0001
2026-05-14 | ATM WITHDRAWAL | -200.00 | ATM7781
+
+
+
+
+
+
+
+ Unmatched right (16) — only in ledger_may.xlsx
+
Static HTML reproductions of every tool page, built from the live app's design tokens for human review of layouts.
+
+
+
+
+ info
+ These are faithful static mockups — not the running Streamlit app. Colors, type scale, spacing, and components are copied verbatim from theme.py and components/_legacy.py. Each page is shown in a representative populated state so the layout can be reviewed end-to-end. Fonts load from Google Fonts (needs network); the chrome (sidebar + footer) is shared across every page.
+