diff --git a/DECISIONS.md b/DECISIONS.md new file mode 100644 index 0000000..07a4fdd --- /dev/null +++ b/DECISIONS.md @@ -0,0 +1,33 @@ +# Product & architecture decisions + +A running log of decisions that aren't obvious from the code and would +otherwise be re-litigated. Newest first. + +## 2026-06-08 — PDF to CSV and Reconcile stay in the bundle, under a "Finance" group + +**Decision:** `10_pdf_extractor` (PDF to CSV) and `11_reconciler` (Reconcile +Two Files) remain part of the DataTools suite. In the sidebar they are +segregated into their own **Finance** section, distinct from the +file-cleaning tools. + +**Context / why this needed deciding:** +- Both tools sit outside the documented 9-script cleaning architecture + (TECHNICAL.md / USER-GUIDE.md stop at the orchestrator). +- They occupy the "reconciliation / manual data-entry" territory the + product's honest-positioning note explicitly placed outside a + file-cleaning tool's scope. +- A journey-level UX review flagged that every extra tool in the main + sidebar raises the "which tool do I need?" load for a non-technical + buyer, so tools serving a different job should live in a clearly + different place. + +**Resolution:** Keep them in-bundle (they're built, useful, and ship +today) but group them under "Finance" so the cleaning flow stays +uncluttered. Revisit only if a separate finance-focused product emerges. + +**Implications:** +- `tools_registry.py`: Reconcile + PDF to CSV carry a `finance` section. +- Sidebar order: Start here → Data Cleaners → Transformations → + Automations → Finance → Coming soon. +- This is the source-of-truth realization of the `layout-review/` + mockups (see `layout-review/shell.js`). diff --git a/layout-review/01_deduplicator.html b/layout-review/01_deduplicator.html index b00c79f..032439c 100644 --- a/layout-review/01_deduplicator.html +++ b/layout-review/01_deduplicator.html @@ -19,29 +19,30 @@

Find Duplicates

- +
+ + + + + + Runs 100% locally + + +

Find rows that repeat, then keep one and remove the extras.

- - -
-
- upload_file Drag and drop file here - Up to 1.5 GB · CSV, TSV, XLSX, XLS · encoding & delimiter auto-detected -
- -
-
- - customers_export.csv - 2.1 MB - + +
+ description + Using customers_export.csv from the upload screen.
+ - +
Comma (,)
@@ -67,32 +68,33 @@
- + +
+
+
85
+
Higher means rows must look more alike to count as a duplicate.
+
+
the most-complete row
+
Which row survives in each group of duplicates.
+
+ +
- Options + Advanced options
-
- Advanced Options -
-
-
-
-
Leave empty to auto-detect
-
-
email
-
-
name
-
-
-
jaro_winkler
-
-
85
-
most-complete
-
-
-
check Merge mode — fill missing fields in the surviving row
+

Leave these empty to auto-detect which columns to compare. Otherwise, list the columns that must match exactly and the ones that only need to match approximately — together these are the columns used to find duplicates.

+
+
+
+
email
+
+
name
-
+
+
jaro_winkler
+
+
+
check Merge mode — fill missing fields in the surviving row
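The "most-complete row" survivor rule and the merge-mode checkbox above can be pictured with a small sketch. This is an illustration only, not the tool's implementation: the row shape (plain dicts), the blank test, and the function names `pick_survivor` and `merge_group` are all assumptions.

```python
from typing import Optional

Row = dict[str, Optional[str]]


def is_blank(value: Optional[str]) -> bool:
    """Treat None and whitespace-only strings as missing (an assumption)."""
    return value is None or value.strip() == ""


def pick_survivor(group: list[Row]) -> Row:
    """Return the row with the most non-blank fields; ties keep the earliest row."""
    return max(group, key=lambda row: sum(not is_blank(v) for v in row.values()))


def merge_group(group: list[Row]) -> Row:
    """Copy the survivor, then fill its blank fields from the other rows in order."""
    survivor = dict(pick_survivor(group))
    for row in group:
        for column, value in row.items():
            if is_blank(survivor.get(column)) and not is_blank(value):
                survivor[column] = value
    return survivor


# Hypothetical duplicate group: each row is missing a different field.
group = [
    {"email": "ann@example.com", "name": "Ann Lee", "phone": ""},
    {"email": "ann@example.com", "name": "", "phone": "555-0100"},
]
merged = merge_group(group)
```

With merge mode off, only `pick_survivor` would apply and the phone number in the second row would be lost, which is the trade-off the checkbox describes.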
@@ -109,8 +111,9 @@
Match groups
147
Rows kept
18,130
+

Preview of an auto-resolved run: each group keeps its auto-picked survivor. Review the groups below to override any pending picks before the final download.

- +
@@ -123,6 +126,7 @@ +

Differing columns are highlighted. The survivor row is kept; uncheck a row to split it out of the group.

@@ -140,7 +144,6 @@
-

Differing columns highlighted. The survivor row is kept; uncheck rows to split the group.

@@ -163,8 +166,8 @@ -

Decisions: 1 merged, 1 pending

- +

Decisions: 1 merged, 1 pending · Pending groups keep their auto-picked survivor unless you review them.

+
@@ -178,6 +181,8 @@
+
arrow_forwardDuplicates handled — your file is cleaned. Review the result or Back to Start here →
+ diff --git a/layout-review/02_text_cleaner.html b/layout-review/02_text_cleaner.html index 6f49c95..7d3b828 100644 --- a/layout-review/02_text_cleaner.html +++ b/layout-review/02_text_cleaner.html @@ -27,34 +27,39 @@

Clean Text

- +
+ + + + + + Runs 100% locally + + +

Trim extra spaces and strip out odd characters.

- - -
-
- upload_file Drag and drop file here - Up to 1.5 GB · CSV, TSV, XLSX, XLS · encoding auto-detected -
- -
-
- - contacts_messy.csv - 684 KB - + +
+ description + Using contacts_messy.csv from the upload screen.
+
Preview: contacts_messy.csv

4,120 rows, 4 columns

-
check Show hidden characters in preview
+
check Show hidden characters
+
+ · Whitespace + Smart / special + Control +
@@ -82,7 +87,11 @@ minimal paranoid -
excel-hygiene: trim, collapse whitespace, fold smart quotes, strip invisible chars, normalize line endings, NFC.
+
+ minimal: trim and collapse whitespace only — no character substitutions.
+ excel-hygiene: trim, collapse whitespace, fold smart quotes, strip invisible chars, normalize line endings, and normalize accented characters.
+ paranoid: everything in excel-hygiene plus strip control characters, strip BOM, and normalize accented and look-alike characters (lossy). +
@@ -99,8 +108,8 @@
check Fold smart characters (curly quotes, em-dash, NBSP)
check Strip zero-width / invisible characters
-
check Unicode NFC normalization
-
Unicode NFKC compat fold (lossy: ① → 1, ﬁ → fi)
+
check Normalize accented characters (NFC)
+
Normalize accented and look-alike characters (lossy: ① → 1, ﬁ → fi)
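The difference between the two normalization checkboxes above (NFC versus the lossy NFKC fold) can be seen directly with Python's standard `unicodedata` module. The sample strings are made up for illustration; whether the tool itself calls `unicodedata` is an assumption.

```python
import unicodedata

# "e" followed by a combining acute accent (two code points).
decomposed = "Cafe\u0301"
# NFC recomposes it into the single precomposed character U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

# U+2460 is the circled digit one, U+FB01 is the "fi" ligature.
sample = "\u2460 \ufb01le"
# NFC leaves compatibility characters untouched.
nfc_sample = unicodedata.normalize("NFC", sample)
# NFKC folds them to plain look-alikes, which cannot be reversed afterwards.
nfkc_sample = unicodedata.normalize("NFKC", sample)
```

NFC only changes how accents are encoded, so it is safe to leave on by default. NFKC replaces characters with different ones, which is why the mockup labels it lossy.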
@@ -143,17 +152,20 @@
Columns processed
4
-
check Show hidden characters (NBSP, ZWSP, smart quotes, control chars…)
+
+
check Show hidden characters (NBSP, ZWSP, smart quotes, control chars…)
+
Same setting as “Show hidden characters” in the preview above — toggling either updates both.
+

Changes by column

nameemailcompanynotes
- + - - - - + + + +
cells_changed
columncells_changed
company1,604
name1,210
notes982
email151
company1,604
name1,210
notes982
email151
@@ -199,6 +211,9 @@
+ +
arrow_forwardText cleaned. Next, most files need: Standardize Formats →
+
diff --git a/layout-review/03_format_standardizer.html b/layout-review/03_format_standardizer.html index a0bff95..d4c9f8d 100644 --- a/layout-review/03_format_standardizer.html +++ b/layout-review/03_format_standardizer.html @@ -19,7 +19,16 @@

Standardize Formats

- +
+ + + + + + Runs 100% locally + + +

Make dates, phones, currency, and names look the same throughout.

@@ -76,18 +85,23 @@

Format options

- +
- US (default) — ISO 8601 dates · E.164 phones · USD - European — DMY input · INTL phones · EUR comma decimal + US (default) — ISO 8601 dates · international-format phones (+1…) · USD + European — DMY input · INTL phones · EUR comma decimal base UK — DD/MM/YYYY · GB phones · Yes/No booleans ISO Strict — ISO 8601 · bare-number currency · true/false Legacy US — MM/DD/YYYY · National phones · Yes/No - Custom — keep current settings + Custom — based on European, 2 controls changed modified
-
Pick a published standard or regional convention as the baseline. Every option below is still individually overridable.
+
+ rule + Individual controls win over the preset. You started from European, then changed Ambiguous input order and Decimal separator below — so the preset is now Custom. The controls' current values are what actually run. +
+
Pick a published standard or regional convention as the baseline. Every option below is still individually overridable; overriding any one switches the preset to Custom.
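One way to read the "overriding any one switches the preset to Custom" rule is as a comparison between the base preset's defaults and the controls' current values. The sketch below assumes hypothetical control keys (`ambiguous_order`, `decimal_separator`) and a hypothetical `describe_preset` helper; only the European defaults (DMY, comma) come from the mockup.

```python
# Defaults per preset; only the European values are taken from the mockup text.
PRESETS = {
    "European": {"ambiguous_order": "DMY", "decimal_separator": "comma"},
}


def describe_preset(base: str, controls: dict[str, str]) -> str:
    """Return the base name, or a Custom label naming how many controls differ."""
    defaults = PRESETS[base]
    changed = [key for key, value in defaults.items() if controls.get(key, value) != value]
    if not changed:
        return base
    noun = "control" if len(changed) == 1 else "controls"
    return f"Custom (based on {base}, {len(changed)} {noun} changed)"


untouched = describe_preset("European", {"ambiguous_order": "DMY", "decimal_separator": "comma"})
edited = describe_preset("European", {"ambiguous_order": "MDY", "decimal_separator": "dot"})
```

Deriving the label from the controls, rather than storing it separately, keeps the label and the values that actually run from drifting apart.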
@@ -97,15 +111,16 @@

Dates

YYYY-MM-DD (ISO)
- +
MDY (US) DMY (EU)
+
Winning value: MDY. Overrides the European base (DMY) — 01/02/2024 reads as 2024-01-02.

Phones

-
E.164 (+15551234567)
+
Standard international format (+15551234567)
US
@@ -117,11 +132,12 @@

Currency

- +
dot (1,234.56) comma (1.234,56)
+
Winning value: dot. Overrides the European base (comma) — $1,234.5 reads as 1234.50.
2
Preserve original precision (don't round)
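The two "Winning value" notes above describe how an ambiguous date and a decimal separator are read. A minimal sketch of that reading, assuming hypothetical helper names and only the two orders and two separators shown in the mockup:

```python
from datetime import date, datetime
from decimal import Decimal


def parse_ambiguous_date(text: str, order: str) -> date:
    """Read a slash-separated date as month-first (MDY) or day-first (DMY)."""
    fmt = {"MDY": "%m/%d/%Y", "DMY": "%d/%m/%Y"}[order]
    return datetime.strptime(text, fmt).date()


def parse_amount(text: str, decimal_separator: str) -> Decimal:
    """Strip a leading currency symbol and grouping marks, then parse the number."""
    digits = text.strip().lstrip("$€")
    if decimal_separator == "dot":
        digits = digits.replace(",", "")                     # 1,234.5 -> 1234.5
    else:
        digits = digits.replace(".", "").replace(",", ".")  # 1.234,56 -> 1234.56
    return Decimal(digits)


as_mdy = parse_ambiguous_date("01/02/2024", "MDY").isoformat()
as_dmy = parse_ambiguous_date("01/02/2024", "DMY").isoformat()
amount = f"{parse_amount('$1,234.5', 'dot'):.2f}"
```

The same input string yields January 2 under MDY and February 1 under DMY, which is why the mockup spells out the winning value next to the control. Negative amounts and other currency symbols are left out of this sketch.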
@@ -154,9 +170,30 @@
info - 47 cell(s) in typed columns didn't match a recognizable shape and were left as-is. Check the changes audit below to find them, or re-classify the column to (skip). + 47 cell(s) in typed columns didn't match a recognizable shape and were left as-is. See Unparseable cells below to review them, or re-classify the column to (skip). (They aren't in the changes audit — nothing was changed.)
+ +
+ Unparseable cells (47) +
+

Cells in typed columns that didn't match a recognizable shape and were left unchanged.

+
+ + + + + + + + + +
rowcolumnfield_typevalue (left as-is)
318signup_datedatesoon
902phonephoneext. 4471
1,544amountcurrencyTBD
2,087activebooleanmaybe
3,610signup_datedate00/00/0000
+
+

… and 42 more.

+
+
+

Changes by column

@@ -194,6 +231,7 @@

Standardized preview (first 10 rows)

+

Showing 5 of 6 columns — notes is set to (skip), so it's omitted here.

@@ -215,6 +253,9 @@ + +
arrow_forwardFormats standardized. Next, most files need: Fix Missing Values →
+ diff --git a/layout-review/04_missing_handler.html b/layout-review/04_missing_handler.html index 8475a01..e17ef3c 100644 --- a/layout-review/04_missing_handler.html +++ b/layout-review/04_missing_handler.html @@ -19,28 +19,27 @@

Fix Missing Values

- +
+ + + + + + Runs 100% locally + + +

Find blank cells (even hidden ones) and fill them in or remove them.

- -

Tip: files imported on the Home screen are picked up here automatically.

- -
-
- upload_file Drag and drop file here - Up to 1.5 GB · CSV, TSV, XLSX, XLS -
- -
-
- - survey_responses.csv - 684 KB - + +
+ description + Using survey_responses.csv from the upload screen.
+
@@ -63,39 +62,44 @@
- -
+ +

Missingness profile

+
+
Rows
2,150
+
Cells missing
1,043
+
% cells missing
8.1%
+
Complete rows
1,388
+
+
+
full_namephoneamountsignup_dateactive
+ + + + + + + + + +
columndtypemissingmissing_pctdisguisedhas_missing
respondent_idobject00.0%0False
agefloat641878.7%61True
regionobject1426.6%142True
incomefloat6432915.3%118True
satisfactionfloat64954.4%40True
commentsobject29013.5%290True
+
+ +
+ + +
Options
-

Missingness profile

-
-
Rows
2,150
-
Cells missing
1,043
-
% cells missing
8.1%
-
Complete rows
1,388
-
- -
- - - - - - - - - - -
columndtypemissingmissing_pctdisguisedhas_missing
respondent_idobject00.0%0False
agefloat641878.7%61True
regionobject1426.6%142True
incomefloat6432915.3%118True
satisfactionfloat64954.4%40True
commentsobject29013.5%290True
-
- -
-

Strategy

+
+ layers + Resolution order: per-column overrideglobal strategypreset. The most specific setting wins; layers it overrides are dimmed. +
-
+
info Overridden by Global strategy → median (set under Advanced options). Presets apply only when global is “(use preset)”.
+
detect-only (standardize sentinels to NaN, no fill or drop) safe-fill (numeric → median, categorical → mode) drop-incomplete (drop any row with missing) @@ -112,16 +116,16 @@

Detection

check Standardize disguised nulls to NaN
- +
N/A, n/a, NA, NULL, null, None, -, --, ?, #N/A
-
Matched case-insensitively after stripping whitespace.
+
Text that really means “empty.” Matched case-insensitively after stripping whitespace.

Strategy override

-
(use preset)
+
median
drop_row / drop_col use the thresholds below. mean / median / interpolate are numeric only — non-numeric columns fall back to the categorical strategy.
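The disguised-null detection described above ("matched case-insensitively after stripping whitespace") could look roughly like this. The sentinel list is copied from the mockup; the function names are hypothetical, and `None` stands in for NaN so the sketch needs no third-party library.

```python
SENTINELS = ["N/A", "n/a", "NA", "NULL", "null", "None", "-", "--", "?", "#N/A"]

# Normalize once so lookups ignore case and surrounding whitespace.
_SENTINEL_KEYS = {s.strip().casefold() for s in SENTINELS}


def is_disguised_null(value: object) -> bool:
    """True when a text cell is one of the sentinels that really mean 'empty'."""
    if not isinstance(value, str):
        return False
    return value.strip().casefold() in _SENTINEL_KEYS


def standardize(values: list) -> list:
    """Replace disguised nulls with None; leave every other cell unchanged."""
    return [None if is_disguised_null(v) else v for v in values]


cleaned = standardize([" n/a ", "Null", "42", "--", "ok", 7])
```

Because matching is case-insensitive, listing both `N/A` and `n/a` is redundant for detection, though it may still help users recognize their own data in the list.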
@@ -135,11 +139,11 @@
-
1.00
+
1.00
-
1.00
+
1.00
@@ -164,13 +168,13 @@

Set a different strategy for specific columns. Leave any row blank to use the global strategy.

- + - - - - - + + + + +
ColumnOverride
ColumnOverrideResolves to
agemedian
regionmode
income
satisfaction
commentsconstant
age(global)median · global
region(global)mode · global → categorical fallback
income(global)median · global
satisfaction(global)median · global
commentsconstantconstant · this column
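The resolution order (per-column override, then global strategy, then preset) together with the numeric-only fallback can be sketched as a single function. The function name, the use of `None` for "(use preset)" and "(global)", and the default fallback of `mode` are assumptions made for illustration.

```python
from typing import Optional

# Strategies the mockup describes as numeric only.
NUMERIC_ONLY = {"mean", "median", "interpolate"}


def resolve_strategy(
    column_override: Optional[str],
    global_strategy: Optional[str],
    preset_strategy: str,
    is_numeric: bool,
    categorical_fallback: str = "mode",
) -> tuple[str, str]:
    """Return (strategy, layer that supplied it); the most specific layer wins."""
    if column_override:
        chosen, layer = column_override, "this column"
    elif global_strategy:
        chosen, layer = global_strategy, "global"
    else:
        chosen, layer = preset_strategy, "preset"
    # Numeric-only strategies cannot run on text columns, so fall back.
    if chosen in NUMERIC_ONLY and not is_numeric:
        return categorical_fallback, f"{layer} → categorical fallback"
    return chosen, layer


age = resolve_strategy(None, "median", "safe-fill", is_numeric=True)
region = resolve_strategy(None, "median", "safe-fill", is_numeric=False)
comments = resolve_strategy("constant", "median", "safe-fill", is_numeric=False)
```

These three calls reproduce the "Resolves to" column for `age`, `region`, and `comments` in the table above.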
@@ -198,28 +202,14 @@

Missingness — before vs. after

- + - - - - - - - -
columnbefore_missingbefore_pctafter_missingafter_pct
columnbefore_missingbefore_pctafter_missingafter_pctstrategy
respondent_id00.000.0
age1878.700.0
region1426.600.0
income32915.300.0
satisfaction954.400.0
comments29013.500.0
-
- -

Strategy applied per column

-
- - - - - - - - + + + + + +
columnstrategy
agemedian
regionmode
incomemedian
satisfactionmedian
commentsconstant
respondent_id00.000.0
age1878.700.0median
region1426.600.0mode
income32915.300.0median
satisfaction954.400.0median
comments29013.500.0constant
@@ -262,6 +252,8 @@
+
arrow_forwardMissing values handled. Next, most files need: Find Duplicates →
+
diff --git a/layout-review/05_column_mapper.html b/layout-review/05_column_mapper.html index c0c2a02..106f744 100644 --- a/layout-review/05_column_mapper.html +++ b/layout-review/05_column_mapper.html @@ -19,28 +19,27 @@

Map Columns

- +
+ + + + + + Runs 100% locally + + +

Rename columns, change their order, and set each one as text, number, or date.

- -

You can also import a file on the home screen and pick it up here.

- -
-
- upload_file Drag and drop file here - Up to 1.5 GB · CSV, TSV, XLSX, XLS · encoding & delimiter auto-detected -
- -
-
- - crm_contacts_raw.csv - 684 KB - + +
+ description + Using crm_contacts_raw.csv from the upload screen.
+
@@ -75,9 +74,9 @@
Build interactively (start from current columns) Import schema JSON - Skip (rename / coerce only — no schema) + Skip (rename / convert types only — no schema)
-
An interactive build is fastest for one-off cleanup. Import a JSON when you have a fixed contract (a CRM import format, db schema). Skip when you only want to rename or coerce specific columns.
+
An interactive build is fastest for one-off cleanup. Import a JSON when you have a fixed contract (a CRM import format, db schema). Skip when you only want to rename or convert the type of specific columns.

Edit the table to define your target schema. Add rows for fields the input doesn't have yet (with a default), or remove rows for columns you want to drop.

@@ -93,7 +92,7 @@ signup_datedate✗Signup amount_spentfloat✗0.0Amount Spent sourcestring✗crm-import - add add row + add add row
@@ -101,43 +100,8 @@
- -

Strategy

-
- -
- rename-only (just rename, leave types alone, keep extras) - lenient-schema (rename + coerce + reorder, keep extras) - strict-schema (rename + coerce + reorder, drop extras) -
-
- - -
- Advanced options -
-
-
-
- -
keep
-
-
check Coerce types per schema
-
check Reorder to schema order
-
-
-
check Auto-infer mapping (fuzzy match)
-
- -
0.80
-
-
check Enforce required fields
-
-
-
-
- +

Mapping

@@ -153,7 +117,53 @@
-

Pick a target for each source column. Notes stays unmapped — with the lenient preset it is kept as-is. source is added from the schema default.

+

Pick a target for each source column. Notes stays unmapped — with the keep-extras strategy it is kept as-is. source is added from the schema default.

+ +
+ + + +

Strategy

+
+ +
+ rename-only (just rename, leave types alone, keep extras) + lenient-schema (rename + convert types + reorder, keep extras) + strict-schema (rename + convert types + reorder, drop extras) base + Custom — based on strict-schema, 1 control changed modified +
+
+ rule + Individual Advanced controls win over the preset. You started from strict-schema, then changed Unmapped source columns to keep below — so the preset is now Custom. The controls' current values are what actually run. +
+
Pick a strategy as the baseline. Every Advanced toggle below is still individually overridable; overriding any one switches the preset to Custom.
+
+ + +
+ Advanced options +
+
+
+
+ +
keep
+
Winning value: keep. Overrides the strict-schema base (drop) — so Notes survives into the output.
+
+
check Convert each column to the right type
+
check Reorder to schema order
+
+
+
check Auto-infer mapping (fuzzy match)
+
+ +
0.80
+
+
check Enforce required fields
+
+
+
+
@@ -176,20 +186,6 @@
infoAdded (with defaults): source
warningSome cells could not be coerced and were left as NaN: amount_spent (3)
-

Resolved mapping

-
- - - - - - - - - -
sourcetargetauto
Full Namefull_nameTrue
EmailAddremailTrue
Phone #phoneTrue
Signupsignup_dateTrue
Amount Spentamount_spentTrue
-
-

Mapped preview (first 10 rows)

@@ -213,6 +209,9 @@ + +
arrow_forwardColumns mapped. Run the recommended clean →
+ diff --git a/layout-review/06_outlier_detector.html b/layout-review/06_outlier_detector.html index 546a847..3f9e164 100644 --- a/layout-review/06_outlier_detector.html +++ b/layout-review/06_outlier_detector.html @@ -12,7 +12,7 @@
visibility - Static layout preview of Find Unusual Values — a Coming Soon tool. The page is a stub/teaser: an "under development" notice, a list of planned features, and disabled placeholder controls (only the file uploader is live). All pages → + Static layout preview of Find Unusual Values — a Coming Soon tool. The page is a stub: a "coming soon" notice, a plain-English list of what the tool will do, and a single "Notify me" action. All pages →
@@ -25,62 +25,26 @@
- +
info - This tool is under development. + This tool is coming soon.
- -

Features:

+ +

What it will do:

    -
  • Z-score detection (configurable threshold)
  • -
  • IQR (interquartile range) detection
  • -
  • MAD (median absolute deviation) detection
  • -
  • Domain-rule violations (e.g., age < 0, price > $1M)
  • -
  • Visual outlier highlighting in data preview
  • -
  • Handling: flag only, remove, cap/winsorize to bounds
  • +
  • Find values that are unusually high or low for a column
  • +
  • Spot values that break the rules you set (out of range, wrong type)
  • +
  • Choose how sensitive the check is
  • +
  • Flag unusual rows by adding a column, without changing your data
  • +
  • Cap extreme values at a limit you choose
  • +
  • See a summary of how many values were flagged

- - -
-
- upload_file Drag and drop file here - CSV, TSV, XLSX, XLS · Import a file to preview. Processing is not yet available. -
- -
- - -

Detection Method

- -
- -
Z-Score
-
- -
- -
3.0
-
- -
- -
1.5
-
- -

Handling

- -
- -
Flag only (add column)
-
- -
- +
diff --git a/layout-review/07_multi_file_merger.html b/layout-review/07_multi_file_merger.html index ede9b11..aa42434 100644 --- a/layout-review/07_multi_file_merger.html +++ b/layout-review/07_multi_file_merger.html @@ -12,7 +12,7 @@
visibility - Static layout preview of Combine Files — a Coming-Soon tool. The page is a stub: an "under development" notice, a planned-features list, a working multi-file uploader, and disabled placeholder options. All pages → + Static layout preview of Combine Files — a Coming Soon tool. The page is a stub: a "coming soon" notice, a plain-English list of what the tool will do, and a single "Notify me" action. All pages →
@@ -23,56 +23,28 @@

Combine several CSV or Excel files into one — even if columns differ.

- +
+ +
info - This tool is under development. + This tool is coming soon.
- -

Features:

-
    -
  • Import multiple CSV/Excel files at once
  • -
  • Automatic schema alignment (matching columns by name)
  • -
  • Append mode: stack files vertically (union)
  • -
  • Join mode: merge files on shared key columns
  • -
  • Handle mismatched columns (fill missing with nulls or drop)
  • -
  • Source file tracking column
  • + +

    What it will do:

    +
      +
    • Import several CSV or Excel files at once
    • +
    • Line up columns automatically by matching their names
    • +
    • Stack files on top of each other into one long file
    • +
    • Merge files side by side using shared key columns
    • +
    • Handle columns that don't match (fill the gaps with blanks or drop them)
    • +
    • Add a column showing which file each row came from

    - - -
    -
    - upload_file Drag and drop files here - CSV, TSV, XLSX, XLS · multiple files allowed -
    - -
    -
    Import multiple files to preview. Processing is not yet available.
    - - -

    Merge Strategy

    - -
    - -
    Append (stack vertically)
    -
    - -
    - -
    Fill with null
    -
    - -
    - check Add source filename column -
    - -
    - - +
diff --git a/layout-review/08_validator_reporter.html b/layout-review/08_validator_reporter.html index d255430..4dafd96 100644 --- a/layout-review/08_validator_reporter.html +++ b/layout-review/08_validator_reporter.html @@ -12,7 +12,7 @@
visibility - Static layout preview of Quality Check, a Coming-Soon tool. The page is a stub: an "under development" notice, a feature list, a working file uploader, and disabled placeholder controls. All pages → + Static layout preview of Quality Check — a Coming Soon tool. The page is a stub: a "coming soon" notice, a plain-English list of what the tool will do, and a single "Notify me" action. All pages →
@@ -25,64 +25,26 @@
- +
info - This tool is under development. + This tool is coming soon.
- -

Features:

+ +

What it will do:

    -
  • Column-level validation rules (not null, unique, regex pattern, range, enum)
  • -
  • Cross-column validation (e.g., start_date < end_date)
  • -
  • Data quality score per column and overall
  • -
  • Generate PDF quality report
  • -
  • Generate Excel report with flagged rows highlighted
  • -
  • Summary dashboard: pass/fail counts, severity breakdown
  • +
  • Check each column against rules you set (no blanks, no duplicates, matches a pattern, within a range, from a set list)
  • +
  • Check rules across columns (for example, start date is before end date)
  • +
  • Give each column and the whole file a quality score
  • +
  • Export a PDF quality report
  • +
  • Export an Excel report with the problem rows highlighted
  • +
  • Show a summary of what passed, what failed, and how serious each issue is

- - -
-
- upload_file Drag and drop file here - Import a file to preview. Processing is not yet available. -
- -
- - -

Validation Rules

- - -
-
- upload_file Drag and drop file here - JSON -
- -
- -
- -
- Choose options -
-
- -

Report Format

- -
- -
Excel (flagged rows)
-
- -
- - +
diff --git a/layout-review/09_pipeline_runner.html b/layout-review/09_pipeline_runner.html index 63426bd..22256d2 100644 --- a/layout-review/09_pipeline_runner.html +++ b/layout-review/09_pipeline_runner.html @@ -19,7 +19,16 @@

Automated Workflows

- +
+ + + + + + Runs 100% locally + + +

Run several tools in a row — save the steps once, reuse them anytime.

@@ -67,69 +76,192 @@ Options
- +
- Use the recommended default (text-clean → format → missing → dedup) - Build interactively + Use the recommended default (Clean Text → Standardize → Fix Missing → Find Duplicates) · modified + Build interactively Import a saved pipeline JSON
+
+ edit + You started from the recommended default and edited a step, so the mode switched to Build interactively. The steps below are now yours to change — pick recommended default again to discard your edits and restore the suggested order. +
+

- Edit the table to add, remove, reorder (drag the row index), enable, or configure each step. + Add, remove, reorder (drag the row index), enable, or configure each step. + Open a step's Configure panel to set its options in plain language. Tool order is recommended, not enforced — violations surface as warnings below the table.

- +
- - - + + + - + - - - - - - - - - - - - - - - - - - - - - - - +
ToolEnabledOptions (JSON)StepEnabledConfigure
≡ 0text_clean expand_more
Clean Text
Trim spaces, collapse repeats, leave case as-is
check{"trim": true, "collapse_whitespace": true}
≡ 1format_standardize expand_morecheck{"column_types": {"phone": "phone", "signup_date": "date"}}
≡ 2missing expand_morecheck{"strategy": "flag", "sentinels": ["N/A", "—"]}
≡ 3dedup expand_morecheck{"survivor_rule": "most_complete", "merge": true}
Add rowtune Configure expand_more
+ +
+ Configure: Clean Text +
+
check Trim leading & trailing whitespace
+
check Collapse repeated spaces to one
+
Normalize smart quotes & dashes to plain ASCII
+
+ +
Leave as-is
+
+
+
+ +
+ + + + + + + + + +
≡ 1
Standardize Formats
Format phone as phone, signup_date as a date
checktune Configure chevron_right
+
+ +
+ Configure: Standardize Formats +
+

Choose a target format for each column. Columns left as “Leave as-is” are untouched.

+
+ + + + + + + + +
ColumnFormat as
nameLeave as-is
emailLeave as-is
phonePhone number
signup_dateDate
+
+
+
+ +
+ + + + + + + + + +
≡ 2
Fix Missing Values
Flag blank cells (treat “N/A” and “—” as blank)
checktune Configure chevron_right
+
+ +
+ Configure: Fix Missing Values +
+
+ +
+ Flag them (mark blanks, change nothing) + Fill them in (numbers → median, text → most common) + Drop rows that have any blank +
+
+
+ +
N/A, —
+
Matched case-insensitively after stripping whitespace.
+
+
+
+ +
+ + + + + + + + + + + + + +
≡ 3
Find Duplicates
Match on email & phone; keep the most complete row, merge in missing fields
checktune Configure chevron_right
Add step
+
+ +
+ Configure: Find Duplicates +
+
+ +
Keep the most complete row
+
Other options: keep the first seen, keep the last seen.
+
+
check Merge matched rows (fill each survivor's blanks from its duplicates)
+
+ +
+ email + phone +
+
+
+
+ +
+ Advanced — import / export pipeline as JSON +
+

For sharing or version control. Editing is done in the step panels above — this is just the saved form of the same settings.

+
{ + "version": 1, + "steps": [ + {"tool": "text_clean", "enabled": true, "options": {"trim": true, "collapse_whitespace": true}}, + {"tool": "format_standardize", "enabled": true, "options": {"column_types": {"phone": "phone", "signup_date": "date"}}}, + {"tool": "missing", "enabled": true, "options": {"strategy": "flag", "sentinels": ["N/A", "—"]}}, + {"tool": "dedup", "enabled": true, "options": {"survivor_rule": "most_complete", "merge": true, "keys": ["email", "phone"]}} + ] +}
+
+ + +
+
+
+ -
+
Recommended tool order — why each step belongs where it does

text_clean before format_standardize — format parsers (phone / currency / date) fail on smart-quote-contaminated or NBSP-padded input — clean text first
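Since tool order is "recommended, not enforced", a check over the saved pipeline JSON only needs to produce warnings. The sketch below is one possible shape for that check: the `order_warnings` name and the warning wording are hypothetical, while the tool identifiers and the recommended sequence come from the mockup.

```python
import json

RECOMMENDED_ORDER = ["text_clean", "format_standardize", "missing", "dedup"]


def order_warnings(pipeline: dict) -> list[str]:
    """List every pair of enabled steps that runs against the recommended order."""
    enabled = [s["tool"] for s in pipeline["steps"] if s.get("enabled", True)]
    rank = {tool: i for i, tool in enumerate(RECOMMENDED_ORDER)}
    warnings = []
    for position, tool in enumerate(enabled):
        for later in enabled[position + 1:]:
            if tool in rank and later in rank and rank[later] < rank[tool]:
                warnings.append(f"{later} is recommended before {tool}")
    return warnings


# A deliberately mis-ordered pipeline in the saved JSON form.
pipeline = json.loads("""
{
  "version": 1,
  "steps": [
    {"tool": "format_standardize", "enabled": true, "options": {}},
    {"tool": "text_clean", "enabled": true, "options": {"trim": true}},
    {"tool": "dedup", "enabled": true, "options": {"merge": true}}
  ]
}
""")
found = order_warnings(pipeline)
```

Returning messages instead of raising keeps the run unblocked, which matches the "violations surface as warnings below the table" behavior.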

@@ -161,39 +293,49 @@

Per-step summary

+
- + - - - + + - - - + + + + + + - - - + + - - - + +
stepstatuselapsed_mssummaryerror
stepstatuselapsedsummary
text_clean ok214{"cells_changed": 1204, "columns": ["name", "city"]}214 ms1,204 cells changed in name & city
format_standardizeok388{"phone": 18301, "signup_date": 17996}warning ok · 141 skipped388 ms18,301 phones and 17,996 dates standardized
+ info + 141 phone values didn't match any known pattern and were left unchanged. The step still completed — review them in the output preview if needed. +
missing ok121{"flagged_cells": 642, "sentinels_found": ["—"]}121 ms642 blank cells flagged (sentinel “—”)
dedup ok911{"input_rows": 18442, "output_rows": 18130, "duplicates_removed": 312, "groups": 147}911 ms312 duplicates removed across 147 groups (18,442 → 18,130 rows)
diff --git a/layout-review/10_pdf_extractor.html b/layout-review/10_pdf_extractor.html index 3d457b1..93a272b 100644 --- a/layout-review/10_pdf_extractor.html +++ b/layout-review/10_pdf_extractor.html @@ -19,7 +19,16 @@

PDF to CSV

- +
+ + + + + + Runs 100% locally + + +

Pull transactions out of bank-statement PDFs into a clean CSV file.

@@ -74,7 +83,7 @@ statement-feb-2026.pdf 147.2 KB
-
@@ -100,84 +109,89 @@

47 candidate transaction(s) from 2 file(s)

-

Uncheck rows to exclude. Edit any cell to fix a value the scanner got wrong. The raw column shows the original PDF text for that row.

+

Uncheck rows to exclude. Edit any cell to fix a value the scanner got wrong. Hover the info on any row to see the original PDF text it came from.

-
+ +
+ - - - + + - + + - + + - + + - + + - + + - + + - + + - + +
Include date description amount_debit amount_credit account_number source_filepageraw
check2026-01-03OPENING BALANCE****4821statement-jan-2026.pdf101/03 OPENING BALANCE 2,140.55info2026-01-03OPENING BALANCE****4821statement-jan-2026.pdf
check2026-01-05POS PURCHASE WHOLE FOODS MKT84.12****4821statement-jan-2026.pdf101/05 POS PURCHASE WHOLE FOODS MKT (84.12)info2026-01-05POS PURCHASE WHOLE FOODS MKT84.12****4821statement-jan-2026.pdf
check2026-01-08ACH DEPOSIT PAYROLL ACME CORP3,250.00****4821statement-jan-2026.pdf101/08 ACH DEPOSIT PAYROLL ACME CORP 3,250.00info2026-01-08ACH DEPOSIT PAYROLL ACME CORP3,250.00****4821statement-jan-2026.pdf
check2026-01-11ONLINE TRANSFER TO SAVINGS500.00****4821statement-jan-2026.pdf201/11 ONLINE TRANSFER TO SAVINGS (500.00)info2026-01-11ONLINE TRANSFER TO SAVINGS500.00****4821statement-jan-2026.pdf
2026-01-12INTEREST RATE 0.50% APY DETAIL****4821statement-jan-2026.pdf201/12 INTEREST RATE 0.50% APY 0.00info2026-01-12INTEREST RATE 0.50% APY DETAIL auto-excluded · not a transaction line****4821statement-jan-2026.pdf
check2026-01-14DEBIT CARD SHELL OIL #228752.40****4821statement-jan-2026.pdf201/14 DEBIT CARD SHELL OIL #2287 (52.40)info2026-01-14DEBIT CARD SHELL OIL #228752.40****4821statement-jan-2026.pdf
check2026-02-02POS PURCHASE TRADER JOES #51161.88****4821statement-feb-2026.pdf102/02 POS PURCHASE TRADER JOES #511 (61.88)info2026-02-02POS PURCHASE TRADER JOES #51161.88****4821statement-feb-2026.pdf
check2026-02-06ACH DEPOSIT PAYROLL ACME CORP3,250.00****4821statement-feb-2026.pdf202/06 ACH DEPOSIT PAYROLL ACME CORP 3,250.00info2026-02-06ACH DEPOSIT PAYROLL ACME CORP3,250.00****4821statement-feb-2026.pdf
check2026-02-09CHECK #10431,200.00****4821statement-feb-2026.pdf202/09 CHECK #1043 (1,200.00)info2026-02-09CHECK #10431,200.00****4821statement-feb-2026.pdf
- -
-
- -

46 of 47 rows selected.

-
-
-
- -
- date - description - amount_debit - amount_credit - account_number - source_file -
-
page and raw are kept off by default; tick them if you want them in the file.
+ +
+
+ +
+ date + description + amount_debit + amount_credit + account_number + source_file
+
page and raw are kept off by default; tick them if you want them in the file.
+ +

1 row excluded (INTEREST RATE detail line).

diff --git a/layout-review/11_reconciler.html b/layout-review/11_reconciler.html index 311837e..b543f1e 100644 --- a/layout-review/11_reconciler.html +++ b/layout-review/11_reconciler.html @@ -19,7 +19,16 @@

Reconcile Two Files

- +
+ + + + + + Runs 100% locally + + +

Compare two lists of transactions (e.g. bank vs. ledger) and flag what doesn't match.

@@ -30,18 +39,11 @@

Left (e.g. bank feed)

-
-
- upload_file Drag and drop file here - CSV, TSV, XLSX, XLS -
- -
-
- - bank_feed_may.csv - 214 KB +
+ description + Using bank_feed_may.csv from the upload screen.
+

bank_feed_may.csv — 1,204 rows, 4 columns

Preview left (e.g. bank feed) @@ -63,18 +65,11 @@

Right (e.g. ledger)

-
-
- upload_file Drag and drop file here - CSV, TSV, XLSX, XLS -
- -
-
- - ledger_may.xlsx - 96 KB +
+ description + Using ledger_may.xlsx from the upload screen.
+

ledger_may.xlsx — 1,198 rows, 5 columns

Preview right (e.g. ledger) @@ -105,7 +100,7 @@

Left columns

posted_date
description
-
amount
+
amount
ref
@@ -114,9 +109,10 @@

Right columns

txn_date
memo
-
value
+
value
-
invoice_no
+
invoice_no
+
check_circle 1 reference each side — counts match
@@ -132,7 +128,7 @@
1
Allow N calendar days of drift between posting dates.
-
Invert right amount sign
+
Use when one side records debits as positive and the other as negative.
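The date-drift and sign-inversion options above amount to a pairwise test between a left row and a right row. A minimal sketch, assuming a hypothetical `is_candidate_match` function; the `amount_tolerance` parameter is also an assumption, since this part of the mockup shows no control for it.

```python
from datetime import date
from decimal import Decimal


def is_candidate_match(
    left_date: date,
    left_amount: Decimal,
    right_date: date,
    right_amount: Decimal,
    *,
    date_drift_days: int = 1,
    invert_right_sign: bool = False,
    amount_tolerance: Decimal = Decimal("0.00"),
) -> bool:
    """True when the amounts agree and the dates are within the allowed drift."""
    if invert_right_sign:
        # One side records debits as positive, the other as negative.
        right_amount = -right_amount
    within_dates = abs((left_date - right_date).days) <= date_drift_days
    within_amount = abs(left_amount - right_amount) <= amount_tolerance
    return within_dates and within_amount


# Values taken from the OFFICE DEPOT row in the matched preview.
bank = (date(2026, 5, 6), Decimal("-89.15"))
ledger = (date(2026, 5, 7), Decimal("-89.15"))
one_day_drift = is_candidate_match(*bank, *ledger, date_drift_days=1)
no_drift = is_candidate_match(*bank, *ledger, date_drift_days=0)
```

A row with several right-side candidates passing this test equally well is what would end up in the Review bucket rather than Matched.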
@@ -150,56 +146,34 @@

Results

-
Matched
1,173
Review
9
Unmatched left
22
Unmatched right
16
+
Matched
1,173

Coverage: 97.4% of the larger side

- +
- Matched (1,173) - Review (9) + Review (9) Unmatched left (22) Unmatched right (16) + Matched (1,173)
- -

Preview of first 25 of 1,173 rows — download the CSV below for the full set.

+ +

Pairs flagged because the algorithm couldn't pick a single best match (e.g. multiple equally good candidates). Use the left/right indices to disambiguate manually.

- - - - + - - - - - + +
left_posted_dateleft_descriptionleft_amountright_txn_dateright_memoright_valueamount_diff
left_idxleft_amountright_idxright_valuecandidates
2026-05-01ACME SUPPLIES-1240.002026-05-01Acme Supplies Inc-1240.000.00
2026-05-02PAYROLL RUN-8800.002026-05-02Monthly payroll-8800.000.00
2026-05-03CLIENT GLOBEX5200.002026-05-03Globex retainer5200.000.00
2026-05-04UTILITY CO-318.422026-05-04City Utilities-318.400.02
2026-05-06OFFICE DEPOT-89.152026-05-07Office supplies-89.150.00
118-450.00121, 209-450.002 equal
2031000.00198, 2441000.002 equal
-
- Review (9) — ambiguous candidates -
-

Pairs flagged because the algorithm couldn't pick a single best match (e.g. multiple equally-good candidates). Use the left/right indices to disambiguate manually.

-
- - - - - - -
left_idx | left_amount | right_idx | right_value | candidates
118 | -450.00 | 121, 209 | -450.00 | 2 equal
203 | 1000.00 | 198, 244 | 1000.00 | 2 equal
-
-
-
-
Unmatched left (22) — only in bank_feed_may.csv
@@ -232,14 +206,37 @@
+
+ Matched (1,173) — cleanly reconciled +
+

Preview of first 25 of 1,173 rows — download the CSV below for the full set.

+
+ + + + + + + + + + + + +
left_posted_date | left_description | left_amount | right_txn_date | right_memo | right_value | amount_diff
2026-05-01 | ACME SUPPLIES | -1240.00 | 2026-05-01 | Acme Supplies Inc | -1240.00 | 0.00
2026-05-02 | PAYROLL RUN | -8800.00 | 2026-05-02 | Monthly payroll | -8800.00 | 0.00
2026-05-03 | CLIENT GLOBEX | 5200.00 | 2026-05-03 | Globex retainer | 5200.00 | 0.00
2026-05-04 | UTILITY CO | -318.42 | 2026-05-04 | City Utilities | -318.40 | 0.02
2026-05-06 | OFFICE DEPOT | -89.15 | 2026-05-07 | Office supplies | -89.15 | 0.00
+
+
+
+
- +
- +
diff --git a/layout-review/app.css b/layout-review/app.css
index b363eaa..c169c1d 100644
--- a/layout-review/app.css
+++ b/layout-review/app.css
@@ -122,6 +122,19 @@ code, .dt-mono { font-family: var(--font-mono); font-size: 0.92em; font-feature-
 .dt-nav-link .dt-mi { font-family: "Material Symbols Outlined"; font-size: 18px; color: var(--ink-secondary); line-height: 1; }
 .dt-nav-link.is-active .dt-mi { color: var(--ink); }
 .dt-nav-link.is-soon { opacity: 0.55; }
+
+/* "Start here" front-door item — weightier than ordinary nav links so the
+   obvious entry point reads at a glance. Accent-fill ground + accent-hover ink,
+   slightly larger hit area, with bottom margin to part it from the groups below.
+   Layers on .dt-nav-link, so the .is-active treatment still overrides cleanly. */
+.dt-nav-start {
+  background: var(--accent-fill); color: var(--accent-hover); font-weight: 600;
+  padding: 8px 10px; margin-bottom: 12px;
+}
+.dt-nav-start:hover { background: var(--accent-fill-strong); color: var(--accent-hover); }
+.dt-nav-start .dt-mi { color: var(--accent); }
+.dt-nav-start.is-active { background: var(--accent-fill-strong); color: var(--accent-hover); }
+.dt-nav-start.is-active .dt-mi { color: var(--accent); }
 .dt-nav-soon-tag {
   margin-left: auto; font-size: 9px; font-weight: 600;
   letter-spacing: 0.06em; text-transform: uppercase; color: var(--ink-tertiary);
@@ -199,6 +212,11 @@ code, .dt-mono { font-family: var(--font-mono); font-size: 0.92em; font-feature-
 }
 .dt-help-btn .dt-mi { font-family: "Material Symbols Outlined"; font-size: 18px; }
 .dt-tool-caption { font-size: 12.5px; color: var(--ink-tertiary); line-height: 1.5; margin: 2px 0 0; }
+/* Right-side actions cluster in a tool header: the local-first privacy pill +
+   the Help button. One shared class so every tool page aligns identically
+   (replaces per-page inline flex/gap/margin drift). */
+.dt-tool-header-actions { display: flex; align-items: center; gap: 12px; flex-shrink: 0; margin-top: 6px; }
+.dt-tool-header-actions .dt-help-btn { margin-top: 0; }
 
 /* ===========================================================================
    Buttons
@@ -288,6 +306,24 @@ code, .dt-mono { font-family: var(--font-mono); font-size: 0.92em; font-feature-
 .dt-alert.error { background: var(--danger-fill); color: var(--danger); }
 .dt-alert code { background: rgba(0,0,0,0.05); padding: 1px 5px; border-radius: 4px; }
 
+/* Next-step strip — slim single-line "what to do next" suggestion shown at the
+   end of a tool's results. Subtle accent ground + left accent rule so it nudges
+   without competing with alerts; the trailing dismiss control is unobtrusive. */
+.dt-next-step {
+  display: flex; align-items: center; gap: 10px;
+  background: var(--accent-fill); border-left: 3px solid var(--accent);
+  border-radius: var(--r-md); padding: 10px 14px; margin: 16px 0;
+  font-size: 13.5px; line-height: 1.4; color: var(--ink);
+}
+.dt-next-step .dt-mi { font-family: "Material Symbols Outlined"; font-size: 18px; color: var(--accent); flex-shrink: 0; }
+.dt-next-step a { color: var(--accent); font-weight: 500; }
+.dt-next-step a:hover { color: var(--accent-hover); }
+.dt-next-step-dismiss {
+  margin-left: auto; background: transparent; border: none; cursor: pointer;
+  color: var(--ink-tertiary); font-size: 13px; line-height: 1; padding: 2px 4px;
+}
+.dt-next-step-dismiss:hover { color: var(--ink-secondary); }
+
 /* ===========================================================================
    Inputs (static representations of Streamlit widgets)
    =========================================================================== */
@@ -330,6 +366,20 @@ code, .dt-mono { font-family: var(--font-mono); font-size: 0.92em; font-feature-
 .dt-radio .dot { width: 16px; height: 16px; border-radius: 50%; border: 1px solid var(--border-strong); display: inline-block; flex-shrink: 0; }
 .dt-radio.on .dot { border: 5px solid var(--ink); }
 
+/* Strategy precedence legend + overridden state (Fix Missing Values).
+   Makes the preset -> global -> per-column resolution order legible and
+   visibly dims a layer when a more specific layer wins. */
+.dt-precedence {
+  display: flex; align-items: center; gap: 8px;
+  background: var(--surface-hover); border: 1px solid var(--border);
+  border-radius: var(--r-md); padding: 9px 13px; margin: 0 0 14px;
+  font-size: 12.5px; color: var(--ink-secondary); line-height: 1.4;
+}
+.dt-precedence .dt-mi { font-family: "Material Symbols Outlined"; font-size: 18px; color: var(--ink-tertiary); flex-shrink: 0; }
+.dt-precedence strong { color: var(--ink); font-weight: 600; }
+.dt-radio-row.is-overridden { opacity: 0.5; }
+.dt-radio-row.is-overridden .dt-radio { text-decoration: line-through; text-decoration-color: var(--ink-tertiary); }
+
 /* Slider */
 .dt-slider { margin: 14px 0 6px; }
 .dt-slider .track { position: relative; height: 4px; background: var(--border-strong); border-radius: 2px; }
@@ -445,6 +495,25 @@ table.dt-table td.idx { color: var(--ink-tertiary); background: var(--surface-ho
 .dt-finding-title strong { font-weight: 500; }
 .dt-finding-meta { font-family: var(--font-mono); font-size: 12px; color: var(--ink-tertiary); line-height: 1.4; margin: 0; font-feature-settings: "ss02"; }
 
+/* Overflow control — sits at the foot of a findings card when rows are hidden.
+   Bleeds to the card edges (cancels the .dt-card 16px padding) like .dt-file-add. */
+.dt-finding-more {
+  display: flex; align-items: center; justify-content: center; gap: 6px;
+  width: calc(100% + 32px); margin: 4px -16px -16px;
+  padding: 11px 16px; background: var(--surface-hover);
+  border: none; border-top: 1px solid var(--border);
+  border-radius: 0 0 var(--r-lg) var(--r-lg); cursor: pointer;
+  font-family: var(--font-sans); font-size: 12.5px; font-weight: 500; color: var(--ink-secondary);
+}
+.dt-finding-more:hover { background: var(--accent-fill); color: var(--accent); }
+.dt-finding-more .dt-mi { font-family: "Material Symbols Outlined"; font-size: 18px; }
+
+/* Collapsed findings panel — the group head fills the whole card (head only,
+   no body). Proper state variant so the two states don't drift; replaces the
+   per-instance inline margin-bottom:-16px hack. */
+.dt-card.is-collapsed { padding: 0; }
+.dt-finding-group-head.is-collapsed { margin: 0; border-bottom: none; border-radius: var(--r-lg); }
+
 /* Match-group review card (dedup) */
 .dt-match-card { background: var(--surface); border: 1px solid var(--border); border-radius: var(--r-lg); box-shadow: 0 1px 2px rgba(28,25,23,0.03); margin: 12px 0; overflow: hidden; }
 .dt-match-head { background: var(--surface-hover); border-bottom: 1px solid var(--border); padding: 12px 16px; display: flex; align-items: center; gap: 12px; }
diff --git a/layout-review/home.html b/layout-review/home.html
index 5c4d3ca..6f2d8cb 100644
--- a/layout-review/home.html
+++ b/layout-review/home.html
@@ -69,9 +69,9 @@
-
- - +
+ +

@@ -79,8 +79,8 @@
-
Files analyzed
-
3
+
Rows scanned
+
48,210 rows
Total findings
@@ -96,6 +96,44 @@
+ +
+
+ + auto_awesome + +
+

Recommended

+

Runs the recommended clean — fix text, standardize formats, fill blanks, remove duplicates — in the right order, then hands you the cleaned file.

+
+ +
+ +
+ 1 · Clean Text + arrow_forward + 2 · Standardize + arrow_forward + 3 · Fix Missing + arrow_forward + 4 · Find Duplicates + Result downloads when finished +
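
The four-step strip above implies a fixed order, and the order matters: duplicates that differ only in spacing or case are caught only if cleaning and standardizing run first. A minimal sketch of that idea, assuming rows are plain dictionaries of strings; every function body here is a hypothetical stand-in and not the suite's actual cleaning logic:

```python
from typing import Callable

Row = dict[str, str]
Step = Callable[[list[Row]], list[Row]]


def clean_text(rows: list[Row]) -> list[Row]:
    # Trim and collapse runs of whitespace in every cell.
    return [{k: " ".join(v.split()) for k, v in r.items()} for r in rows]


def standardize(rows: list[Row]) -> list[Row]:
    # Stand-in for format standardization: upper-case every value.
    return [{k: v.upper() for k, v in r.items()} for r in rows]


def fix_missing(rows: list[Row]) -> list[Row]:
    # Stand-in fill strategy: replace blanks with a fixed placeholder.
    return [{k: (v if v else "UNKNOWN") for k, v in r.items()} for r in rows]


def find_duplicates(rows: list[Row]) -> list[Row]:
    # Keep the first occurrence of each identical row, preserving order.
    seen: set[tuple] = set()
    kept: list[Row] = []
    for r in rows:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            kept.append(r)
    return kept


PIPELINE: list[Step] = [clean_text, standardize, fix_missing, find_duplicates]


def run_recommended_clean(rows: list[Row]) -> list[Row]:
    for step in PIPELINE:
        rows = step(rows)
    return rows


result = run_recommended_clean(
    [{"name": " Acme  Supplies ", "city": ""}, {"name": "acme supplies", "city": ""}]
)
```

In this example the two input rows collapse to one only because deduplication runs last; running it first would keep both.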
+
+ + +

Or fix issues one at a time

+

Prefer to handle things yourself? Open any finding to jump straight to the right tool.

+
@@ -129,11 +167,15 @@

3 formats detected · Standardize Formats →

+ +
-
-
+