MortarBench: Evaluating Mortgage Loan Origination Agents

Bojun Liu; Cheng Li; Derek Rindner; Manav Munjal; Matthew Toles; Stephanie Selig; Yuanhao Deng; Yunan Lu; Zhou Yu

arxiv: 2606.19416 · v2 · pith:5KNGVF3Rnew · submitted 2026-06-17 · 💻 cs.LG

MortarBench: Evaluating Mortgage Loan Origination Agents

Matthew Toles , Yunan Lu , Manav Munjal , Bojun Liu , Yuanhao Deng , Stephanie Selig , Derek Rindner , Cheng Li

show 1 more author

Zhou Yu

This is my paper

Pith reviewed 2026-06-26 20:41 UTC · model grok-4.3

classification 💻 cs.LG

keywords mortgage loan originationLLM benchmarkconfidence calibrationAI biasfinancial decision makingloan agent evaluationCRIT framework

0 comments

The pith

A benchmark for mortgage loan agents shows closed-source LLMs reach only 77.1% exact match accuracy, with a new calibration method lifting it to 80.5%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MortarBench to test how well large language models can perform mortgage loan origination tasks such as eligibility checks and risk assessment. Evaluations show current models fall short and display systematic biases tied to non-English names. The authors present CRIT, a confidence calibration approach that raises exact match accuracy to 80.5 percent, steers risk management decisions more effectively, and lowers detected bias.

Core claim

MortarBench relies on a financial data synthesis and mutation pipeline to produce examples that cover edge cases while matching real-world distributions. State-of-the-art LLMs achieve at most 77.1 percent exact match accuracy on the benchmark and exhibit biases in how they perceive foreignness from applicant names. The CRIT confidence calibration framework improves accuracy to 80.5 percent, refines risk management steering, and reduces bias.

What carries the argument

MortarBench benchmark built from synthesis and mutation pipeline, together with the CRIT confidence calibration framework that adjusts model outputs for accuracy and fairness.

If this is right

Loan origination agents can be scored and compared using MortarBench as a public standard.
Applying CRIT calibration improves both accuracy and bias metrics on the benchmark tasks.
Models that pass the benchmark with calibration still require checks for risk steering consistency.
Bias patterns linked to name origin can be measured and mitigated in financial decision agents.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same synthesis approach could generate benchmarks for other lending products such as auto or personal loans.
Deployed agents would need periodic re-testing as applicant demographics and regulations shift.
Combining CRIT-calibrated models with human review loops could further lower error rates in high-stakes cases.

Load-bearing premise

The pipeline that synthesizes and mutates financial data produces examples whose distribution and edge cases match those encountered in actual mortgage applications.

What would settle it

Running the same LLMs on a set of real historical loan files from a lender and measuring exact match to human underwriter decisions would reveal whether MortarBench scores predict performance outside the synthetic set.

Figures

Figures reproduced from arXiv: 2606.19416 by Bojun Liu, Cheng Li, Derek Rindner, Manav Munjal, Matthew Toles, Stephanie Selig, Yuanhao Deng, Yunan Lu, Zhou Yu.

**Figure 2.** Figure 2: Dataset generation pipeline. We generate [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗

**Figure 3.** Figure 3: False negative and positive rate as a function [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Rate at which wire transfers are classified as foreign as a function of language for company and personal [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Recall rate of transactions involving US [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

read the original abstract

Loan origination is the process by which a lender creates a new loan, from application and underwriting through approval and funding. This process serves a critical role in evaluating the eligibility and level of risk posed by an applicant. Recently, firms have begun using mortgage loan agents to augment human loan officers, despite a lack of any public benchmark. To fill this gap, we present MortarBench, a loan origination agent benchmark. MortarBench uses a financial data synthesis and mutation pipeline to generate examples with broad edge case coverage that match real-world distributions and questions. We find that state-of-the-art large language models (LLMs) perform poorly, with closed-source models achieving at most 77.1\% exact match accuracy. We also discover systematic biases in LLM perception of foreignness related to non-English names. Noting these weaknesses, we introduce CRIT, a confidence calibration framework. Our method increases accuracy to 80.5\% while improving risk management steering and reducing bias.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MortarBench is the first public benchmark for mortgage loan agents but its synthetic data pipeline has no reported validation against real distributions like HMDA, so the 77% LLM ceiling and CRIT gains rest on unverified ground.

read the letter

MortarBench introduces the first public benchmark for mortgage loan origination agents along with the CRIT calibration framework. The paper reports that closed-source LLMs top out at 77.1% exact match accuracy on the task, with systematic bias against non-English names, and that CRIT lifts accuracy to 80.5% while improving risk steering.

The contribution that actually lands is the decision to build and release a benchmark in a high-stakes regulated domain where agents are already being tried. Calling out both the accuracy gap and the name-based bias gives concrete evidence that current models are not ready for unsupervised use in lending.

The soft spot is the data. The synthesis and mutation pipeline is described as matching real-world distributions and covering edge cases, yet the paper supplies no quantitative checks—KS tests, Wasserstein distances, or calibration against HMDA or Fannie Mae files. Without those, the performance numbers and bias findings could be artifacts of how the examples were generated rather than genuine task properties. The abstract also gives accuracy figures without an evaluation protocol or statistical tests.

This is for people building agent benchmarks or testing LLMs in finance and compliance. The benchmark idea itself is worth attention, but anyone planning to rely on the reported numbers will need stronger evidence that the test cases track reality.

Send it to peer review. The domain matters and a public test set for loan origination fills a real gap, even if the current version needs more work on data validation.

Referee Report

1 major / 2 minor

Summary. The paper introduces MortarBench, a benchmark for mortgage loan origination agents built via a financial data synthesis and mutation pipeline that is asserted to produce examples with broad edge-case coverage matching real-world distributions. It reports that closed-source LLMs achieve at most 77.1% exact-match accuracy on the benchmark, identifies systematic biases related to non-English names, and proposes the CRIT confidence-calibration framework, which raises accuracy to 80.5% while improving risk steering and reducing bias.

Significance. If the synthetic-data fidelity claim holds, the work would fill a genuine gap by supplying the first public benchmark for an emerging high-stakes application of LLM agents and would supply concrete evidence of current model limitations together with a practical calibration technique. The absence of any quantitative validation of the data pipeline against external loan-level corpora, however, leaves the headline performance numbers and bias findings without an anchor to real distributions.

major comments (1)

[Methods / data-generation pipeline] The section describing the synthesis/mutation pipeline asserts that generated examples 'match real-world distributions' yet reports no statistical comparison (KS tests, Wasserstein distances, marginal/joint calibration, or approval-rate matching) against external references such as HMDA or Fannie Mae loan-level files. Because the central claims (77.1% LLM ceiling, 80.5% CRIT improvement, bias reduction) rest on MortarBench being a faithful proxy, this omission is load-bearing.

minor comments (2)

[Abstract] The abstract introduces the acronym CRIT without an immediate parenthetical expansion; a reader must reach the body to learn it denotes a confidence-calibration framework.
[Results] No table or figure caption clarifies the exact definition of 'exact match accuracy' for loan-origination decisions (e.g., whether it requires identical approval/denial plus all underwriting fields).

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the importance of quantitative validation for the synthetic data pipeline. We address the major comment below and commit to revisions that strengthen the manuscript.

read point-by-point responses

Referee: [Methods / data-generation pipeline] The section describing the synthesis/mutation pipeline asserts that generated examples 'match real-world distributions' yet reports no statistical comparison (KS tests, Wasserstein distances, marginal/joint calibration, or approval-rate matching) against external references such as HMDA or Fannie Mae loan-level files. Because the central claims (77.1% LLM ceiling, 80.5% CRIT improvement, bias reduction) rest on MortarBench being a faithful proxy, this omission is load-bearing.

Authors: We agree that the manuscript's assertion of matching real-world distributions would be strengthened by explicit statistical comparisons to external references such as HMDA or Fannie Mae loan-level files, and that the absence of such validation (e.g., KS tests, Wasserstein distances, marginal/joint calibration, or approval-rate matching) is a substantive limitation given the centrality of benchmark fidelity to the reported LLM performance and bias results. The synthesis pipeline was constructed using domain-informed rules and mutation strategies intended to cover edge cases and approximate real distributions, but no direct quantitative anchoring against public loan-level corpora was performed or reported. In the revised version we will add these comparisons, including Kolmogorov-Smirnov tests and Wasserstein distances on key marginals (credit score, DTI, LTV, income), joint distribution checks where feasible, and approval-rate alignment against HMDA aggregates, with any limitations in data access or privacy clearly noted. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark with no derivations or load-bearing self-citations

full rationale

The paper introduces MortarBench via a described synthesis pipeline and reports direct empirical measurements of LLM accuracy (77.1% ceiling, 80.5% with CRIT) plus bias observations. No equations, fitted parameters renamed as predictions, uniqueness theorems, or self-citation chains appear in the provided text. All claims reduce to experimental results on the constructed benchmark rather than any derivation that collapses to its own inputs by construction. This is standard self-contained empirical reporting.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

Abstract-only review; no mathematical structure, free parameters, or axioms identifiable.

invented entities (1)

CRIT no independent evidence
purpose: confidence calibration framework to improve LLM accuracy and reduce bias in loan tasks
New method introduced to address observed weaknesses; no independent evidence supplied.

pith-pipeline@v0.9.1-grok · 5717 in / 1078 out tokens · 29535 ms · 2026-06-26T20:41:17.898639+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

12 extracted references · 3 canonical work pages · 3 internal anchors

[1]

InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3697–3711

Finqa: A dataset of numerical reasoning over financial data. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3697–3711. Chanyeol Choi, Jihoon Kwon, Jaeseon Ha, Hojun Choi, Chaewoon Kim, Yongjae Lee, Jy-yong Sohn, and Alejandro Lopez-Lira

2021
[2]

InProceedings of the 6th ACM International Conference on AI in Fi- nance, pages 638–646

Finder: Finan- cial dataset for question answering and evaluating retrieval-augmented generation. InProceedings of the 6th ACM International Conference on AI in Fi- nance, pages 638–646. Fannie Mae. 2025.Fannie Mae Single-Family Selling Guide. Fannie Mae. Published December 10,

2025
[3]

https://sf

Uniform residen- tial loan application. https://sf. freddiemac.com/tools-learning/ uniform-mortgage-data-program/ulad . Ac- cessed: 2026-05-21. Pranab Islam, Anand Kannappan, Douwe Kiela, Re- becca Qian, Nino Scherrer, and Bertie Vidgen

2026
[4]

FinanceBench: A New Benchmark for Financial Question Answering

Financebench: A new benchmark for financial ques- tion answering.arXiv preprint arXiv:2311.11944. Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, and 1 others

work page internal anchor Pith review Pith/arXiv arXiv
[5]

Language Models (Mostly) Know What They Know

Language mod- els (mostly) know what they know.arXiv preprint arXiv:2207.05221. Sanyam Kapoor, Nate Gruver, Manley Roberts, Arka Pal, Samuel Dooley, Micah Goldblum, and Andrew Wilson

work page internal anchor Pith review Pith/arXiv arXiv
[6]

InPro- ceedings of the 1st Workshop on Uncertainty-Aware NLP (UncertaiNLP 2024), pages 1–14

Calibration-tuning: Teaching large lan- guage models to know what they don’t know. InPro- ceedings of the 1st Workshop on Uncertainty-Aware NLP (UncertaiNLP 2024), pages 1–14. Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar

2024
[7]

Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation

Semantic uncertainty: Linguistic invariances for un- certainty estimation in natural language generation. arXiv preprint arXiv:2302.09664. Lefteris Loukas, Manos Fergadiotis, Ilias Chalkidis, Eirini Spyropoulou, Prodromos Malakasiotis, Ion An- droutsopoulos, and Georgios Paliouras

work page internal anchor Pith review Pith/arXiv arXiv
[8]

In Proceedings of the 2023 conference on empirical methods in natural language processing, pages 9004–

Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. In Proceedings of the 2023 conference on empirical methods in natural language processing, pages 9004–

2023
[9]

InProceedings of the 62nd Annual Meeting of the Association for Computational Lin- guistics (Volume 2: Short Papers), pages 445–458

Docfinqa: A long-context financial rea- soning dataset. InProceedings of the 62nd Annual Meeting of the Association for Computational Lin- guistics (Volume 2: Short Papers), pages 445–458. Kenneth J Rothman, Sander Greenland, Timothy L Lash, and 1 others. 2008.Modern epidemiology, vol- ume

2008
[10]

In Proceedings of the Third Workshop on Economics and Natural Language Processing, pages 19–25

The global banking standards qa dataset (gbs-qa). In Proceedings of the Third Workshop on Economics and Natural Language Processing, pages 19–25. Wikipedia Contributors. 2026a. List of most popular given names. https://en.wikipedia.org/ wiki/List_of_most_popular_given_names. Wikipedia, accessed 2026-05-22. Wikipedia Contributors. 2026b. Lists of most comm...

2026
[11]

In Findings of the Association for Computational Lin- guistics: EMNLP 2023, pages 9778–9795

On the calibra- tion of large language models and alignment. In Findings of the Association for Computational Lin- guistics: EMNLP 2023, pages 9778–9795. Fengbin Zhu, Wenqiang Lei, Youcheng Huang, Chao Wang, Shuo Zhang, Jiancheng Lv, Fuli Feng, and Tat-Seng Chua

2023
[12]

seed": "generated-test-15d7a496

Tat-qa: A question answering benchmark on a hybrid of tabular and textual content in finance. InProceedings of the 59th annual meeting of the Association for Computational Linguistics and the 11th international joint conference on natural language processing (volume 1: long papers), pages 3277–3287. A Example Bank Statement { "seed": "generated-test-15d7a...

2026

[1] [1]

InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3697–3711

Finqa: A dataset of numerical reasoning over financial data. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3697–3711. Chanyeol Choi, Jihoon Kwon, Jaeseon Ha, Hojun Choi, Chaewoon Kim, Yongjae Lee, Jy-yong Sohn, and Alejandro Lopez-Lira

2021

[2] [2]

InProceedings of the 6th ACM International Conference on AI in Fi- nance, pages 638–646

Finder: Finan- cial dataset for question answering and evaluating retrieval-augmented generation. InProceedings of the 6th ACM International Conference on AI in Fi- nance, pages 638–646. Fannie Mae. 2025.Fannie Mae Single-Family Selling Guide. Fannie Mae. Published December 10,

2025

[3] [3]

https://sf

Uniform residen- tial loan application. https://sf. freddiemac.com/tools-learning/ uniform-mortgage-data-program/ulad . Ac- cessed: 2026-05-21. Pranab Islam, Anand Kannappan, Douwe Kiela, Re- becca Qian, Nino Scherrer, and Bertie Vidgen

2026

[4] [4]

FinanceBench: A New Benchmark for Financial Question Answering

Financebench: A new benchmark for financial ques- tion answering.arXiv preprint arXiv:2311.11944. Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, and 1 others

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

Language Models (Mostly) Know What They Know

Language mod- els (mostly) know what they know.arXiv preprint arXiv:2207.05221. Sanyam Kapoor, Nate Gruver, Manley Roberts, Arka Pal, Samuel Dooley, Micah Goldblum, and Andrew Wilson

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

InPro- ceedings of the 1st Workshop on Uncertainty-Aware NLP (UncertaiNLP 2024), pages 1–14

Calibration-tuning: Teaching large lan- guage models to know what they don’t know. InPro- ceedings of the 1st Workshop on Uncertainty-Aware NLP (UncertaiNLP 2024), pages 1–14. Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar

2024

[7] [7]

Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation

Semantic uncertainty: Linguistic invariances for un- certainty estimation in natural language generation. arXiv preprint arXiv:2302.09664. Lefteris Loukas, Manos Fergadiotis, Ilias Chalkidis, Eirini Spyropoulou, Prodromos Malakasiotis, Ion An- droutsopoulos, and Georgios Paliouras

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

In Proceedings of the 2023 conference on empirical methods in natural language processing, pages 9004–

Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. In Proceedings of the 2023 conference on empirical methods in natural language processing, pages 9004–

2023

[9] [9]

InProceedings of the 62nd Annual Meeting of the Association for Computational Lin- guistics (Volume 2: Short Papers), pages 445–458

Docfinqa: A long-context financial rea- soning dataset. InProceedings of the 62nd Annual Meeting of the Association for Computational Lin- guistics (Volume 2: Short Papers), pages 445–458. Kenneth J Rothman, Sander Greenland, Timothy L Lash, and 1 others. 2008.Modern epidemiology, vol- ume

2008

[10] [10]

In Proceedings of the Third Workshop on Economics and Natural Language Processing, pages 19–25

The global banking standards qa dataset (gbs-qa). In Proceedings of the Third Workshop on Economics and Natural Language Processing, pages 19–25. Wikipedia Contributors. 2026a. List of most popular given names. https://en.wikipedia.org/ wiki/List_of_most_popular_given_names. Wikipedia, accessed 2026-05-22. Wikipedia Contributors. 2026b. Lists of most comm...

2026

[11] [11]

In Findings of the Association for Computational Lin- guistics: EMNLP 2023, pages 9778–9795

On the calibra- tion of large language models and alignment. In Findings of the Association for Computational Lin- guistics: EMNLP 2023, pages 9778–9795. Fengbin Zhu, Wenqiang Lei, Youcheng Huang, Chao Wang, Shuo Zhang, Jiancheng Lv, Fuli Feng, and Tat-Seng Chua

2023

[12] [12]

seed": "generated-test-15d7a496

Tat-qa: A question answering benchmark on a hybrid of tabular and textual content in finance. InProceedings of the 59th annual meeting of the Association for Computational Linguistics and the 11th international joint conference on natural language processing (volume 1: long papers), pages 3277–3287. A Example Bank Statement { "seed": "generated-test-15d7a...

2026