MBABench: Evaluating LLM Agents on End-to-End Spreadsheet Tasks in Finance
Pith reviewed 2026-06-30 17:09 UTC · model grok-4.3
The pith
LLM agents cannot reliably produce professional-quality financial spreadsheets from high-level instructions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Even the strongest agents frequently fall short of professional finance standards and degrade sharply as the difficulty increases beyond a few chained calculations, showing that current agents are not yet able to reliably produce professional-quality spreadsheets at the level of complexity real-world workflows demand.
What carries the argument
A three-dimensional evaluation taxonomy of Accuracy, Formula, and Format, each with fine-grained criteria that reflect how finance deliverables are reviewed and revised by multiple stakeholders.
If this is right
- Agents require new capabilities for maintaining correctness across chained calculations in financial models.
- Outputs must meet explicit standards for readability and ease of modification to pass stakeholder review.
- Benchmarks focused only on question answering or single edits miss the end-to-end construction gap shown here.
- Finance workflows cannot yet be automated end-to-end at professional quality with existing agent designs.
Where Pith is reading between the lines
- The taxonomy could be adapted to evaluate agent performance on other structured professional artifacts such as reports or presentations.
- Agents might benefit from built-in revision loops that incorporate feedback on formula and format issues.
- Task difficulty scaling in the benchmark could guide training data creation for more complex multi-step workflows.
Load-bearing premise
The three-dimensional taxonomy with its fine-grained criteria accurately reflects the multidimensional professional standards used when finance deliverables are reviewed and revised by multiple stakeholders.
What would settle it
A direct comparison in which the top agent produces complete spreadsheets on complex tasks that receive passing professional review scores across accuracy, formula integrity, and format criteria.
Original abstract
LLM agents are increasingly expected to carry out end-to-end workflows, producing complete artifacts from high-level user instructions. To meet enterprise needs, frontier AI labs have developed agents that can construct entire spreadsheets from scratch. This is especially relevant in finance, where core workflows such as financial modeling, forecasting, and scenario analysis are commonly conducted through spreadsheets. Yet, existing spreadsheet benchmarks do not measure this advanced capability, focusing instead on question-answering or single-formula edits. To address this gap, we provide one of the first evaluations of agents on end-to-end spreadsheet tasks, focusing on economically critical financial workflows such as modeling and scenario analysis. Since deliverables therein are routinely reviewed and revised by multiple stakeholders, judging their quality necessarily involves high-level criteria such as readability or ease of modification. To reflect the multidimensional nature of solution quality, we develop an evaluation taxonomy comprising three dimensions: Accuracy, Formula, and Format, each comprising fine-grained criteria that reflect professional standards. The Claude family leads the benchmark and produces the most professional-looking outputs in our qualitative review, but even the strongest agents frequently fall short of professional finance standards and degrade sharply as the difficulty increases beyond a few chained calculations. This suggests that current agents are not yet able to reliably produce professional-quality spreadsheets at the level of complexity real-world workflows demand.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MBABench, one of the first benchmarks for LLM agents on end-to-end spreadsheet construction tasks in finance (e.g., modeling, forecasting, scenario analysis). It defines a three-dimensional evaluation taxonomy (Accuracy, Formula, Format) with fine-grained criteria claimed to reflect professional standards, evaluates leading agents, and concludes that even the strongest (Claude family) fall short of professional finance standards with sharp degradation beyond a few chained calculations.
Significance. If the taxonomy and scoring are validated, the work is significant for highlighting a gap between current agent capabilities and real-world finance workflow demands, moving beyond existing QA or single-edit spreadsheet benchmarks. The qualitative review of output professionalism provides additional value.
major comments (2)
- [Abstract] Abstract: The claim that the Accuracy-Formula-Format taxonomy's fine-grained criteria 'reflect professional standards' because deliverables are 'routinely reviewed and revised by multiple stakeholders' lacks any described derivation process, expert calibration, inter-rater study, or comparison against actual finance review rubrics. This is load-bearing for the central claim that agents fall short of professional standards.
- [Evaluation Methodology] Evaluation section (inferred from abstract): No details are supplied on task construction, number of examples per difficulty level, scoring rubrics, inter-rater reliability, or statistical tests for the reported performance degradation with complexity. These omissions prevent verification that the data support the claims about agent limitations.
minor comments (1)
- [Abstract] Abstract: The statement that this is 'one of the first evaluations' would benefit from explicit citations to the 'existing spreadsheet benchmarks' mentioned for context.
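The second major comment asks for statistical tests of the reported degradation with complexity. One way such a test could look is a rank correlation between difficulty tier and score, with a permutation test for the one-sided hypothesis that scores fall as tiers rise. This is a minimal sketch only: the function names, the tier encoding, and the data are all illustrative assumptions, and the numbers below are placeholders rather than results from the paper.

```python
import random

def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def permutation_p_value(tiers, scores, n_perm=2000, seed=0):
    """One-sided p-value for 'scores decrease as tier increases'."""
    rng = random.Random(seed)
    observed = spearman(tiers, scores)
    shuffled = list(scores)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if spearman(tiers, shuffled) <= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return observed, (hits + 1) / (n_perm + 1)

# Placeholder data, NOT results from the paper:
# difficulty tier per task (1 = easiest) and a per-task score in [0, 1].
tiers = [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4]
scores = [0.9, 0.8, 0.85, 0.7, 0.75, 0.6, 0.5, 0.55, 0.4, 0.3, 0.2, 0.35]
rho, p = permutation_p_value(tiers, scores)
print(f"Spearman rho = {rho:.3f}, one-sided p = {p:.4f}")
```

A permutation test is used here because it makes no distributional assumption about the scores, which seems appropriate when the number of tasks per tier is small or unknown.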
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and commit to revisions that improve clarity and transparency without misrepresenting the work.
- Referee: [Abstract] Abstract: The claim that the Accuracy-Formula-Format taxonomy's fine-grained criteria 'reflect professional standards' because deliverables are 'routinely reviewed and revised by multiple stakeholders' lacks any described derivation process, expert calibration, inter-rater study, or comparison against actual finance review rubrics. This is load-bearing for the central claim that agents fall short of professional standards.
  Authors: We agree the manuscript would be strengthened by explicitly describing the taxonomy's development. The criteria draw from established financial modeling practices (e.g., emphasis on auditability, modifiability, and readability as discussed in standard references such as financial modeling guides). The multi-stakeholder review rationale follows directly from the nature of professional finance deliverables. However, no formal expert calibration or inter-rater study is reported. In revision we will add a dedicated paragraph detailing the taxonomy construction process with supporting citations and will note the absence of formal validation studies as a limitation.
  Revision: yes
- Referee: [Evaluation Methodology] Evaluation section (inferred from abstract): No details are supplied on task construction, number of examples per difficulty level, scoring rubrics, inter-rater reliability, or statistical tests for the reported performance degradation with complexity. These omissions prevent verification that the data support the claims about agent limitations.
  Authors: We acknowledge that additional methodological detail is required for verifiability. The Evaluation section will be expanded to cover: (1) task construction process and how tasks map to real finance workflows, (2) exact counts of examples per difficulty tier, (3) the complete scoring rubrics for Accuracy, Formula, and Format, (4) clarification that evaluation combines rule-based checks with qualitative human review, and (5) any statistical observations on degradation with complexity. We will also state whether formal inter-rater reliability metrics were computed.
  Revision: yes
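To make the phrase "rule-based checks" concrete, here is a small sketch of what one automated check in the Formula dimension might look like: flagging numeric literals typed directly into formulas, a pattern that financial modeling guidance commonly discourages because it hides assumptions. The rule, the function names, and the sample cells are hypothetical and are not taken from the benchmark's rubric. The regex is deliberately simplistic: it would also flag harmless literals such as the `1` in `1+rate` or digits inside quoted strings.

```python
import re

# A numeric literal that is not the tail of a cell reference (B12, $C$3),
# a function name (LOG10), or another number.
_LITERAL = re.compile(r"(?<![A-Za-z$\d.])\d+(?:\.\d+)?")

def hardcoded_constants(formula):
    """Return the numeric literals typed directly into a formula string."""
    if not formula.startswith("="):
        return []  # plain values and labels are out of scope for this check
    return _LITERAL.findall(formula)

def check_sheet(cells):
    """Map each offending cell address to the literals found in its formula."""
    findings = {}
    for address, content in cells.items():
        literals = hardcoded_constants(content)
        if literals:
            findings[address] = literals
    return findings

# Hypothetical cell contents, not taken from any benchmark task.
sheet = {
    "B5": "=B4*1.08",      # growth rate typed into the formula
    "B6": "=B4*$C$1",      # growth rate read from an input cell
    "B7": "=SUM(B2:B6)",
    "A1": "Revenue",
}
findings = check_sheet(sheet)
print(findings)
```

A check of this kind can only cover mechanical properties. Criteria such as readability or sensible layout would still need the qualitative human review the authors mention.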
Circularity Check
No significant circularity: empirical benchmark paper with no derivations or fitted predictions
This is an empirical benchmark paper that introduces MBABench and evaluates LLM agents on spreadsheet tasks using a custom three-dimensional taxonomy (Accuracy, Formula, Format). The taxonomy is presented as reflecting professional standards without any claimed mathematical derivation, parameter fitting, or prediction step that reduces to the inputs by construction. No equations, self-citation load-bearing arguments, uniqueness theorems, or ansatzes appear in the provided text. The central claims rest on direct evaluation results and qualitative review rather than any self-referential reduction, making the paper self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: The financial workflows chosen (modeling, forecasting, scenario analysis) are representative of core enterprise spreadsheet tasks.
- Domain assumption: High-level criteria such as readability and ease of modification are necessary for judging quality because deliverables are routinely reviewed by multiple stakeholders.
Forward citations
Cited by 2 Pith papers
- IPO Finance Agent: Benchmark of LLM Financial Analysts Beyond Finance Agent v2, with Automated Rubric Generation, on the SpaceX (SPCX) IPO
  IPO Finance Agent creates a new LLM benchmark for IPO due diligence on SpaceX S-1 with contextual retrieval and automated rubrics, where Qwen 3.7 Max scores 79.4% at $0.30/query, outperforming Finance Agent v2 on accu...
- IPO Finance Agent: Benchmark of LLM Financial Analysts Beyond Finance Agent v2, with Automated Rubric Generation, on the SpaceX (SPCX) IPO
  IPO Finance Agent benchmarks LLMs on SpaceX S-1 questions with contextual retrieval and auto-generated rubrics, reporting up to 79.8% accuracy and better cost-efficiency than prior Finance Agent v2 entries.