arxiv: 2605.22664 · v1 · pith:6RVQDMX6new · submitted 2026-05-21 · 💻 cs.AI

WorkstreamBench: Evaluating LLM Agents on End-to-End Spreadsheet Tasks in Finance

Pith reviewed 2026-05-22 05:25 UTC · model grok-4.3

classification 💻 cs.AI
keywords: LLM agents, spreadsheet tasks, financial modeling, benchmark, end-to-end workflows, finance

The pith

Current LLM agents fall short of professional standards when creating complete financial spreadsheets from high-level instructions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops WorkstreamBench to test LLM agents on building entire spreadsheets for finance tasks such as modeling and scenario analysis. Existing benchmarks check only question answering or single-formula edits, but real work requires full workflows whose deliverables are reviewed and revised by multiple stakeholders. The authors create tasks of varying difficulty and score agent outputs along three dimensions: Accuracy (the numbers), Formula (the correctness of the formulas), and Format (readability and ease of modification). The Claude family leads the benchmark, but even the strongest agents often produce work that falls short of professional standards, especially as more calculations are chained together. The evaluation highlights a gap in current AI capabilities for automating complex finance deliverables.

Core claim

LLM agents are not yet able to reliably produce professional-quality spreadsheets at the level of complexity real-world financial workflows demand, as shown by their sharp performance degradation on harder tasks despite leading models producing the most professional-looking outputs.

What carries the argument

A three-dimensional evaluation taxonomy with Accuracy, Formula, and Format dimensions, each containing fine-grained criteria that reflect professional finance standards for reviewed deliverables.
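
To make the taxonomy concrete, here is a minimal sketch of how pass/fail criteria could roll up into one composite score. The weights, the criterion names, and the "fraction of criteria passed" rule are illustrative assumptions; this page does not state the paper's actual weights or aggregation rule.

```python
# Hypothetical dimension weights (assumption: not taken from the paper).
WEIGHTS = {"accuracy": 0.5, "formula": 0.3, "format": 0.2}


def dimension_score(criteria: dict[str, bool]) -> float:
    """Fraction of pass/fail criteria that passed within one dimension."""
    if not criteria:
        raise ValueError("dimension has no criteria")
    return sum(criteria.values()) / len(criteria)


def composite_score(rubric: dict[str, dict[str, bool]],
                    weights: dict[str, float] = WEIGHTS) -> float:
    """Weighted average of per-dimension scores, normalised by total weight."""
    total_weight = sum(weights.values())
    weighted = sum(w * dimension_score(rubric[dim]) for dim, w in weights.items())
    return weighted / total_weight


# Hypothetical judge verdicts for one agent attempt.
example = {
    "accuracy": {"final_value_matches": True, "intermediate_values_match": False},
    "formula": {"no_hardcoded_values": False, "absolute_references": True,
                "error_handling": True, "dynamic_ranges": True},
    "format": {"consistent_number_format": True, "freeze_panes": True},
}
score = composite_score(example)
```

Normalising by the total weight keeps the score in [0, 1] even if the chosen weights do not sum to one.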

If this is right

  • If agents can close this gap, they could automate substantial portions of financial modeling and analysis in enterprise settings.
  • Performance degrades with increasing difficulty, implying that current models lack robust capabilities for long sequences of dependent calculations.
  • Qualitative review shows Claude family outputs look most professional, suggesting differences in how models handle structure and presentation.
  • The benchmark focuses on end-to-end tasks, which could guide development toward more practical AI tools for finance.
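
The degradation claim amounts to comparing mean composite scores across difficulty levels, with a standard error per level. A minimal sketch follows, assuming results arrive as (difficulty level, composite score) pairs; the numbers are made up for illustration and are not the paper's results.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev


def summarize_by_difficulty(results):
    """Return {level: (mean score, standard error)} for (level, score) pairs."""
    grouped = defaultdict(list)
    for level, score in results:
        grouped[level].append(score)

    summary = {}
    for level in sorted(grouped):
        scores = grouped[level]
        # The standard error needs at least two attempts at this level.
        se = stdev(scores) / sqrt(len(scores)) if len(scores) > 1 else float("nan")
        summary[level] = (mean(scores), se)
    return summary


# Made-up scores, for illustration only.
results = [(1, 0.9), (1, 0.8), (2, 0.7), (2, 0.6), (3, 0.4), (3, 0.5)]
summary = summarize_by_difficulty(results)

# True when the mean score strictly decreases as difficulty rises.
means = [m for m, _ in summary.values()]
degrades = all(a > b for a, b in zip(means, means[1:]))
```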

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Such a benchmark might be extended to other domains like accounting or data analysis where spreadsheets are central.
  • Improvements here could lead to AI assistants that not only compute but also create maintainable documents for teams.
  • Tracking progress on this benchmark over time could measure advancement in agentic AI for professional tasks.

Load-bearing premise

The assumption that the Accuracy, Formula, and Format dimensions together capture the key criteria that finance stakeholders apply when they review and revise spreadsheet work.

What would settle it

A controlled study in which finance professionals rate the quality of agent-generated spreadsheets on real tasks and check whether high benchmark scores align with their professional approval or rejection.
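
One way such a study could be analysed is to correlate benchmark scores with binary expert verdicts (a point-biserial correlation, which is Pearson's r with one 0/1 variable). The sketch below uses invented data and is not an analysis the paper reports.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; with a 0/1 variable this is the point-biserial r."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 items")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / sqrt(vx * vy)


# Invented data: composite scores and expert verdicts (1 = approved).
benchmark_scores = [0.9, 0.8, 0.75, 0.4, 0.3, 0.2]
expert_approved = [1, 1, 1, 0, 0, 0]
r = pearson(benchmark_scores, expert_approved)
```

A high r on real data would support the taxonomy as a proxy for professional review; a low r would suggest the rubric measures something experts do not prioritise.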

Figures

Figures reproduced from arXiv: 2605.22664 by Adam Shen, Ali Bauyrzhan, Daniel Guetta, Haoyang Liu, Harshith Srinivas Gear, Hongseok Namkoong, Joshua Fan, Julian Poeltl, Siri Du, Thomson Yen, Yili Liu, Yilin Meng.

Figure 1: Compared to prior work that focuses on atomic tasks on spreadsheet (…
Figure 2: The rubric (left) and the composite scores of agents on WorkstreamBench (right). The error bar indicates standard error. We found that Claude Web outperforms all others by a clear margin in all three core dimensions separately. However, a substantial gap remains in current agents' capabilities to meet professional standards.
Figure 3: Comparison between an unreadable, monolithic function (…
Figure 4: Distribution of difficulty and task types of WorkstreamBench. (Left) Roughly speaking, level 2 (Easy) is generally feasible without deep financial expertise, whereas level 4 (Medium-Hard) is not doable for recent grads without assistance from seasoned modelers. (Right) WorkstreamBench covers a wide variety of tasks. 3-Statement, for example, is the backbone of various financial modeling tasks.
Figure 5: An LLM judge output for the formula rubric. Here, the judge correctly identifies that the entries are hardcoded, resulting in a brittle spreadsheet if the model's start date changes. Note that these mistakes would have otherwise been difficult to detect with exact-matching-based verification. Criteria Weighting. To enable quantitative cross-model comparison, we assign weights to the evaluation dimensions an…
Figure 6: Mean attempt completion time (bottom) and mean composite score (top) plotted against task difficulty. (Left) API. (Right) GUI agents. Agents seem to recognize task difficulty, and generally spend more time as difficulty increases. However, their performance nonetheless degrades significantly, suggesting that harder spreadsheet tasks in WorkstreamBench pose difficulty that test-time scaling struggle to over…
Figure 7: Comparison between proper usage of absolute values
Figure 8: Comparison between proper usage of error handling
Figure 9: Comparison between usage of hardcoded values
Figure 10: Comparison between proper usage of dynamic ranges
Figure 11: Referencing the entire column (B:B) forces an awkward layout: the commission rate and…
Figure 12: Comparison between three Excel structures that display the same format. The illogical…
Figure 13: System architecture of the Playwright Excel Agent, illustrating the integration between…
Figure 14: System architecture of the CLI Agent, illustrating the integration between user configs,…
Figure 15: Font color switch between two consecutive rows
Figure 16: Hardcoded values instead of formulas. Additionally, the models often fail to meet industry conventions. These industry conventions include consistent number formatting, the insertion of a separate column for the unit of the respective row, and the display of negative numbers in parentheses instead of a minus sign.
Figure 17: Non-compliance with industry conventions such as number formatting
Figure 18: Visual documentation of failure modes for the Playwright Agents, categorized by GUI…
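
Several of the figures above concern hardcoded values inside formulas. The sketch below is a rough, regex-based heuristic for spotting numeric literals in a formula string. It is an illustrative assumption, not the paper's judge, which is LLM-based; the helper names and the allowed-constant set are hypothetical.

```python
import re

# Cell references such as A1, $B$2 or Sheet1!C3 are stripped before we look
# for literals, so the row number in "B2" is not mistaken for a constant.
CELL_REF = re.compile(r"(?:[A-Za-z_][\w.]*!)?\$?[A-Z]{1,3}\$?\d+")
NUMBER = re.compile(r"(?<![\w.])\d+(?:\.\d+)?")
ALLOWED = {"0", "1"}  # hypothetical allow-list of harmless constants


def hardcoded_literals(cell_value: str) -> list[str]:
    """Return numeric literals embedded in a formula; [] for non-formula cells."""
    if not cell_value.startswith("="):
        return []  # plain inputs are not judged by this heuristic
    without_refs = CELL_REF.sub("", cell_value)
    return [n for n in NUMBER.findall(without_refs) if n not in ALLOWED]


flagged = hardcoded_literals("=B2*0.05")   # rate typed into the formula
clean = hardcoded_literals("=B2*$F$1")     # rate read from an input cell
```

The heuristic produces false positives (for example, the digits argument in `ROUND(B2,2)`) and ignores lowercase references, so it would serve at most as a pre-filter ahead of human or LLM review.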

Original abstract

LLM agents are increasingly expected to carry out end-to-end workflows, producing complete artifacts from high-level user instructions. To meet enterprise needs, frontier AI labs have developed agents that can construct entire spreadsheets from scratch. This is especially relevant in finance, where core workflows such as financial modeling, forecasting, and scenario analysis are commonly conducted through spreadsheets. Yet, existing spreadsheet benchmarks do not measure this advanced capability, focusing instead on question-answering or single-formula edits. To address this gap, we provide one of the first evaluations of agents on end-to-end spreadsheet tasks, focusing on economically critical financial workflows such as modeling and scenario analysis. Since deliverables therein are routinely reviewed and revised by multiple stakeholders, judging their quality necessarily involves high-level criteria such as readability or ease of modification. To reflect the multidimensional nature of solution quality, we develop an evaluation taxonomy comprising three dimensions: Accuracy, Formula, and Format, each comprising fine-grained criteria that reflect professional standards. The Claude family leads the benchmark and produces the most professional-looking outputs in our qualitative review, but even the strongest agents frequently fall short of professional finance standards and degrade sharply as the difficulty increases beyond a few chained calculations. This suggests that current agents are not yet able to reliably produce professional-quality spreadsheets at the level of complexity real-world workflows demand.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces WorkstreamBench, one of the first benchmarks for evaluating LLM agents on end-to-end spreadsheet construction tasks in finance, such as financial modeling, forecasting, and scenario analysis. It defines a three-dimensional evaluation taxonomy (Accuracy, Formula, Format) with fine-grained criteria intended to reflect professional standards, evaluates frontier agents on these tasks, and reports that the Claude family produces the most professional-looking outputs while all agents frequently fall short of professional finance standards and degrade sharply with increasing difficulty beyond a few chained calculations.

Significance. If the taxonomy is shown to align with actual finance-professional review practices, the benchmark fills a clear gap left by prior spreadsheet evaluations focused on question-answering or single-formula edits. The work supplies concrete evidence of current agent limitations on complex, multi-stakeholder deliverables and offers a reusable evaluation framework that could steer development toward more reliable enterprise agents. The qualitative observation that Claude outputs appear most professional is a useful secondary finding.

major comments (1)
  1. Abstract: The central claim that agents 'frequently fall short of professional finance standards' rests on the assertion that the Accuracy/Formula/Format taxonomy 'reflect professional standards.' No evidence is provided of external validation—such as expert inter-rater agreement, correlation with real revision rates, or blind comparison against existing finance review rubrics—which is load-bearing for interpreting the measured performance gaps and difficulty scaling as evidence of professional inadequacy.
minor comments (1)
  1. Abstract: The headline results would be easier to contextualize if the number of tasks, difficulty levels, and agents evaluated were stated explicitly rather than left implicit.
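
The inter-rater agreement mentioned in the major comment is commonly summarised with Cohen's kappa. A minimal sketch for two raters assigning Pass/Fail verdicts to the same criteria follows; the verdicts are invented, and pairing an LLM judge with a human expert is an assumption made for illustration.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty, equal-length label sequences")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Invented verdicts on eight rubric criteria.
judge = ["Pass", "Fail", "Pass", "Pass", "Fail", "Fail", "Pass", "Fail"]
expert = ["Pass", "Fail", "Pass", "Fail", "Fail", "Fail", "Pass", "Pass"]
kappa = cohens_kappa(judge, expert)
```

Here the raters agree on 6 of 8 items (0.75) while chance agreement is 0.5, so kappa is 0.5.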

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback, which helps clarify the scope and limitations of our evaluation framework. We address the major comment point by point below.

Point-by-point responses
  1. Referee: Abstract: The central claim that agents 'frequently fall short of professional finance standards' rests on the assertion that the Accuracy/Formula/Format taxonomy 'reflect professional standards.' No evidence is provided of external validation—such as expert inter-rater agreement, correlation with real revision rates, or blind comparison against existing finance review rubrics—which is load-bearing for interpreting the measured performance gaps and difficulty scaling as evidence of professional inadequacy.

    Authors: We agree that the manuscript would benefit from clearer justification of how the taxonomy aligns with professional practice. The three dimensions and their criteria were derived from widely cited guidelines in financial modeling (e.g., best-practice recommendations from the CFA Institute, Wall Street training materials on model auditability, and common review criteria used in investment banking for model hand-off). However, we did not perform formal external validation such as inter-rater reliability studies with practicing finance professionals or correlation analyses against real-world revision rates. In the revised version we will (1) expand the Methods section with explicit references to the professional sources used to construct each criterion, (2) add a dedicated Limitations subsection that acknowledges the absence of direct expert validation and the interpretive caution this implies, and (3) revise the abstract and conclusion to state that agents fall short of the standards encoded in the taxonomy rather than asserting professional inadequacy without qualification. These changes will make the evidential basis for our claims more transparent. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

This is an empirical benchmark paper with no mathematical derivations, equations, or fitted parameters. The evaluation taxonomy is introduced as a methodological framework whose criteria are stated to reflect professional standards; this is a definitional choice for measurement rather than a prediction or result derived from prior inputs by construction. Central claims consist of direct performance measurements and qualitative observations against the stated criteria. No self-citation chains or uniqueness theorems are invoked as load-bearing premises. The work is self-contained as an empirical study, consistent with the default expectation of no circularity for such evaluations.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on the representativeness of the chosen financial workflows and the validity of the three-axis taxonomy as a proxy for professional quality; no free parameters or invented physical entities are involved.

axioms (2)
  • domain assumption: The selected tasks (financial modeling, forecasting, scenario analysis) are representative of real-world finance workflows that require multi-stakeholder review.
    The abstract positions these tasks as core to finance and the reason existing benchmarks are insufficient.
  • domain assumption: The Accuracy-Formula-Format taxonomy reflects the criteria actually used by finance professionals when judging spreadsheet deliverables.
    The abstract states that judging quality involves high-level criteria such as readability and ease of modification.
invented entities (1)
  • WorkstreamBench benchmark · no independent evidence
    purpose: To measure end-to-end spreadsheet construction capability in finance
    Newly proposed evaluation suite; no independent evidence outside this paper is provided in the abstract.

pith-pipeline@v0.9.0 · 5803 in / 1441 out tokens · 59916 ms · 2026-05-22T05:25:21.686697+00:00 · methodology


