pith. machine review for the scientific record.

arxiv: 2604.11304 · v1 · submitted 2026-04-13 · 💻 cs.AI

Recognition: unknown

BankerToolBench: Evaluating AI Agents in End-to-End Investment Banking Workflows

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:52 UTC · model grok-4.3

classification 💻 cs.AI
keywords BankerToolBench · AI agents · investment banking · benchmark · LLM evaluation · professional workflows · financial modeling · agentic AI

The pith

Even the best frontier AI model fails nearly half the rubric criteria in end-to-end investment banking workflows and produces no client-ready outputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces BankerToolBench, a benchmark of realistic junior investment banking tasks built with direct input from 502 professionals at leading firms. Agents must navigate data rooms, query market data and SEC filings tools, then produce multi-file deliverables such as Excel financial models, PowerPoint pitch decks, and reports. Even the strongest tested model, GPT-5.4, misses nearly half the 100-plus rubric items that veteran bankers use to judge quality, and none of its outputs receive a client-ready rating. These tasks can require up to 21 hours of human effort, so the gap directly affects whether high-value analytical work can be delegated to AI. The benchmark supplies automated scoring and failure analysis that highlights obstacles such as inconsistent information across generated artifacts.

Core claim

BankerToolBench evaluates AI agents on complete analytical workflows that junior investment bankers perform daily. The benchmark requires agents to interpret senior requests, retrieve and integrate data from industry platforms, and assemble polished deliverables that meet professional standards. When nine frontier models were tested, the top performer satisfied only slightly more than half the rubric criteria established by practicing bankers, and zero outputs were judged ready for client delivery.

What carries the argument

BankerToolBench is an open-source benchmark consisting of end-to-end junior banker workflows scored against more than 100 rubric criteria defined by veteran investment bankers to measure practical stakeholder utility.
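To make the scoring mechanics concrete, here is a minimal sketch of rubric-based grading: per-criterion pass/fail judgments are averaged into an overall task score and into per-category scores (as in Figure 7). The RubricCriterion fields and category names below are illustrative assumptions, not the schema of the released benchmark.

    # Illustrative sketch only: per-criterion pass/fail judgments aggregated
    # into a task score and per-category scores. Field names and categories
    # are assumptions, not the released BankerToolBench schema.
    from dataclasses import dataclass

    @dataclass
    class RubricCriterion:
        description: str   # e.g. "DCF uses a mid-year discounting convention"
        category: str      # e.g. "modeling", "formatting", "consistency"
        passed: bool       # verifier's pass/fail judgment for this criterion

    def task_score(criteria: list[RubricCriterion]) -> float:
        """Fraction of rubric criteria the deliverable satisfies (0.0 to 1.0)."""
        return sum(c.passed for c in criteria) / len(criteria) if criteria else 0.0

    def category_scores(criteria: list[RubricCriterion]) -> dict[str, float]:
        """Score restricted to the criteria of each category."""
        by_cat: dict[str, list[RubricCriterion]] = {}
        for c in criteria:
            by_cat.setdefault(c.category, []).append(c)
        return {cat: task_score(cs) for cat, cs in by_cat.items()}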

If this is right

  • Cross-artifact consistency failures must be resolved before AI agents can reliably handle these multi-document workflows.
  • The benchmark supplies an automated method for tracking progress on agentic capabilities in high-stakes professional settings.
  • Current frontier models leave unrealized the economic value of automating tasks that take humans up to 21 hours.
  • Failure modes identified in the evaluation point to concrete directions for improving agent performance on complex, tool-using tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Benchmarks modeled on BankerToolBench could be created for other labor-intensive professions to measure where AI delegation becomes viable.
  • Specialized training on maintaining consistency across Excel, PowerPoint, and report formats may be required to close the observed gaps.
  • Success on this benchmark could serve as a measurable proxy for broader progress in reliable, multi-step agentic systems.
  • The zero client-ready rate suggests that even high rubric scores may still miss subtle professional standards that only domain experts can fully assess.

Load-bearing premise

The 100-plus rubric criteria and task selection, developed through collaboration with 502 investment bankers, accurately reflect client expectations and typical junior banker workflows.

What would settle it

A model that passes more than 90 percent of the rubric criteria across multiple BankerToolBench tasks and earns at least some client-ready ratings from practicing bankers would show the reported performance gap has closed.

Figures

Figures reproduced from arXiv: 2604.11304 by Abdullah Arif, Anish Athalye, Asrith Devalaraju, Collin Schweiker, Curtis Northcutt, Elaine Lau, Francisco Guzmán, Guram Gogia, Haemi Nam, Hui Wen Goh, Jonas Mueller, Markus Dücker, Nasim Borazjanizadeh, Punit Arani, Ray Epps, Ronak Chaudhary, Rosemary Wei, Saed Qunbar, Sahil Bhaiwala, Samuel Eshun Danquah, Scott Millslagle, Skyler Wang, Ulyana Tkachenko, Vaibhav Kumar, Varsha Sandadi, Vijay Karumathil, Yi Liu.

Figure 1
Figure 1: An example BTB task (more examples and details in Appendix G). view at source ↗
Figure 2
Figure 2: BankerToolBench Overview. Investment bankers provide a prompt reflecting a typical request they receive at work, and define a grading rubric for the requested deliverables. AI agents execute the task in an RL Environment that includes banker-provided files and API tools, and generate deliverables (spreadsheets, slides, documents) that are automatically evaluated against the rubric by a verifier to determin… view at source ↗
Figure 3
Figure 3: Distribution of the number of rubric criteria per task (left), categories of criteria (right). view at source ↗
Figure 4
Figure 4: Pipeline used to create BTB tasks and ensure data quality. All contributing experts and project leads have 2+ years of investment banking experience. Before creating tasks, contributing experts had to pass a rigorous assessment and complete rubrics training, with additional assessments and training for reviewers. view at source ↗
Figure 5
Figure 5: Percentage of tasks where each model’s deliverable is considered acceptable (Pass@1), or where the best of the 3 runs is (Pass@3). Reported Pass@1 values are the mean across 3 runs, with error bars indicating standard deviations. view at source ↗
Figure 6
Figure 6: Comparing how models perform on BTB. Left: Score across all BTB tasks achieved by different models. Reported values are the mean across 3 runs with bars showing standard deviations. Right: Pairwise win rate, the percentage of tasks where the model indicated along the rows achieves a higher Score than the model indicated along the columns. Blue cells indicate pairs where the row model is superior. view at source ↗
Figure 7
Figure 7: Model performance across the rubric criteria of each category. Each panel shows the score computed from per-criterion pass/fail judgments across all tasks but only considering criteria in the indicated category. view at source ↗
Figure 8
Figure 8: Performance of different models, across tasks with different characteristics: deliverable type, provided input file type, and IB Product Group. Reported results are the mean across 3 runs with error bars indicating standard deviations. Appendix A.3 provides results for additional models. view at source ↗
Figure 9
Figure 9: Effect of progressively adding banker-specific domain and formatting context on scores achieved by various models. Reported values are the mean across 3 runs, with error bars showing standard deviations. view at source ↗
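Figures 5 and 6 report three aggregate metrics: Pass@1 (acceptability averaged over 3 independent runs), Pass@3 (best of the 3 runs), and a pairwise win rate between models. A minimal sketch of those aggregations follows, assuming each run yields a per-task acceptability flag and a per-task Score; the acceptability rule itself belongs to the paper's verifier and is not reproduced here.

    # Illustrative aggregation of per-run results into Pass@1, Pass@3, and
    # pairwise win rate. Data layout is assumed: acceptable[r][t] says whether
    # run r's deliverable for task t was judged acceptable.
    from statistics import mean

    def pass_at_1(acceptable: list[list[bool]]) -> float:
        """Mean fraction of tasks judged acceptable, averaged over the runs."""
        return mean(mean(run) for run in acceptable)

    def pass_at_3(acceptable: list[list[bool]]) -> float:
        """Fraction of tasks where the best of the runs is acceptable."""
        n_tasks = len(acceptable[0])
        return mean(any(run[t] for run in acceptable) for t in range(n_tasks))

    def pairwise_win_rate(scores_a: list[float], scores_b: list[float]) -> float:
        """Share of tasks where model A's Score exceeds model B's Score."""
        return mean(a > b for a, b in zip(scores_a, scores_b))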
read the original abstract

Existing AI benchmarks lack the fidelity to assess economically meaningful progress on professional workflows. To evaluate frontier AI agents in a high-value, labor-intensive profession, we introduce BankerToolBench (BTB): an open-source benchmark of end-to-end analytical workflows routinely performed by junior investment bankers. To develop an ecologically valid benchmark grounded in representative work environments, we collaborated with 502 investment bankers from leading firms. BTB requires agents to execute senior banker requests by navigating data rooms, using industry tools (market data platform, SEC filings database), and generating multi-file deliverables--including Excel financial models, PowerPoint pitch decks, and PDF/Word reports. Completing a BTB task takes bankers up to 21 hours, underscoring the economic stakes of successfully delegating this work to AI. BTB enables automated evaluation of any LLM or agent, scoring deliverables against 100+ rubric criteria defined by veteran investment bankers to capture stakeholder utility. Testing 9 frontier models, we find that even the best-performing model (GPT-5.4) fails nearly half of the rubric criteria and bankers rate 0% of its outputs as client-ready. Our failure analysis reveals key obstacles (such as breakdowns in cross-artifact consistency) and improvement directions for agentic AI in high-stakes professional workflows.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces BankerToolBench (BTB), an open-source benchmark for AI agents performing end-to-end junior investment banking workflows. Developed via collaboration with 502 bankers from leading firms, BTB tasks require agents to navigate data rooms, use tools like market data platforms and SEC filings databases, and produce multi-file deliverables (Excel models, PowerPoint decks, PDF/Word reports). Automated scoring against 100+ rubric criteria defined by veteran bankers shows that even the best model (GPT-5.4) fails nearly half the criteria, with bankers rating 0% of outputs as client-ready; the work includes failure analysis on obstacles such as cross-artifact consistency.

Significance. If the tasks and rubrics accurately reflect real workflows, the results highlight substantial limitations in frontier AI agents for high-stakes professional tasks with clear economic value (workflows taking up to 21 hours). The practitioner collaboration and open-source release provide a valuable resource for evaluating agentic systems in specialized domains, and the failure modes identified could inform targeted improvements.

major comments (2)
  1. [Methods] Methods section: The central claims rest on the ecological validity of tasks and 100+ rubric criteria developed through collaboration with 502 bankers, yet the manuscript provides no details on elicitation methods, firm/task sampling, inter-rater reliability, pilot validation, or external review of the criteria. Without these, the reported failure rates (~50% for GPT-5.4) and 0% client-ready assessment cannot be independently interpreted or verified as representative of client expectations.
  2. [Evaluation and Results] Evaluation and Results: The automated scoring procedure against the rubric is described at a high level, but the paper does not specify how scoring is implemented for complex artifacts (e.g., Excel consistency checks or PowerPoint content evaluation), nor does it release the full rubric or scoring code. This makes the quantitative results difficult to reproduce and raises questions about potential biases in the automated component.
minor comments (2)
  1. [Abstract] The abstract and introduction refer to 'GPT-5.4' without clarifying whether this is a specific released model, internal version, or placeholder; this should be disambiguated for readers.
  2. [Results] The table or figure presenting per-model failure rates would benefit from including confidence intervals or variance across tasks to strengthen the comparative claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which identifies key areas where additional transparency will strengthen the manuscript. We address each major comment below and will revise the paper accordingly.

read point-by-point responses
  1. Referee: [Methods] Methods section: The central claims rest on the ecological validity of tasks and 100+ rubric criteria developed through collaboration with 502 bankers, yet the manuscript provides no details on elicitation methods, firm/task sampling, inter-rater reliability, pilot validation, or external review of the criteria. Without these, the reported failure rates (~50% for GPT-5.4) and 0% client-ready assessment cannot be independently interpreted or verified as representative of client expectations.

    Authors: We agree that the current Methods section is insufficiently detailed for independent assessment of ecological validity. In the revised manuscript we will expand the section to document the elicitation process (structured interviews and workflow capture protocols used with the 502 bankers), the sampling approach across firms and task types, inter-rater reliability statistics for the rubric criteria, pilot-study results, and the external review steps performed by veteran bankers. These additions will allow readers to evaluate how representative the tasks and rubrics are of actual client expectations. revision: yes

  2. Referee: [Evaluation and Results] Evaluation and Results: The automated scoring procedure against the rubric is described at a high level, but the paper does not specify how scoring is implemented for complex artifacts (e.g., Excel consistency checks or PowerPoint content evaluation), nor does it release the full rubric or scoring code. This makes the quantitative results difficult to reproduce and raises questions about potential biases in the automated component.

    Authors: We acknowledge that the automated scoring description is too high-level and that the full rubric and code were not released with the initial submission. We will add a dedicated subsection detailing the scoring implementation for multi-file artifacts, including the specific checks used for Excel model consistency, data integrity, and PowerPoint narrative/layout evaluation. We will also release the complete rubric and the scoring codebase as part of the open-source benchmark package to support full reproducibility and external scrutiny of the automated component. revision: yes
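To illustrate the kind of cross-artifact consistency check at issue here, a small sketch follows: it reads a headline figure from the Excel model and confirms the pitch deck quotes the same number. The file names, sheet/cell location, and string-matching rule are assumptions for illustration only; the authors' verifier is described at a high level and not reproduced.

    # Illustrative cross-artifact consistency check: read a headline figure
    # from the Excel model and confirm the pitch deck quotes the same number.
    # File names, sheet/cell location, and the matching rule are assumptions
    # for this sketch, not the authors' verifier.
    from openpyxl import load_workbook
    from pptx import Presentation

    def excel_value(path: str, sheet: str, cell: str) -> float:
        wb = load_workbook(path, data_only=True)  # read cached values, not formulas
        return float(wb[sheet][cell].value)

    def deck_text(path: str) -> str:
        prs = Presentation(path)
        return " ".join(
            shape.text_frame.text
            for slide in prs.slides
            for shape in slide.shapes
            if shape.has_text_frame
        )

    def consistent(model_path: str, deck_path: str) -> bool:
        exit_multiple = excel_value(model_path, "DCF", "C21")  # assumed location
        return f"{exit_multiple:.1f}x" in deck_text(deck_path)  # e.g. "15.0x"

    # Hypothetical usage:
    # print(consistent("model.xlsx", "pitch_deck.pptx"))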

Circularity Check

0 steps flagged

No circularity: empirical benchmark with independent evaluation results

full rationale

The paper introduces an empirical benchmark (BankerToolBench) and reports direct performance measurements of frontier models against 100+ rubric criteria. No equations, derivations, fitted parameters, or self-referential definitions appear in the provided text. Task and rubric construction via banker collaboration is presented as an input to the evaluation rather than a quantity derived from the reported outcomes. The failure rates and client-readiness claims are therefore not reduced to the authors' own parameters or prior self-citations by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the assumption that the selected tasks and rubrics, developed with input from 502 bankers, are representative of real junior banker work and client standards. No free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: Tasks and 100+ rubric criteria developed with 502 investment bankers accurately capture ecologically valid junior banker workflows and stakeholder utility.
    This assumption underpins the claim that the benchmark measures economically meaningful progress.

pith-pipeline@v0.9.0 · 5651 in / 1324 out tokens · 29558 ms · 2026-05-10T15:52:37.563422+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

24 extracted references · 15 canonical work pages · 8 internal anchors

  1. [1]

    A. F. Akyürek, A. Gosai, C. B. C. Zhang, V. Gupta, J. Jeong, A. Gunjal, T. Rabbani, M. Mazzone, D. Randolph, M. M. Meymand, et al. Prbench: Large-scale expert rubrics for evaluating high-stakes professional reasoning. arXiv preprint arXiv:2511.11562,

  2. [2]

    HealthBench: Evaluating Large Language Models Towards Improved Human Health

    URL https://www.anthropic.com/research/anthropic-economic-index-january-2026-report. R. K. Arora, J. Wei, R. S. Hicks, P. Bowman, J. Quiñonero-Candela, F. Tsimpourlas, M. Sharman, M. Shah, A. Vallone, A. Beutel, et al. Healthbench: Evaluating large language models towards improved human health. arXiv preprint arXiv:2505.08775,

  3. [3]

    Y. Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen, A. Goldie, A. Mirhoseini, C. McKinnon, C. Chen, C. Olsson, C. Olah, D. Hernandez, D. Drain, D. Ganguli, D. Li, E. Tran- Johnson, E. Perez, J. Kerr, J. Mueller, J. Ladish, J. Landau, K. Ndousse, K. Lukosuite, L. Lovitt, M. Sellitto, N. Elhage, N. Schiefer, N. Mercado, N. DasSarma, R. ...

  4. [4]

    Finance agent benchmark: Benchmarking llms on real-world financial research tasks. arXiv preprint arXiv:2508.00828,

    A. Bigeard, L. Nashold, R. Krishnan, and S. Wu. Finance agent benchmark: Benchmarking llms on real-world financial research tasks. arXiv preprint arXiv:2508.00828,

  5. [5]

    Z. Chen, W. Chen, C. Smiley, S. Shah, I. Borova, D. Langdon, R. Moussa, M. Beane, T.-H. Huang, B. R. Routledge, et al. FinQA: A dataset of numerical reasoning over financial data. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing,

  6. [6]

    X. Guo, H. Xia, Z. Liu, H. Cao, Z. Yang, Z. Liu, S. Wang, J. Niu, C. Wang, Y. Wang, et al. Fineval: A chinese financial domain knowledge evaluation benchmark for large language models. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies,

  7. [7]

    Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems

    S. Levine, A. Kumar, G. Tucker, and J. Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643,

  8. [8]

    Repomod-bench: A benchmark for code repository modernization via implementation-agnostic testing. arXiv preprint arXiv:2602.22518,

    X. Li, N. Ben-Israel, Y. Raz, B. Ahmed, D. Serebro, and A. Raux. Repomod-bench: A benchmark for code repository modernization via implementation-agnostic testing. arXiv preprint arXiv:2602.22518,

  9. [9]

    URL https://www.anthropic.com/research/economic-index-march-2026-report. A. J. Menkveld. High frequency trading and the new market makers. Journal of Financial Markets, 16(4):712–740,

  10. [10]

    Accessed: 2026-04-29

    T. Patwardhan, R. Dias, E. Proehl, G. Kim, M. Wang, O. Watkins, S. P. Fishman, M. Aljubeh, P. Thacker, L. Fauconnet, et al. GDPval: Evaluating AI model performance on real-world economically valuable tasks. arXiv preprint arXiv:2510.04374,

  11. [11]

    C. Peng, L. Wu, and Y. Zhou. Re-evaluating evmbench: Are ai agents ready for smart contract security? arXiv preprint arXiv:2603.10795,

  12. [13]

    URL https://arxiv.org/abs/2604.07666. Y. Qin, S. Liang, Y. Ye, K. Zhu, L. Yan, Y. Lu, Y. Lin, X. Cong, X. Tang, B. Qian, et al. ToolLLM: Facilitating large language models to master 16000+ real-world APIs. In International Conference on Learning Representations,

  13. [14]

    Autorubric: Unifying Rubric-based LLM Evaluation

    D. Rao and C. Callison-Burch. Autorubric: A unified framework for rubric-based LLM evaluation. arXiv preprint arXiv:2603.00077,

  14. [15]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300,

  15. [16]

    H. Son. Goldman Sachs taps Anthropic’s Claude to automate accounting, compliance roles. https://www.cnbc.com/2026/02/06/anthropic-goldman-sachs-ai-model-accounting.html, February

  16. [17]

    X. Wang, B. Li, Y. Song, F. F. Xu, X. Tang, M. Zhuge, J. Pan, Y. Song, B. Li, J. Singh, H. H. Tran, F. Li, R. Ma, M. Zheng, B. Qian, Y. Shao, N. Muennighoff, Y. Zhang, B. Hui, J. Lin, R. Brennan, H. Peng, H. Ji, and G. Neubig. OpenHands: An open platform for AI software developers as generalist agents. In International Conference on Learning Representation...

  17. [18]

    A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, C. Zheng, D. Liu, F. Zhou, F. Huang, F. Hu, H. Ge, H. Wei, H. Lin, J. Tang, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Zhou, J. Lin, K. Dang, K. Bao, K. Yang, L. Yu, L. Deng, M. Li, M. Xue, M. Li, P. Zhang, P. Wang, Q. Zhu, R. Men, R. Gao, S. Liu, S. Luo, T. ...

  18. [19]

    S. Yao, N. Shinn, P. Razavi, and K. Narasimhan. 𝜏-bench: A benchmark for tool-agent-user interaction in real-world domains. arXiv preprint arXiv:2406.12045,

  19. [20]

    Y. Zhao, Y. Li, C. Li, and R. Zhang. Multihiertt: Numerical reasoning over multi hierarchical tabular and textual data. arXiv preprint arXiv:2206.01347,

  20. [21]

    We consider using the thresholded Score as a predictor to approximate these labels, which offer a more interpretable quality measure of the deliverable. Over this ratings data, we assess the quality of our predictor via its relative risk: Relative Risk = Pr(Acceptable | Score ≥ T) / Pr(Acceptable | Score < T). We select the threshold T which leads to the best predictor ...

  21. [22]

    import <pkg>; help(<pkg>)

    and the built-in agent harnesses without modification, running evaluations via: harbor run . Agent harnesses use their own built-in functionality and heuristics to determine when to terminate, handle tool call failures (e.g., due to incorrect arguments supplied) or error signals, and manage time/step budgets, etc. C.1. Task prompt template Agent harnesses ...

  22. [23]

    E.1. Details of Construct Definition and Workflow Taxonomy To identify the workflows performed by junior investment bankers that are common and high value (Section 4.1), we conducted 2 surveys of the broader IB population: a Job Task Analysis (JTA) survey, and a survey on AI-value in their job. We ensured representative survey samples by stratifying recru...

  23. [24]

    RTX & GD - Merger Mode

    should have a header labeled “RTX & GD - Merger Mode.” All formulas should be linked to the cell (i.e., C21) and not named cells. The Model should include these sections in the following order. List of all the transaction assumptions Transaction source and uses Write-up of assets and a purchase price allocation on the balance sheet. The Model should includ...

  24. [25]

    The model should include a detailed Net Cost Mix section

    Valuation assumptions: - WACC of 12.0% - Mid-year discounting convention - Terminal value calculated using a 15x EV/EBITDA exit multiple - Use share price as of December 31, 2024, ($81.26) The model should include a detailed net revenue mix section based on Uber’s revenue by service line (Mobility, Delivery, and Freight) and also by geography (United Stat...