
arxiv: 2605.19099 · v1 · pith:KLPNNOSV · submitted 2026-05-18 · cs.AI · cs.CL · cs.MA

DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows

Pith reviewed 2026-05-20 10:12 UTC · model grok-4.3

classification cs.AI · cs.CL · cs.MA
keywords DecisionBench · emergent delegation · long-horizon agentic workflows · counterfactual ceiling · routing fidelity · GAIA · multi-turn tasks · orchestration methods

The pith

A benchmark reveals that perfect delegation among peer models could lift agent performance by 15-31 points across standard task suites.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DecisionBench as a fixed substrate for testing emergent delegation in long-horizon agentic workflows. It locks in tasks from GAIA, tau-bench, and BFCL multi-turn, an 11-model pool from seven vendors, and a simple delegation interface, then runs reference conditions to measure quality, routing fidelity, and other axes. The central result is that end-task quality stays statistically flat across awareness levels, yet a counterfactual ceiling shows perfect delegation would add 15-31 percentage points on every suite. This matters because it separates the orchestration signal from raw quality and supplies a reusable yardstick for future routers, memories, and adaptive strategies. The substrate stays neutral on how peer information is generated so new methods can be plugged in and scored directly.

Core claim

DecisionBench fixes a task suite, peer-model pool, delegation interface, skill annotations, and multi-axis metrics including a counterfactual-delegation ceiling. Reference sweeps across five awareness conditions on 23,375 instances show mean quality is indistinguishable while routing fidelity-at-1 ranges from 7.5% to 29.5%, with delivery channel mattering more than description content. The counterfactual ceiling places perfect delegation 15-31 points above observed performance on every suite, quantifying large unrealized headroom for orchestration methods.
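To make the routing fidelity-at-k figure concrete, here is a minimal sketch. This page does not give the metric's exact formula, so the definition used here is an assumption: a delegation counts as a hit when the chosen peer is among the k best peers for that instance, ranked by a reference score. The `Delegation` record and the `fidelity_at_k` function are hypothetical names, not part of the released substrate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Delegation:
    """One observed delegation decision (hypothetical record layout)."""
    chosen_peer: str
    # Assumed reference quality of each peer for this instance; higher is better.
    peer_scores: dict[str, float]


def fidelity_at_k(delegations: list[Delegation], k: int = 1) -> float:
    """Fraction of delegations whose chosen peer is among the k best-scoring peers."""
    if not delegations:
        return 0.0
    hits = 0
    for d in delegations:
        # Rank peers from best to worst reference score.
        ranked = sorted(d.peer_scores, key=d.peer_scores.get, reverse=True)
        if d.chosen_peer in ranked[:k]:
            hits += 1
    return hits / len(delegations)


log = [
    Delegation("model_a", {"model_a": 0.9, "model_b": 0.6, "model_c": 0.3}),
    Delegation("model_b", {"model_a": 0.8, "model_b": 0.7, "model_c": 0.2}),
    Delegation("model_c", {"model_a": 0.5, "model_b": 0.4, "model_c": 0.1}),
    Delegation("model_b", {"model_a": 0.2, "model_b": 0.9, "model_c": 0.4}),
]
print(fidelity_at_k(log, k=1))  # 0.5
print(fidelity_at_k(log, k=2))  # 0.75
```

Under this reading, fidelity can move a great deal while mean quality stays flat, because a hit only records whether the routing choice was good, not whether the task was eventually solved.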

What carries the argument

The counterfactual-delegation ceiling, which scores the performance that would result if every task instance were routed to its single best peer model from the fixed pool.
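A small sketch of the per-instance reading described above may help. It assumes a table of per-instance quality scores for every model in the pool; the function name `oracle_ceiling`, the score layout, and the toy values are illustrative assumptions, not the paper's pipeline.

```python
def oracle_ceiling(
    scores: dict[str, dict[str, float]], orchestrator: str
) -> tuple[float, float, float]:
    """Return (observed, ceiling, headroom) for one orchestrator.

    scores[instance_id][model] is the quality that model achieved on the instance.
    observed: mean quality when the orchestrator solves every instance itself.
    ceiling:  mean quality when each instance goes to its best-scoring model.
    """
    n = len(scores)
    observed = sum(per_model[orchestrator] for per_model in scores.values()) / n
    ceiling = sum(max(per_model.values()) for per_model in scores.values()) / n
    return observed, ceiling, ceiling - observed


# Toy pool of three models on four instances (made-up values).
scores = {
    "t1": {"model_a": 1.0, "model_b": 0.0, "model_c": 0.0},
    "t2": {"model_a": 0.0, "model_b": 1.0, "model_c": 0.0},
    "t3": {"model_a": 0.0, "model_b": 0.0, "model_c": 0.0},
    "t4": {"model_a": 1.0, "model_b": 1.0, "model_c": 0.0},
}
print(oracle_ceiling(scores, "model_a"))  # (0.5, 0.75, 0.25)
```

Because the maximum is taken after outcomes are known, this quantity is an upper bound that no policy deciding before execution is guaranteed to reach.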

If this is right

  • Quality-only metrics miss the orchestration signal entirely.
  • Channel choice (on-demand tool versus preloaded description) dominates routing accuracy at comparable quality.
  • Future learned routers, richer memories, and adaptive profile methods can be scored directly against the same ceiling.
  • The substrate isolates delegation improvements from changes in base model capabilities.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Closing even half the delegation gap could become a higher-leverage research target than further scaling individual models.
  • The benchmark could later incorporate dynamic model pools or cost-sensitive routing to test trade-offs the current fixed pool leaves implicit.
  • If the ceiling holds under richer task distributions, delegation-aware training objectives may warrant dedicated development alongside standard fine-tuning.

Load-bearing premise

The fixed task suites, model pool, and delegation interface together represent the broader space of long-horizon agentic workflows so that the observed gaps generalize.

What would settle it

Measure end-task quality on the same task instances when an oracle always delegates to the single best model for that instance and check whether the gain falls inside or outside the reported 15-31 point range.

Figures

Figures reproduced from arXiv: 2605.19099 by Ao Qu, Megan Wang, Yi Ling Yu, Yuxuan Gao, Zijian Carl Ma.

Figure 1: DecisionBench overview. Left (substrate, §3). The benchmark fixes a task suite (GAIA, τ-bench, BFCL multi-turn), a peer-model pool (11 models, 7 vendor families), a delegation interface (call_model, plus an optional read_profile channel), an annotation layer (frozen 7-skill taxonomy and deterministic step tagger; App. C), and a metric suite (quality, cost, latency, delegation rate, fidelity-at-k, self-pr…
Figure 2: Quality–cost Pareto frontier per benchmark (GAIA: general tool-use; …
Figure 3: (a) Agents (of 11) whose best aware variant strictly Pareto-dominates blind, by aware …
Figure 4: Left: per-agent decomposition of the aware-c2 GAIA lift over blind into the tool-availability …
Figure 5: (a) Cross-vendor delegation flow (aware-* aggregated): rows are orchestrator vendors, …
Figure 6: Counterfactual-delegation ceiling per (model, benchmark): actual blind quality (purple) …
Figure 7: Per-agent best-aware lift vs. blind quality, with parabolic fit. The mid-capability pattern on …
Figure 8: Per-agent GAIA quality lift over blind, by aware variant. Bars sorted by best lift across the …
Figure 9: Per-skill awareness lift, aggregated across the 11 agents (quality on tasks bucketed by their …
Original abstract

We introduce DecisionBench, a benchmark substrate for emergent delegation in long-horizon agentic workflows. The substrate fixes a task suite (GAIA, tau-bench, BFCL multi-turn), a peer-model pool (11 models, 7 vendor families), a delegation interface (call_model plus an optional read_profile channel), a deterministic skill-annotation layer, and a multi-axis metric suite covering quality, cost, latency, delegation rate, routing fidelity-at-k, vendor self-preference, and a counterfactual-delegation ceiling. The substrate is agnostic to how peer information is generated or delivered, so learned routers, richer peer memories, adaptive profile construction, and multi-step delegation can all be evaluated against it. We characterize the substrate with a five-condition reference sweep on the full pool (n=23,375 task instances). Three benchmark-level findings emerge: (i) mean end-task quality is statistically indistinguishable across the four awareness conditions (|beta| <= 0.010, p >= 0.21), so quality-only evaluation would miss the orchestration signal; (ii) routing fidelity-at-1 ranges from 7.5% to 29.5% across conditions at near-equal mean quality, with delivery channel (on-demand tool vs. preloaded description) dominating description content; (iii) a counterfactual ceiling places perfect delegation 15-31 percentage points above measured performance on every suite, locating large unrealized headroom for future orchestration methods. We release the substrate, annotation layer, reference intervention suite, analysis pipeline, and 220 per-condition run archives.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces DecisionBench, a benchmark substrate for emergent delegation in long-horizon agentic workflows. It standardizes a task suite (GAIA, tau-bench, BFCL multi-turn), a pool of 11 peer models across 7 vendor families, a delegation interface (call_model with optional read_profile), a deterministic skill-annotation layer, and a multi-axis metric suite. A five-condition reference sweep over 23,375 instances yields three findings: mean end-task quality is statistically indistinguishable across awareness conditions (|β| ≤ 0.010, p ≥ 0.21); routing fidelity-at-1 ranges 7.5–29.5% with delivery channel mattering more than description content; and a counterfactual-delegation ceiling indicates 15–31 percentage points of unrealized headroom above measured performance on every suite. The substrate, annotations, reference interventions, analysis pipeline, and run archives are released.

Significance. If the benchmark substrate and its reference characterization hold, the work supplies a reusable, interface-agnostic evaluation framework that separates quality from orchestration signals and quantifies attainable headroom for delegation methods. Explicit release of the substrate, deterministic annotation layer, 220 per-condition run archives, and analysis pipeline constitutes a concrete reproducibility asset that future work on learned routers or adaptive profiles can directly build upon.

major comments (1)
  1. [Abstract / Counterfactual ceiling results] Abstract and results on the counterfactual ceiling: the headline claim of 15–31 pp unrealized headroom rests on the counterfactual-delegation ceiling. The manuscript must explicitly state whether this ceiling is obtained by per-instance hindsight oracle selection (best model after seeing the outcome) or by a policy that decides using only the fixed interface (call_model + optional read_profile) and the deterministic skill annotations available at decision time. If the former, the reported gap mixes information unavailable to any realistic delegation policy and therefore overstates attainable headroom; a revised ceiling computed under the actual information constraints of the interface should be added.
minor comments (2)
  1. [Methods] The abstract supplies concrete statistics (|β| ≤ 0.010, p ≥ 0.21) yet the full methods section should include explicit error-bar computation, data-exclusion rules, and per-suite sample sizes so that the statistical-indistinguishability claim can be independently verified.
  2. [Results] Figure or table presenting the 15–31 pp range should report the exact per-suite values and the precise definition of the counterfactual used, rather than an aggregate range.
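One way the requested error bars could be computed is sketched below. It is a percentile bootstrap over paired per-instance quality differences between two conditions. This is a swapped-in technique chosen for brevity, not the regression that produced the reported β coefficients, and the function name, defaults, and toy scores are all assumptions.

```python
import random
from statistics import mean


def paired_bootstrap_ci(
    cond_a: list[float],
    cond_b: list[float],
    n_boot: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float, float]:
    """Return (mean difference, CI low, CI high) for cond_a minus cond_b.

    Both lists must score the same task instances in the same order.
    """
    if len(cond_a) != len(cond_b) or not cond_a:
        raise ValueError("conditions must be non-empty and paired by instance")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    diffs = [a - b for a, b in zip(cond_a, cond_b)]
    n = len(diffs)
    # Resample instances with replacement and record each resample's mean difference.
    boots = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_boot))
    low = boots[int((alpha / 2) * n_boot)]
    high = boots[int((1 - alpha / 2) * n_boot) - 1]
    return mean(diffs), low, high


# Made-up per-instance quality scores for two awareness conditions.
blind = [1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0]
aware = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0]
point, low, high = paired_bootstrap_ci(aware, blind)
print(f"mean difference {point:+.3f}, 95% CI [{low:+.3f}, {high:+.3f}]")
```

An interval that straddles zero would be consistent with the statistical-indistinguishability claim. Reporting it per suite, together with the per-suite sample sizes, would let readers verify the claim independently.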

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their thorough review and valuable feedback on the manuscript. We address the major comment regarding the counterfactual-delegation ceiling below.

Point-by-point responses
  1. Referee: [Abstract / Counterfactual ceiling results] Abstract and results on the counterfactual ceiling: the headline claim of 15–31 pp unrealized headroom rests on the counterfactual-delegation ceiling. The manuscript must explicitly state whether this ceiling is obtained by per-instance hindsight oracle selection (best model after seeing the outcome) or by a policy that decides using only the fixed interface (call_model + optional read_profile) and the deterministic skill annotations available at decision time. If the former, the reported gap mixes information unavailable to any realistic delegation policy and therefore overstates attainable headroom; a revised ceiling computed under the actual information constraints of the interface should be added.

    Authors: We agree with the referee that the current presentation of the counterfactual-delegation ceiling requires clarification. The ceiling is indeed computed using per-instance hindsight oracle selection: for each task instance, we identify the model that delivered the highest quality outcome after execution. This approach quantifies the maximum attainable performance under perfect information about outcomes, thereby highlighting the potential headroom for improved delegation strategies. However, as the referee notes, this incorporates outcome information unavailable at decision time. We will revise the manuscript to explicitly state this in the abstract and the relevant results section. Furthermore, we will add a new analysis computing a revised ceiling that operates strictly under the information constraints of the delegation interface, relying solely on the deterministic skill annotations and the call_model/read_profile channels available prior to execution. This will provide a more conservative and realistic estimate of attainable headroom for policies using only pre-decision information.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity; benchmark substrate reports direct empirical measurements

full rationale

The paper defines DecisionBench as an external benchmark substrate with fixed public task suites (GAIA, tau-bench, BFCL), a peer-model pool, and a multi-axis metric suite that explicitly includes a counterfactual-delegation ceiling. Reported results are characterizations via reference sweeps on n=23,375 instances, with findings on quality indistinguishability, routing fidelity, and measured gaps to the ceiling. No derivation chain reduces by construction to fitted parameters, self-definitions, or self-citation load-bearing steps; the ceiling is a defined upper-bound metric within the substrate rather than a predicted quantity derived from internal fits. The work remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The benchmark design rests on domain assumptions about task representativeness and model-pool diversity; no free parameters or invented entities are introduced beyond the substrate definition.

axioms (1)
  • Domain assumption: The chosen task suites (GAIA, tau-bench, BFCL multi-turn) and 11-model pool adequately sample long-horizon agentic delegation scenarios.
    Stated when defining the fixed task suite and peer-model pool in the abstract.

pith-pipeline@v0.9.0 · 5832 in / 1308 out tokens · 47749 ms · 2026-05-20T10:12:43.488442+00:00 · methodology



Organization of Appendix

The appendix is organized as follows: in App. A we list the eleven-a...

Step tagger rules (appendix excerpt)

Tool branch:

  1. Infra-only. If every tool call is to call_model or read_profile (DecisionBench infrastructure tools), tag with the private _infra_delegation marker and return early; this step does not contribute to graded skill stats.
  2. Numerical computation. If any tool name is in {calculator, evaluate_expression, eval_python, python_eval, math_eval, compute}, OR the tool's arguments contain ≥3 numerical tokens (long-digit / decimal / currency / date or time patterns), tag numerical_computation.
  3. Information retrieval. If any tool name contains web_search, search, fetch_url, browse, find_user_id, find_user, lookup, get_user_details, get_order, list_orders, get_product, list_products, get_reservation, list_reservation, search_direct_flight, search_onestop_flight, parse_pdf, extract_table, ocr, read_document, tag information_retrieval.
  4. Tool-schema adherence. Otherwise, tag tool_schema_adherence.

Non-tool branch (text-only assistant turn):

  5. Domain-policy compliance. If the suite is τ-bench AND the assistant text matches any of the regexes in Table 5 (case-insensitive), tag domain_policy_compliance.
  6. Long-input handling. If the prompt-token count for this turn is ≥15,000 (from usage.prompt_tokens, falling back to a 4-chars-per-token heuristic), tag long_input_handling.
  7. Multi-step reasoning. If suite is GAIA AND ≥2 prior tool calls in this task, tag multi_step_reasoning; on GAIA with <2 prior tool calls, tag None.
  8. Multi-turn state tracking. If suite is τ-bench or BFCL, tag multi_turn_state_tracking; otherwise None.

Policy-compliance regex (case-insensitive):

    \bagainst\s+(?:our\s+|the\s+)?policy\b
    \bnot\s+permitted\b
    \bI\s+cannot\b.{0,40}\bpolicy\b
    \btransfer.{0,20}human\s+agent
    \boutside\s+(?:my|our)\s+scope\b
    \bplease\s+confirm\b
    \bI\s+(?:will\s+)?need\s+(?:your\s+)...
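The tagging rules above can be read as a first-match-wins cascade, sketched below. Several details are assumptions rather than the released tagger: the precedence order follows the order in which the rules appear, the numeric-token pattern is a simplified stand-in for the "long-digit / decimal / currency / date or time" patterns, only the six regexes that survive untruncated in the excerpt are used, and the suite labels and the `tag_step` signature are hypothetical.

```python
import re

INFRA_TOOLS = {"call_model", "read_profile"}
NUMERIC_TOOLS = {"calculator", "evaluate_expression", "eval_python",
                 "python_eval", "math_eval", "compute"}
RETRIEVAL_SUBSTRINGS = (
    "web_search", "search", "fetch_url", "browse", "find_user_id", "find_user",
    "lookup", "get_user_details", "get_order", "list_orders", "get_product",
    "list_products", "get_reservation", "list_reservation", "search_direct_flight",
    "search_onestop_flight", "parse_pdf", "extract_table", "ocr", "read_document",
)
# Assumption: a simplified numeric-token pattern, not the tagger's actual one.
NUMERIC_TOKEN = re.compile(r"\d[\d.,:/-]*")
# Only the regexes that appear in full in the excerpt; the truncated one is omitted.
POLICY_PATTERNS = [re.compile(p, re.IGNORECASE) for p in (
    r"\bagainst\s+(?:our\s+|the\s+)?policy\b",
    r"\bnot\s+permitted\b",
    r"\bI\s+cannot\b.{0,40}\bpolicy\b",
    r"\btransfer.{0,20}human\s+agent",
    r"\boutside\s+(?:my|our)\s+scope\b",
    r"\bplease\s+confirm\b",
)]


def tag_step(suite, tool_calls, assistant_text="", prompt_tokens=0, prior_tool_calls=0):
    """Tag one assistant turn. tool_calls is a list of (tool_name, arguments_string)."""
    if tool_calls:  # tool branch
        names = [name for name, _ in tool_calls]
        if all(name in INFRA_TOOLS for name in names):
            return "_infra_delegation"
        many_numbers = any(len(NUMERIC_TOKEN.findall(args)) >= 3 for _, args in tool_calls)
        if any(name in NUMERIC_TOOLS for name in names) or many_numbers:
            return "numerical_computation"
        if any(sub in name for name in names for sub in RETRIEVAL_SUBSTRINGS):
            return "information_retrieval"
        return "tool_schema_adherence"
    # Non-tool branch (text-only assistant turn).
    if suite == "tau-bench" and any(p.search(assistant_text) for p in POLICY_PATTERNS):
        return "domain_policy_compliance"
    if prompt_tokens >= 15_000:
        return "long_input_handling"
    if suite == "gaia":
        return "multi_step_reasoning" if prior_tool_calls >= 2 else None
    if suite in {"tau-bench", "bfcl"}:
        return "multi_turn_state_tracking"
    return None


print(tag_step("tau-bench", [("get_order", '{"id": "A"}')]))  # information_retrieval
print(tag_step("tau-bench", [], assistant_text="That is against our policy."))  # domain_policy_compliance
print(tag_step("gaia", [], prior_tool_calls=3))  # multi_step_reasoning
```

Because every rule is a pure function of the turn's tool names, arguments, text, and token count, rerunning the tagger on the same trajectory yields the same tag, which is what makes the annotation layer deterministic.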

Assumptions behind the counterfactual ceiling (appendix excerpt)

  1. Single-step delegation, perfect skill identification. For each task we tag the dominant skill and assume the agent delegates the entire task to the Stage-1-best peer for that skill in a single call_model. Real GAIA tasks decompose into 3–7 steps with potentially different dominant skills; a multi-step ceiling that allowed per-step delegation would be high...
  2. Peer answers at its Stage-1 pass rate. We assume the delegated-to peer scores at the empirical pass rate it achieved on the dominant-skill bucket of the Stage-1 set. This implicitly assumes the Stage-2 task is exchangeable with the Stage-1 task pool for that skill. Per-task difficulty variation inside a skill bucket is real (e.g., long-input-handling on a...
  3. No context-loss penalty. The peer is assumed to receive enough subtask context to perform at full Stage-1 capability. In practice the orchestrator must compress its trajectory state into the call_model subtask string and the peer answers without seeing earlier turns; we sensitivity-test this in Table 11 below.
  4. No coordination cost. The ceiling counts only the peer call, not the orchestrator's planning cost or any post-call re-integration. In practice an orchestrator pays for both.

  Peer-realization rate                   GAIA     τ-bench   BFCL
  100% of Stage-1 (reported in §6.6)      +0.269   +0.153    +0.313
  90% of Stage-1 (mild context loss)      +0.230   +0.123    +0.272
  80% of Stage-1 (heavy context loss...
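The assumptions above suggest a ceiling computed from skill-level pass rates rather than per-instance outcomes. The sketch below is one possible reading, not the paper's analysis pipeline: each task is credited with the best Stage-1 pass rate for its dominant skill, scaled by a peer-realization rate, and the lift is that expected score minus the observed quality. The function name, data layout, and all numbers are made up for illustration.

```python
def skill_ceiling_lift(
    tasks: list[tuple[str, float]],
    stage1_pass: dict[str, dict[str, float]],
    realization: float = 1.0,
) -> float:
    """Mean lift of the skill-routed ceiling over observed quality.

    tasks: (dominant_skill, observed_quality) per task.
    stage1_pass[skill][peer]: the peer's Stage-1 pass rate on that skill bucket.
    realization: fraction of its Stage-1 pass rate the peer is assumed to reach.
    """
    total = 0.0
    for skill, observed in tasks:
        best_rate = max(stage1_pass[skill].values())  # Stage-1-best peer for the skill
        total += realization * best_rate - observed
    return total / len(tasks)


# Hypothetical pass rates and task outcomes.
stage1_pass = {
    "numerical_computation": {"peer_a": 0.8, "peer_b": 0.5},
    "information_retrieval": {"peer_a": 0.4, "peer_b": 0.6},
}
tasks = [
    ("numerical_computation", 1.0),
    ("numerical_computation", 0.0),
    ("information_retrieval", 0.0),
    ("information_retrieval", 1.0),
]
for rate in (1.0, 0.9, 0.8):
    print(rate, round(skill_ceiling_lift(tasks, stage1_pass, rate), 3))
# 1.0 0.2
# 0.9 0.13
# 0.8 0.06
```

In this form, lowering the realization rate subtracts a fixed share of the best peer's pass rate instead of shrinking the lift proportionally, so a modest context-loss penalty can remove a large part of the headroom when observed quality is already close to the peer's rate.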