DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows

arxiv: 2605.19099 · v1 · pith:KLPNNOSV · submitted 2026-05-18 · cs.AI · cs.CL · cs.MA

Yuxuan Gao, Megan Wang, Yi Ling Yu, Zijian Carl Ma, Ao Qu

Pith reviewed 2026-05-20 10:12 UTC · model grok-4.3

keywords DecisionBench · emergent delegation · long-horizon agentic workflows · counterfactual ceiling · routing fidelity · GAIA · multi-turn tasks · orchestration methods

The pith

A benchmark reveals that perfect delegation among peer models could lift agent performance by 15-31 points across standard task suites.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DecisionBench as a fixed substrate for testing emergent delegation in long-horizon agentic workflows. It locks in tasks from GAIA, tau-bench, and BFCL multi-turn, an 11-model pool from seven vendors, and a simple delegation interface, then runs reference conditions to measure quality, routing fidelity, and other axes. The central result is that end-task quality stays statistically flat across awareness levels, yet a counterfactual ceiling shows perfect delegation would add 15-31 percentage points on every suite. This matters because it separates the orchestration signal from raw quality and supplies a reusable yardstick for future routers, memories, and adaptive strategies. The substrate stays neutral on how peer information is generated so new methods can be plugged in and scored directly.

Core claim

DecisionBench fixes a task suite, peer-model pool, delegation interface, skill annotations, and multi-axis metrics including a counterfactual-delegation ceiling. A five-condition reference sweep over 23,375 instances shows that mean quality is statistically indistinguishable across the four awareness conditions, while routing fidelity-at-1 ranges from 7.5% to 29.5%, with delivery channel mattering more than description content. The counterfactual ceiling places perfect delegation 15-31 percentage points above observed performance on every suite, quantifying large unrealized headroom for orchestration methods.

What carries the argument

The counterfactual-delegation ceiling, which scores the performance that would result if every task instance were routed to its single best peer model from the fixed pool.
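
The ceiling described above can be illustrated with a minimal sketch. It assumes a flat table of per-instance quality scores for every peer model; the record layout, the function names, and the toy model names (`model_a`, `model_b`) are hypothetical and are not taken from the DecisionBench release.

```python
def counterfactual_ceiling(scores):
    """Mean quality if each instance went to its best-scoring peer (hindsight).

    scores: iterable of (instance_id, model_name, quality) tuples.
    Returns (ceiling, best_peer), where best_peer maps instance_id -> model_name.
    Ties keep the first model seen for that instance.
    """
    best = {}
    for instance_id, model, quality in scores:
        if instance_id not in best or quality > best[instance_id][1]:
            best[instance_id] = (model, quality)
    if not best:
        raise ValueError("no scores supplied")
    ceiling = sum(q for _, q in best.values()) / len(best)
    return ceiling, {i: m for i, (m, _) in best.items()}


def mean_quality(scores, model_name):
    """Observed mean quality of one model working alone."""
    values = [q for _, m, q in scores if m == model_name]
    if not values:
        raise ValueError(f"no scores for {model_name}")
    return sum(values) / len(values)


# Toy data (hypothetical): two peers, four task instances, binary quality.
records = [
    ("t1", "model_a", 1.0), ("t1", "model_b", 0.0),
    ("t2", "model_a", 0.0), ("t2", "model_b", 1.0),
    ("t3", "model_a", 0.0), ("t3", "model_b", 0.0),
    ("t4", "model_a", 1.0), ("t4", "model_b", 1.0),
]

ceiling, best_peer = counterfactual_ceiling(records)
observed = mean_quality(records, "model_a")
headroom_pp = 100 * (ceiling - observed)  # gap in percentage points
print(f"ceiling={ceiling:.2f} observed={observed:.2f} headroom={headroom_pp:.1f} pp")
```

In this toy case no single model solves `t3`, so the ceiling stays below 1.0 even under perfect routing; the headroom is the gap between the ceiling and what one model achieves on its own.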

If this is right

Quality-only metrics miss the orchestration signal entirely.
Channel choice (on-demand tool versus preloaded description) dominates routing accuracy at comparable quality.
Future learned routers, richer memories, and adaptive profile methods can be scored directly against the same ceiling.
The substrate isolates delegation improvements from changes in base model capabilities.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Closing even half the delegation gap could become a higher-leverage research target than further scaling individual models.
The benchmark could later incorporate dynamic model pools or cost-sensitive routing to test trade-offs the current fixed pool leaves implicit.
If the ceiling holds under richer task distributions, delegation-aware training objectives may warrant dedicated development alongside standard fine-tuning.

Load-bearing premise

The fixed task suites, model pool, and delegation interface together represent the broader space of long-horizon agentic workflows so that the observed gaps generalize.

What would settle it

Measure end-task quality on the same task instances when an oracle always delegates to the single best model for that instance and check whether the gain falls inside or outside the reported 15-31 point range.

Figures

Figures reproduced from arXiv: 2605.19099 by Ao Qu, Megan Wang, Yi Ling Yu, Yuxuan Gao, Zijian Carl Ma.

**Figure 1.** DecisionBench overview. Left (substrate, §3). The benchmark fixes a task suite (GAIA, τ-bench, BFCL multi-turn), a peer-model pool (11 models, 7 vendor families), a delegation interface (call_model, plus an optional read_profile channel), an annotation layer (frozen 7-skill taxonomy and deterministic step tagger; App. C), and a metric suite (quality, cost, latency, delegation rate, fidelity-at-k, self-pr…

**Figure 2.** Quality–cost Pareto frontier per benchmark (GAIA: general tool-use; …

**Figure 3.** (a) Agents (of 11) whose best aware variant strictly Pareto-dominates blind, by aware …

**Figure 4.** Left: per-agent decomposition of the aware-c2 GAIA lift over blind into the tool-availability …

**Figure 5.** (a) Cross-vendor delegation flow (aware-* aggregated): rows are orchestrator vendors, …

**Figure 6.** Counterfactual-delegation ceiling per (model, benchmark): actual blind quality (purple) …

**Figure 7.** Per-agent best-aware lift vs. blind quality, with parabolic fit. The mid-capability pattern on …

**Figure 8.** Per-agent GAIA quality lift over blind, by aware variant. Bars sorted by best lift across the …

**Figure 9.** Per-skill awareness lift, aggregated across the 11 agents (quality on tasks bucketed by their …

Original abstract

We introduce DecisionBench, a benchmark substrate for emergent delegation in long-horizon agentic workflows. The substrate fixes a task suite (GAIA, tau-bench, BFCL multi-turn), a peer-model pool (11 models, 7 vendor families), a delegation interface (call_model plus an optional read_profile channel), a deterministic skill-annotation layer, and a multi-axis metric suite covering quality, cost, latency, delegation rate, routing fidelity-at-k, vendor self-preference, and a counterfactual-delegation ceiling. The substrate is agnostic to how peer information is generated or delivered, so learned routers, richer peer memories, adaptive profile construction, and multi-step delegation can all be evaluated against it. We characterize the substrate with a five-condition reference sweep on the full pool (n=23,375 task instances). Three benchmark-level findings emerge: (i) mean end-task quality is statistically indistinguishable across the four awareness conditions (|beta| <= 0.010, p >= 0.21), so quality-only evaluation would miss the orchestration signal; (ii) routing fidelity-at-1 ranges from 7.5% to 29.5% across conditions at near-equal mean quality, with delivery channel (on-demand tool vs. preloaded description) dominating description content; (iii) a counterfactual ceiling places perfect delegation 15-31 percentage points above measured performance on every suite, locating large unrealized headroom for future orchestration methods. We release the substrate, annotation layer, reference intervention suite, analysis pipeline, and 220 per-condition run archives.
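
Routing fidelity-at-k is named in the abstract but not defined on this page. The sketch below assumes one plausible reading: the share of delegation events whose chosen peer falls among the top k peers in a reference ranking for that task. The data layout, the function name, and the toy model names are hypothetical, and the paper's actual definition may differ.

```python
def fidelity_at_k(delegations, rankings, k=1):
    """Fraction of delegations that picked a peer ranked in the top k.

    delegations: list of (task_id, chosen_peer) pairs, one per delegation event.
    rankings: dict mapping task_id -> list of peers, best first (assumed given).
    Returns 0.0 when there are no delegation events.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not delegations:
        return 0.0
    hits = sum(1 for task_id, peer in delegations if peer in rankings[task_id][:k])
    return hits / len(delegations)


# Toy data (hypothetical): four delegation events over a three-model pool.
delegations = [
    ("t1", "model_b"),
    ("t2", "model_a"),
    ("t3", "model_c"),
    ("t4", "model_a"),
]
rankings = {
    "t1": ["model_b", "model_a", "model_c"],
    "t2": ["model_c", "model_a", "model_b"],
    "t3": ["model_a", "model_b", "model_c"],
    "t4": ["model_a", "model_c", "model_b"],
}

f1 = fidelity_at_k(delegations, rankings, k=1)
f2 = fidelity_at_k(delegations, rankings, k=2)
print(f"fidelity@1={f1:.2f} fidelity@2={f2:.2f}")
```

Under this reading the metric depends only on which peer was chosen, not on whether the task was eventually solved, which is one way routing fidelity could vary widely while mean quality stays flat.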

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces DecisionBench, a benchmark substrate for emergent delegation in long-horizon agentic workflows. It standardizes a task suite (GAIA, tau-bench, BFCL multi-turn), a pool of 11 peer models across 7 vendor families, a delegation interface (call_model with optional read_profile), a deterministic skill-annotation layer, and a multi-axis metric suite. A five-condition reference sweep over 23,375 instances yields three findings: mean end-task quality is statistically indistinguishable across awareness conditions (|β| ≤ 0.010, p ≥ 0.21); routing fidelity-at-1 ranges 7.5–29.5% with delivery channel mattering more than description content; and a counterfactual-delegation ceiling indicates 15–31 percentage points of unrealized headroom above measured performance on every suite. The substrate, annotations, reference interventions, analysis pipeline, and run archives are released.

Significance. If the benchmark substrate and its reference characterization hold, the work supplies a reusable, interface-agnostic evaluation framework that separates quality from orchestration signals and quantifies attainable headroom for delegation methods. Explicit release of the substrate, deterministic annotation layer, 220 per-condition run archives, and analysis pipeline constitutes a concrete reproducibility asset that future work on learned routers or adaptive profiles can directly build upon.

major comments (1)

[Abstract / Counterfactual ceiling results] The headline claim of 15–31 pp unrealized headroom rests on the counterfactual-delegation ceiling. The manuscript must explicitly state whether this ceiling is obtained by per-instance hindsight oracle selection (best model after seeing the outcome) or by a policy that decides using only the fixed interface (call_model + optional read_profile) and the deterministic skill annotations available at decision time. If the former, the reported gap relies on information unavailable to any realistic delegation policy and therefore overstates attainable headroom; a revised ceiling computed under the actual information constraints of the interface should be added.

minor comments (2)

[Methods] The abstract supplies concrete statistics (|β| ≤ 0.010, p ≥ 0.21) yet the full methods section should include explicit error-bar computation, data-exclusion rules, and per-suite sample sizes so that the statistical-indistinguishability claim can be independently verified.
[Results] The figure or table presenting the 15–31 pp range should report the exact per-suite values and the precise definition of the counterfactual used, rather than an aggregate range.
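
The error-bar computation requested in the [Methods] comment could take several forms. The sketch below uses a paired percentile bootstrap over task instances for the difference in mean quality between two conditions. This is a swapped-in illustration, not the paper's procedure, which this page reports only through the |β| and p values; the function name, parameters, and toy scores are hypothetical.

```python
import random


def paired_bootstrap_ci(blind, aware, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(aware - blind), resampling instances.

    blind, aware: per-instance quality scores in the same instance order.
    Returns (point_estimate, lower, upper).
    """
    if len(blind) != len(aware) or not blind:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    diffs = [a - b for a, b in zip(aware, blind)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(diffs) / n, lower, upper


# Toy data (hypothetical): binary quality on eight instances per condition.
blind_scores = [1, 0, 1, 1, 0, 0, 1, 0]
aware_scores = [1, 0, 1, 0, 1, 0, 1, 0]

point, lower, upper = paired_bootstrap_ci(blind_scores, aware_scores)
print(f"diff={point:+.3f} 95% CI=[{lower:+.3f}, {upper:+.3f}]")
```

An interval that straddles zero would be consistent with the statistical-indistinguishability claim; reporting it per suite, with sample sizes, is what the referee asks for.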

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their thorough review and valuable feedback on the manuscript. We address the major comment regarding the counterfactual-delegation ceiling below.

Point-by-point responses

Referee: [Abstract / Counterfactual ceiling results] The headline claim of 15–31 pp unrealized headroom rests on the counterfactual-delegation ceiling. The manuscript must explicitly state whether this ceiling is obtained by per-instance hindsight oracle selection (best model after seeing the outcome) or by a policy that decides using only the fixed interface (call_model + optional read_profile) and the deterministic skill annotations available at decision time. If the former, the reported gap relies on information unavailable to any realistic delegation policy and therefore overstates attainable headroom; a revised ceiling computed under the actual information constraints of the interface should be added.

Authors: We agree with the referee that the current presentation of the counterfactual-delegation ceiling requires clarification. The ceiling is indeed computed using per-instance hindsight oracle selection: for each task instance, we identify the model that delivered the highest quality outcome after execution. This approach quantifies the maximum attainable performance under perfect information about outcomes, thereby highlighting the potential headroom for improved delegation strategies. However, as the referee notes, this incorporates outcome information unavailable at decision time. We will revise the manuscript to explicitly state this in the abstract and the relevant results section. Furthermore, we will add a new analysis computing a revised ceiling that operates strictly under the information constraints of the delegation interface, relying solely on the deterministic skill annotations and the call_model/read_profile channels available prior to execution. This will provide a more conservative and realistic estimate of attainable headroom for policies using only pre-decision information.

Revision: yes
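
The distinction drawn in this exchange can be made concrete with a small sketch. It contrasts the hindsight oracle with one possible decision-time policy that routes each instance to the model with the best mean quality for the instance's annotated skill. The skill-routed policy is an assumption for illustration, not the revised ceiling the authors promise; the record layout, function names, skill labels, and model names are hypothetical.

```python
from collections import defaultdict


def hindsight_ceiling(records):
    """Mean of the per-instance best quality (uses outcomes after execution)."""
    best = {}
    for inst, _skill, _model, q in records:
        best[inst] = max(best.get(inst, float("-inf")), q)
    return sum(best.values()) / len(best)


def skill_routed_ceiling(records):
    """Mean quality when each instance goes to its skill's best-on-average model.

    Assumes one skill label per instance and a score for every model on every
    instance. The per-skill winner is chosen in-sample here, which is still
    optimistic; a held-out split would be needed for an honest estimate.
    """
    totals = defaultdict(lambda: [0.0, 0])  # (skill, model) -> [sum, count]
    for _inst, skill, model, q in records:
        totals[(skill, model)][0] += q
        totals[(skill, model)][1] += 1
    best_model = {}  # skill -> (model, mean quality); ties keep the first seen
    for (skill, model), (total, count) in totals.items():
        mean = total / count
        if skill not in best_model or mean > best_model[skill][1]:
            best_model[skill] = (model, mean)
    routed = [q for _inst, skill, model, q in records if best_model[skill][0] == model]
    return sum(routed) / len(routed)


# Toy data (hypothetical): (instance_id, skill, model_name, quality).
records = [
    ("t1", "search", "model_a", 1.0), ("t1", "search", "model_b", 0.0),
    ("t2", "search", "model_a", 0.0), ("t2", "search", "model_b", 1.0),
    ("t3", "code", "model_a", 0.0), ("t3", "code", "model_b", 1.0),
    ("t4", "code", "model_a", 0.0), ("t4", "code", "model_b", 1.0),
]

oracle = hindsight_ceiling(records)
routed = skill_routed_ceiling(records)
print(f"hindsight={oracle:.2f} skill-routed={routed:.2f}")
```

On the `search` instances neither model is reliably better, so a policy that sees only the skill label cannot match the hindsight oracle. That gap is the overstatement the referee is concerned about.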

Circularity Check

0 steps flagged

No significant circularity; benchmark substrate reports direct empirical measurements

Full rationale

The paper defines DecisionBench as an external benchmark substrate with fixed public task suites (GAIA, tau-bench, BFCL), a peer-model pool, and a multi-axis metric suite that explicitly includes a counterfactual-delegation ceiling. Reported results are characterizations via reference sweeps on n=23,375 instances, with findings on quality indistinguishability, routing fidelity, and measured gaps to the ceiling. No derivation chain reduces by construction to fitted parameters, self-definitions, or self-citation load-bearing steps; the ceiling is a defined upper-bound metric within the substrate rather than a predicted quantity derived from internal fits. The work remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The benchmark design rests on domain assumptions about task representativeness and model-pool diversity; no free parameters or invented entities are introduced beyond the substrate definition.

axioms (1)

Domain assumption: The chosen task suites (GAIA, tau-bench, BFCL multi-turn) and 11-model pool adequately sample long-horizon agentic delegation scenarios.
Stated when defining the fixed task suite and peer-model pool in the abstract.

pith-pipeline@v0.9.0 · 5832 in / 1308 out tokens · 47749 ms · 2026-05-20T10:12:43.488442+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "We introduce DecisionBench, a benchmark substrate for emergent delegation in long-horizon agentic workflows... metric suite covering quality, cost, latency, delegation rate, routing fidelity-at-k, vendor self-preference, and a counterfactual-delegation ceiling."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "Counterfactual-delegation ceiling... perfect delegation 15–31 percentage points above measured performance"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages · 29 internal anchors

[1]

MT-Bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues, 2024

Ge Bai, Jie Liu, Xingyuan Bu, Yancheng He, Jiaheng Liu, Zhanhui Zhou, Zhuoran Lin, Wenbo Su, Tiezheng Ge, Bo Zheng, and Wanli Ouyang. MT-Bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues, 2024. URL https: //arxiv.org/abs/2402.14762

work page arXiv 2024
[2]

Constitutional AI: Harmlessness from AI Feedback

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional AI: Harmlessness from AI feedback, 2022. URLhttps://arxiv.org/abs/2212.08073

work page internal anchor Pith review Pith/arXiv arXiv 2022
[3]

LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. LongBench: A bilingual, multitask benchmark for long context understanding, 2023. URL https://arxiv. org/abs/2308.14508

work page internal anchor Pith review Pith/arXiv arXiv 2023
[4]

Fitting linear mixed-effects models using lme4.Journal of Statistical Software, 67(1):1–48, 2015

Douglas Bates, Martin Mächler, Ben Bolker, and Steve Walker. Fitting linear mixed-effects models using lme4.Journal of Statistical Software, 67(1):1–48, 2015. doi: 10.18637/jss.v67.i01

work page doi:10.18637/jss.v67.i01 2015
[5]

MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Aleksander Madry, and Lilian Weng. MLE-bench: Evaluating machine learning agents on machine learning engineering, 2024. URL https://arxiv.org/abs/2410.07095

work page internal anchor Pith review Pith/arXiv arXiv 2024
[6]

FrugalGPT: How to use large language models while reducing cost and improving performance, 2023

Lingjiao Chen, Matei Zaharia, and James Zou. FrugalGPT: How to use large language models while reducing cost and improving performance, 2023. URLhttps://arxiv.org/abs/2305. 05176

work page 2023
[7]

Mind2Web: Towards a Generalist Agent for the Web

Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2Web: Towards a generalist agent for the web, 2023. URL https://arxiv.org/ abs/2306.06070

work page internal anchor Pith review Pith/arXiv arXiv 2023
[8]

Dujian Ding, Ankur Mallick, Chi Wang, Robert Sim, Subhabrata Mukherjee, Victor Ruhle, Laks V . S. Lakshmanan, and Ahmed Hassan Awadallah. Hybrid LLM: Cost-efficient and quality-aware query routing, 2024. URLhttps://arxiv.org/abs/2404.14618

work page arXiv 2024
[9]

Improving Factuality and Reasoning in Language Models through Multiagent Debate

Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate, 2023. URL https: //arxiv.org/abs/2305.14325

work page internal anchor Pith review Pith/arXiv arXiv 2023
[10]

Tibshirani.An Introduction to the Bootstrap

Bradley Efron and Robert J. Tibshirani.An Introduction to the Bootstrap. Chapman & Hall/CRC, 1993

work page 1993
[11]

MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework

Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. MetaGPT: Meta programming for a multi-agent collaborative framework, 2023. URLhttps://arxiv.org/abs/2308.00352

work page internal anchor Pith review Pith/arXiv arXiv 2023
[12]

RULER: What's the Real Context Size of Your Long-Context Language Models?

Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. RULER: What’s the real context size of your long-context language models?, 2024. URLhttps://arxiv.org/abs/2404.06654

work page internal anchor Pith review Pith/arXiv arXiv 2024
[13]

RouterBench: A Benchmark for Multi-LLM Routing System

Qitian Jason Hu, Jacob Bieker, Xiuyu Li, Nan Jiang, Benjamin Keigwin, Gaurav Ranganath, Kurt Keutzer, and Shriyash Kaustubh Upadhyay. RouterBench: A benchmark for multi-LLM routing system, 2024. URLhttps://arxiv.org/abs/2403.12031

work page internal anchor Pith review Pith/arXiv arXiv 2024
[14]

Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues?,

work page
[15]

URLhttps://arxiv.org/abs/2310.06770

work page internal anchor Pith review Pith/arXiv arXiv
[16]

Decomposed prompting: A modular approach for solving complex tasks,

Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, and Ashish Sabharwal. Decomposed prompting: A modular approach for solving complex tasks,

work page
[17]

URLhttps://arxiv.org/abs/2210.02406

work page internal anchor Pith review Pith/arXiv arXiv
[18]

MT-Eval: A multi-turn capabilities evaluation benchmark for large language models

Wai-Chung Kwan, Xingshan Zeng, Yufei Wang, Yusen Sun, Liangyou Li, Lifeng Shang, Qun Liu, and Kam-Fai Wong. MT-Eval: A multi-turn capabilities evaluation benchmark for large language models, 2024. URLhttps://arxiv.org/abs/2401.16745

work page arXiv 2024
[19]

SkillsBench: How Skills Work in AI Agents

Laude Institute. SkillsBench: How Skills Work in AI Agents. https://www.skillsbench. ai/, 2026. Open-source benchmark for skill-aware agent configurations. 14

work page 2026
[20]

From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline

Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Banghua Zhu, Joseph E. Gonzalez, and Ion Stoica. From crowdsourced data to high-quality benchmarks: Arena-hard and BenchBuilder pipeline, 2024. URLhttps://arxiv.org/abs/2406.11939

work page internal anchor Pith review Pith/arXiv arXiv 2024
[21]

AgentBench: Evaluating LLMs as agents, 2023

Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. AgentBench: Evaluating LLMs as agents, 2023. URL https://arxiv.org/abs/2308. 03688

work page 2023
[22]

LMArena: Crowdsourced LLM preference leaderboard

LMSYS. LMArena: Crowdsourced LLM preference leaderboard. https://lmarena.ai/ leaderboard, 2026

work page 2026
[23]

AutoMix: Automatically mixing language models, 2023

Aman Madaan, Pranjal Aggarwal, Ankit Anand, Srividya Pranavi Potharaju, Swaroop Mishra, Pei Zhou, Aditya Gupta, Dheeraj Rajagopal, Karthik Kappaganthu, Yiming Yang, Shyam Upadhyay, and Manaal Faruqui. AutoMix: Automatically mixing language models, 2023. URL https://arxiv.org/abs/2310.12963

work page arXiv 2023
[24]

Self-Refine: Iterative Refinement with Self-Feedback

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-refine: Iterative refinement with self-feedback, 2023. URL https://arxiv.org/abs/2303.17651

work page internal anchor Pith review Pith/arXiv arXiv 2023
[25]

GAIA: a benchmark for General AI Assistants

Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: a benchmark for General AI Assistants, 2023. URL https://arxiv.org/ abs/2311.12983

work page internal anchor Pith review Pith/arXiv arXiv 2023
[26]

Gonzalez, M

Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E. Gonzalez, M. Waleed Kadous, and Ion Stoica. RouteLLM: Learning to route LLMs with preference data,

work page
[27]

URLhttps://arxiv.org/abs/2406.18665

work page internal anchor Pith review Pith/arXiv arXiv
[28]

OpenRouter model directory and pricing

OpenRouter. OpenRouter model directory and pricing. https://openrouter.ai/models, 2026

work page 2026
[29]

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback,...

work page internal anchor Pith review Pith/arXiv arXiv 2022
[30]

Bowman, and Shi Feng

Arjun Panickssery, Samuel R. Bowman, and Shi Feng. LLM evaluators recognize and favor their own generations, 2024. URLhttps://arxiv.org/abs/2404.13076

work page arXiv 2024
[31]

Generative Agents: Interactive Simulacra of Human Behavior

Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior, 2023. URL https://arxiv.org/abs/2304.03442

work page internal anchor Pith review Pith/arXiv arXiv 2023
[32]

Patil, Huanzhi Mao, Charlie Cheng-Jie Ji, Fanjia Yan, Vishnu Suresh, Ion Stoica, and Joseph E

Shishir G. Patil, Huanzhi Mao, Charlie Cheng-Jie Ji, Fanjia Yan, Vishnu Suresh, Ion Stoica, and Joseph E. Gonzalez. Berkeley function calling leaderboard (BFCL) v3 / v4.https://gorilla. cs.berkeley.edu/leaderboard.html, 2024. Multi-turn function-calling benchmark

work page 2024
[33]

ToolLLM: Facilitating large language models to master 16000+ real-world APIs, 2023

Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. ToolLLM: Facilitating large language models to master 16000+ real-world APIs, 2023. URL https://arxiv.org/abs/2307. 16789

work page 2023
[34]

Verbosity bias in preference labeling by large language models, 2023

Keita Saito, Akifumi Wachi, Koki Wataoka, and Youhei Akimoto. Verbosity bias in preference labeling by large language models, 2023. URLhttps://arxiv.org/abs/2310.10076

work page arXiv 2023
[35]

SWE-Bench Pro: A multi-language benchmark for repository-level coding

Scale AI. SWE-Bench Pro: A multi-language benchmark for repository-level coding. https: //github.com/scaleapi/SWE-bench_Pro-os, 2025. Public dataset and evaluation harness

work page 2025
[36]

SWE-Bench Pro public Leaderboard

Scale Labs. SWE-Bench Pro public Leaderboard. https://labs.scale.com/ leaderboard/swe_bench_pro_public, 2026

work page 2026
[37]

Statsmodels: Econometric and statistical modeling with Python

Skipper Seabold and Josef Perktold. Statsmodels: Econometric and statistical modeling with Python. In9th Python in Science Conference, 2010. 15

work page 2010
[38]

HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face

Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. HuggingGPT: Solving AI tasks with ChatGPT and its friends in HuggingFace, 2023. URL https://arxiv.org/abs/2303.17580

work page internal anchor Pith review Pith/arXiv arXiv 2023
[39]

Reflexion: Language Agents with Verbal Reinforcement Learning

Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning, 2023. URL https://arxiv.org/abs/2303.11366

work page internal anchor Pith review Pith/arXiv arXiv 2023
[40]

ALFWorld: Aligning Text and Embodied Environments for Interactive Learning

Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. ALFWorld: Aligning text and embodied environments for interactive learning, 2020. URLhttps://arxiv.org/abs/2010.03768

work page internal anchor Pith review Pith/arXiv arXiv 2020
[41]

Voyager: An open-ended embodied agent with large language models,

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models,

work page
[42]

URLhttps://arxiv.org/abs/2305.16291

work page internal anchor Pith review Pith/arXiv arXiv
[43]

Mixture-of-agents enhances large language model capabilities, 2024

Junlin Wang, Jue Wang, Ben Athiwaratkun, Ce Zhang, and James Zou. Mixture-of-agents enhances large language model capabilities, 2024. URL https://arxiv.org/abs/2406. 04692

work page 2024
[44]

Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models

Lei Wang, Wanyu Xu, Yihuai Lan, Zhiqiang Hu, Yunshi Lan, Roy Ka-Wei Lee, and Ee-Peng Lim. Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models, 2023. URLhttps://arxiv.org/abs/2305.04091

work page internal anchor Pith review Pith/arXiv arXiv 2023
[45]

Large language models are not robust multiple choice selectors,

Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and Zhifang Sui. Large language models are not robust multiple choice selectors,

work page
[46]

URLhttps://arxiv.org/abs/2309.03882

work page arXiv
[47]

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2022. URLhttps://arxiv.org/abs/2201.11903

work page internal anchor Pith review Pith/arXiv arXiv 2022
[48]

AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Shaokun Zhang, Erkang Zhu, Beibin Li, Li Jiang, Xiaoyun Zhang, and Chi Wang. AutoGen: Enabling next-gen LLM applications via multi-agent conversation, 2023. URL https://arxiv.org/abs/2308.08155

[49] Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, et al. The rise and potential of large language model based agents: A survey, 2023. URL https://arxiv.org/abs/2309.07864

[50] Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, and Tao Yu. OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments, 2024. URL https://arxiv.org/abs/2...

[51] Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. WebShop: Towards scalable real-world web interaction with grounded language agents, 2022. URL https://arxiv.org/abs/2207.01206

[52] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models, 2022. URL https://arxiv.org/abs/2210.03629

[53] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models, 2023. URL https://arxiv.org/abs/2305.10601

[54] Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. τ-bench: A benchmark for tool-agent-user interaction in real-world domains, 2024. URL https://arxiv.org/abs/2406.12045

[55] Andy K. Zhang, Neil Perry, Riya Dulepet, Joey Ji, Justin W. Jones, Celeste Menders Lin, Eliot Hussein, Samantha Lopez, Andres Yuan, Arnav Zhang, et al. Cybench: A framework for evaluating cybersecurity capabilities and risk of language models, 2024. URL https://arxiv.org/abs/2408.08926

[56] Jieyu Zhang, Ranjay Krishna, Ahmed Hassan Awadallah, and Chi Wang. EcoAssistant: Using LLM assistants more affordably and accurately, 2023. URL https://arxiv.org/abs/2310.03046

[57] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena, 2023. URL https://arxiv.org/abs/2306.05685
[58] Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. WebArena: A realistic web environment for building autonomous agents, 2023. URL https://arxiv.org/abs/2307.13854

Organization of Appendix

The appendix is organized as follows: in App. A we list the eleven-a...
Infra-only. If every tool call is to call_model or read_profile (DecisionBench infrastructure tools), tag with the private _infra_delegation marker and return early; this step does not contribute to graded skill stats.

Numerical computation. If any tool name is in {calculator, evaluate_expression, eval_python, python_eval, math_eval, compute}, OR the tool’s arguments contain ≥3 numerical tokens (long-digit / decimal / currency / date or time patterns), tag numerical_computation.

Information retrieval. If any tool name contains web_search, search, fetch_url, browse, find_user_id, find_user, lookup, get_user_details, get_order, list_orders, get_product, list_products, get_reservation, list_reservation, search_direct_flight, search_onestop_flight, parse_pdf, extract_table, ocr, read_document, tag information_retrieval.

Tool-schema adherence. Otherwise, tag tool_schema_adherence.

Non-tool branch (text-only assistant turn):

Domain-policy compliance. If the suite is τ-bench AND the assistant text matches any of the regexes in Table 5 (case-insensitive), tag domain_policy_compliance.

Long-input handling. If the prompt-token count for this turn is ≥15,000 (from usage.prompt_tokens, falling back to a 4-chars-per-token heuristic), tag long_input_handling.

Multi-step reasoning. If suite is GAIA AND ≥2 prior tool calls in this task, tag multi_step_reasoning; on GAIA with <2 prior tool calls, tag None.

Multi-turn state tracking. If suite is τ-bench or BFCL, tag multi_turn_state_tracking; otherwise None.

Policy-compliance regex (case-insensitive)

\bagainst\s+(?:our\s+|the\s+)?policy\b
\bnot\s+permitted\b
\bI\s+cannot\b.{0,40}\bpolicy\b
\btransfer.{0,20}human\s+agent
\boutside\s+(?:my|our)\s+scope\b
\bplease\s+confirm\b
\bI\s+(?:will\s+)?need\s+(?:your\s+)...
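The tagging rules above read naturally as a first-match cascade, and the following Python sketch shows one way they might be wired together. It is an illustration under stated assumptions, not the benchmark's implementation: the function name `tag_step`, the dictionary fields `name` and `arguments`, the suite identifier `"tau-bench"`, and the `NUMERIC_TOKEN` pattern are all hypothetical, and the check order in the non-tool branch is assumed to follow the order in which the rules are listed. Only the six policy regexes that appear in full above are included.

```python
import re
from typing import Optional

INFRA_TOOLS = {"call_model", "read_profile"}
NUMERIC_TOOLS = {"calculator", "evaluate_expression", "eval_python",
                 "python_eval", "math_eval", "compute"}
RETRIEVAL_SUBSTRINGS = (
    "web_search", "search", "fetch_url", "browse", "find_user_id", "find_user",
    "lookup", "get_user_details", "get_order", "list_orders", "get_product",
    "list_products", "get_reservation", "list_reservation",
    "search_direct_flight", "search_onestop_flight", "parse_pdf",
    "extract_table", "ocr", "read_document",
)
# Assumption: a crude stand-in for "numerical tokens"; the exact patterns
# used by the benchmark are not given in the text.
NUMERIC_TOKEN = re.compile(r"\$?\d[\d.,:/-]*")

# The six policy-compliance regexes that are listed in full above.
POLICY_PATTERNS = [re.compile(p, re.IGNORECASE) for p in (
    r"\bagainst\s+(?:our\s+|the\s+)?policy\b",
    r"\bnot\s+permitted\b",
    r"\bI\s+cannot\b.{0,40}\bpolicy\b",
    r"\btransfer.{0,20}human\s+agent",
    r"\boutside\s+(?:my|our)\s+scope\b",
    r"\bplease\s+confirm\b",
)]


def tag_step(suite: str, tool_calls: list, assistant_text: str = "",
             prompt_tokens: Optional[int] = None, prompt_chars: int = 0,
             prior_tool_calls: int = 0) -> Optional[str]:
    """Return a skill tag for one assistant step (hypothetical interface)."""
    if tool_calls:  # tool branch
        names = [call["name"] for call in tool_calls]
        if all(name in INFRA_TOOLS for name in names):
            return "_infra_delegation"  # excluded from graded skill stats
        many_numbers = any(
            len(NUMERIC_TOKEN.findall(call.get("arguments", ""))) >= 3
            for call in tool_calls
        )
        if any(name in NUMERIC_TOOLS for name in names) or many_numbers:
            return "numerical_computation"
        if any(sub in name for name in names for sub in RETRIEVAL_SUBSTRINGS):
            return "information_retrieval"
        return "tool_schema_adherence"

    # Non-tool branch (text-only assistant turn).
    if suite == "tau-bench" and any(p.search(assistant_text)
                                    for p in POLICY_PATTERNS):
        return "domain_policy_compliance"
    # Fall back to the 4-chars-per-token heuristic when usage is missing.
    tokens = prompt_tokens if prompt_tokens is not None else prompt_chars // 4
    if tokens >= 15_000:
        return "long_input_handling"
    if suite == "GAIA":
        return "multi_step_reasoning" if prior_tool_calls >= 2 else None
    if suite in ("tau-bench", "BFCL"):
        return "multi_turn_state_tracking"
    return None
```

Because the retrieval rule matches on substrings, any tool whose name merely contains `search` or `ocr` is tagged as retrieval; a reimplementation may want to log such matches to check that they are intended.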
Single-step delegation, perfect skill identification. For each task we tag the dominant skill and assume the agent delegates the entire task to the Stage-1-best peer for that skill in a single call_model. Real GAIA tasks decompose into 3–7 steps with potentially different dominant skills; a multi-step ceiling that allowed per-step delegation would be high...

Peer answers at its Stage-1 pass rate. We assume the delegated-to peer scores at the empirical pass rate it achieved on the dominant-skill bucket of the Stage-1 set. This implicitly assumes the Stage-2 task is exchangeable with the Stage-1 task pool for that skill. Per-task difficulty variation inside a skill bucket is real (e.g., long-input-handling on a...

No context-loss penalty. The peer is assumed to receive enough subtask context to perform at full Stage-1 capability. In practice the orchestrator must compress its trajectory state into the call_model subtask string and the peer answers without seeing earlier turns; we sensitivity-test this in Table 11 below.

No coordination cost. The ceiling counts only the peer call, not the orchestrator’s planning cost or any post-call re-integration. In practice an orchestrator pays for both.

Peer-realization rate                  GAIA     τ-bench   BFCL
100% of Stage-1 (reported in §6.6)     +0.269   +0.153    +0.313
90% of Stage-1 (mild context loss)     +0.230   +0.123    +0.272
80% of Stage-1 (heavy context loss...
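To make the ceiling assumptions above concrete, here is a minimal Python sketch of how a counterfactual-delegation headroom might be computed. It rests on assumptions that the text does not confirm: that the ceiling is the mean, over tasks, of the best Stage-1 pass rate for each task's dominant skill; that the peer-realization rate scales that pass rate multiplicatively; and that headroom is the ceiling minus observed quality. The function names, the `stage1` layout, and every number in the usage note are hypothetical and are not results from the paper.

```python
def best_peer_per_skill(stage1: dict) -> dict:
    """Map each skill to (peer, pass_rate) for the peer with the highest rate.

    stage1 is assumed to look like {peer_name: {skill: pass_rate}}.
    """
    best = {}
    for peer, by_skill in stage1.items():
        for skill, rate in by_skill.items():
            if skill not in best or rate > best[skill][1]:
                best[skill] = (peer, rate)
    return best


def ceiling_headroom(task_skills: list, observed_quality: float,
                     stage1: dict, realization: float = 1.0) -> float:
    """Headroom of single-step perfect delegation over observed quality.

    task_skills holds one dominant-skill tag per task; realization is the
    fraction of Stage-1 capability the peer is assumed to retain.
    """
    best = best_peer_per_skill(stage1)
    # Expected score per task: the best peer's pass rate on the dominant
    # skill, discounted by the realization rate (no coordination cost).
    expected = [realization * best[skill][1] for skill in task_skills]
    ceiling = sum(expected) / len(expected)
    return ceiling - observed_quality
```

With made-up inputs such as `stage1 = {"peer_a": {"numerical_computation": 0.8, "information_retrieval": 0.5}, "peer_b": {"numerical_computation": 0.6, "information_retrieval": 0.7}}`, two tasks (one per skill), and an observed quality of 0.5, the sketch gives a headroom of 0.25 at full realization and 0.175 at `realization=0.9`.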

[1] Ge Bai, Jie Liu, Xingyuan Bu, Yancheng He, Jiaheng Liu, Zhanhui Zhou, Zhuoran Lin, Wenbo Su, Tiezheng Ge, Bo Zheng, and Wanli Ouyang. MT-Bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues, 2024. URL https://arxiv.org/abs/2402.14762

[2] Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional AI: Harmlessness from AI feedback, 2022. URL https://arxiv.org/abs/2212.08073

[3] Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. LongBench: A bilingual, multitask benchmark for long context understanding, 2023. URL https://arxiv.org/abs/2308.14508

[4] Douglas Bates, Martin Mächler, Ben Bolker, and Steve Walker. Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1):1–48, 2015. doi: 10.18637/jss.v67.i01

[5] Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Aleksander Madry, and Lilian Weng. MLE-bench: Evaluating machine learning agents on machine learning engineering, 2024. URL https://arxiv.org/abs/2410.07095

[6] Lingjiao Chen, Matei Zaharia, and James Zou. FrugalGPT: How to use large language models while reducing cost and improving performance, 2023. URL https://arxiv.org/abs/2305.05176

[7] Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2Web: Towards a generalist agent for the web, 2023. URL https://arxiv.org/abs/2306.06070

[8] Dujian Ding, Ankur Mallick, Chi Wang, Robert Sim, Subhabrata Mukherjee, Victor Ruhle, Laks V. S. Lakshmanan, and Ahmed Hassan Awadallah. Hybrid LLM: Cost-efficient and quality-aware query routing, 2024. URL https://arxiv.org/abs/2404.14618

[9] Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate, 2023. URL https://arxiv.org/abs/2305.14325

[10] Bradley Efron and Robert J. Tibshirani. An Introduction to the Bootstrap. Chapman & Hall/CRC, 1993

[11] Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. MetaGPT: Meta programming for a multi-agent collaborative framework, 2023. URL https://arxiv.org/abs/2308.00352

[12] Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. RULER: What’s the real context size of your long-context language models?, 2024. URL https://arxiv.org/abs/2404.06654

[13] Qitian Jason Hu, Jacob Bieker, Xiuyu Li, Nan Jiang, Benjamin Keigwin, Gaurav Ranganath, Kurt Keutzer, and Shriyash Kaustubh Upadhyay. RouterBench: A benchmark for multi-LLM routing system, 2024. URL https://arxiv.org/abs/2403.12031

[14, 15] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? URL https://arxiv.org/abs/2310.06770

[16, 17] Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, and Ashish Sabharwal. Decomposed prompting: A modular approach for solving complex tasks. URL https://arxiv.org/abs/2210.02406

[18] Wai-Chung Kwan, Xingshan Zeng, Yufei Wang, Yusen Sun, Liangyou Li, Lifeng Shang, Qun Liu, and Kam-Fai Wong. MT-Eval: A multi-turn capabilities evaluation benchmark for large language models, 2024. URL https://arxiv.org/abs/2401.16745

[19] Laude Institute. SkillsBench: How Skills Work in AI Agents. https://www.skillsbench.ai/, 2026. Open-source benchmark for skill-aware agent configurations.

[20] Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Banghua Zhu, Joseph E. Gonzalez, and Ion Stoica. From crowdsourced data to high-quality benchmarks: Arena-hard and BenchBuilder pipeline, 2024. URL https://arxiv.org/abs/2406.11939

[21] Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. AgentBench: Evaluating LLMs as agents, 2023. URL https://arxiv.org/abs/2308.03688

[22] LMSYS. LMArena: Crowdsourced LLM preference leaderboard. https://lmarena.ai/leaderboard, 2026

[23] Aman Madaan, Pranjal Aggarwal, Ankit Anand, Srividya Pranavi Potharaju, Swaroop Mishra, Pei Zhou, Aditya Gupta, Dheeraj Rajagopal, Karthik Kappaganthu, Yiming Yang, Shyam Upadhyay, and Manaal Faruqui. AutoMix: Automatically mixing language models, 2023. URL https://arxiv.org/abs/2310.12963

[24] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-refine: Iterative refinement with self-feedback, 2023. URL https://arxiv.org/abs/2303.17651

[25] Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: a benchmark for General AI Assistants, 2023. URL https://arxiv.org/abs/2311.12983

[26, 27] Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E. Gonzalez, M. Waleed Kadous, and Ion Stoica. RouteLLM: Learning to route LLMs with preference data. URL https://arxiv.org/abs/2406.18665

[28] OpenRouter. OpenRouter model directory and pricing. https://openrouter.ai/models, 2026

[29] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback,...

[30] Arjun Panickssery, Samuel R. Bowman, and Shi Feng. LLM evaluators recognize and favor their own generations, 2024. URL https://arxiv.org/abs/2404.13076

[31] Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior, 2023. URL https://arxiv.org/abs/2304.03442

[32] Shishir G. Patil, Huanzhi Mao, Charlie Cheng-Jie Ji, Fanjia Yan, Vishnu Suresh, Ion Stoica, and Joseph E. Gonzalez. Berkeley function calling leaderboard (BFCL) v3 / v4. https://gorilla.cs.berkeley.edu/leaderboard.html, 2024. Multi-turn function-calling benchmark

[33] Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. ToolLLM: Facilitating large language models to master 16000+ real-world APIs, 2023. URL https://arxiv.org/abs/2307.16789

[34] Keita Saito, Akifumi Wachi, Koki Wataoka, and Youhei Akimoto. Verbosity bias in preference labeling by large language models, 2023. URL https://arxiv.org/abs/2310.10076

[35] Scale AI. SWE-Bench Pro: A multi-language benchmark for repository-level coding. https://github.com/scaleapi/SWE-bench_Pro-os, 2025. Public dataset and evaluation harness

[36] Scale Labs. SWE-Bench Pro public Leaderboard. https://labs.scale.com/leaderboard/swe_bench_pro_public, 2026

[37] Skipper Seabold and Josef Perktold. Statsmodels: Econometric and statistical modeling with Python. In 9th Python in Science Conference, 2010.

[38] Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. HuggingGPT: Solving AI tasks with ChatGPT and its friends in HuggingFace, 2023. URL https://arxiv.org/abs/2303.17580

[39] Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning, 2023. URL https://arxiv.org/abs/2303.11366

[40] Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. ALFWorld: Aligning text and embodied environments for interactive learning, 2020. URL https://arxiv.org/abs/2010.03768

[41, 42] Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. URL https://arxiv.org/abs/2305.16291

[43] Junlin Wang, Jue Wang, Ben Athiwaratkun, Ce Zhang, and James Zou. Mixture-of-agents enhances large language model capabilities, 2024. URL https://arxiv.org/abs/2406.04692

[44] Lei Wang, Wanyu Xu, Yihuai Lan, Zhiqiang Hu, Yunshi Lan, Roy Ka-Wei Lee, and Ee-Peng Lim. Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models, 2023. URL https://arxiv.org/abs/2305.04091

[45, 46] Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and Zhifang Sui. Large language models are not robust multiple choice selectors. URL https://arxiv.org/abs/2309.03882

[47] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2022. URL https://arxiv.org/abs/2201.11903
