AgentBench: Evaluating LLMs as Agents

Canonical reference. 86% of citing Pith papers cite this work as background.

144 Pith papers cite this work.
Background: 86% of classified citations.
abstract

The potential of Large Language Models (LLMs) as agents has been widely acknowledged recently. Thus, there is an urgent need to quantitatively evaluate LLMs as agents on challenging tasks in interactive environments. We present AgentBench, a multi-dimensional benchmark that consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities. Our extensive test over \num API-based and open-sourced (OSS) LLMs shows that, while top commercial LLMs present a strong ability to act as agents in complex environments, there is a significant disparity in performance between them and many OSS competitors that are no larger than 70B. We identify the typical reasons for failures in environments and LLMs, showing that poor long-term reasoning, decision-making, and instruction-following abilities are the main obstacles to developing usable LLM agents. Improving instruction following and training on high-quality multi-round alignment data could improve agent performance. Contrary to existing assumptions, training on code has ambivalent impacts on different agent tasks. Datasets, environments, and an integrated evaluation package for AgentBench are released at https://github.com/THUDM/AgentBench.

citation-role summary

background: 38 · dataset: 5 · baseline: 1
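
The role counts above appear to account for the 86% background figure quoted at the top of the page. The sketch below assumes that each percentage is a role's count divided by the total number of classified citations (38 + 5 + 1 = 44). That formula is an assumption that happens to match the displayed numbers, not a documented definition, and `role_shares` is a hypothetical helper name.

```python
# Citation-role counts as listed in the citation-role summary.
role_counts = {"background": 38, "dataset": 5, "baseline": 1}


def role_shares(counts):
    """Return each role's share of classified citations as a whole-number percentage.

    Assumption: the share is count / total classified citations.
    """
    total = sum(counts.values())
    return {role: round(100 * n / total) for role, n in counts.items()}


shares = role_shares(role_counts)
print(shares)  # {'background': 86, 'dataset': 11, 'baseline': 2}
```

Because each share is rounded independently, the three percentages sum to 99 rather than 100. Only 44 of the 144 citing papers carry a role classification, so the 86% describes the classified subset, not all citing papers.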

representative citing papers

DART: Semantic Recoverability for Structured Tool Agents

cs.AI · 2026-05-22 · unverdicted · novelty 7.0

DART is a modular runtime that certifies semantically recoverable boundaries for failed tool-agent instances and either selects admissible restore points that preserve downstream commitments or blocks recovery.

Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety

cs.CL · 2026-05-21 · unverdicted · novelty 7.0 · 2 refs

Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.

ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents

cs.AI · 2026-05-13 · unverdicted · novelty 7.0 · 2 refs

ClawForge is a generator framework that creates reproducible executable benchmarks for command-line agents under state conflict, with ClawForge-Bench showing frontier models reach at most 45.3% strict accuracy and that state inspection drives most performance gaps.

AgentEscapeBench: Evaluating Out-of-Domain Tool-Grounded Reasoning in LLM Agents

cs.AI · 2026-05-08 · unverdicted · novelty 7.0 · 2 refs

AgentEscapeBench is a benchmark of 270 tasks across five difficulty tiers that measures LLM agents' ability to manage long-range tool dependencies, state tracking, and intermediate result propagation, revealing sharp performance drops with increasing depth.
