super hub Canonical reference

AgentBench: Evaluating LLMs as Agents

Hanchen Zhang, Hanyu Lai, Hao Yu, Xiao Liu, Xuanyu Lei, Yifan Xu · 2023 · cs.AI · arXiv 2308.03688

Canonical reference. 86% of citing Pith papers cite this work as background.

208 Pith papers citing it

Background 86% of classified citations

open full Pith review browse 208 citing papers more from Hanchen Zhang arXiv PDF

abstract

The potential of Large Language Model (LLM) as agents has been widely acknowledged recently. Thus, there is an urgent need to quantitatively \textit{evaluate LLMs as agents} on challenging tasks in interactive environments. We present AgentBench, a multi-dimensional benchmark that consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities. Our extensive test over \num API-based and open-sourced (OSS) LLMs shows that, while top commercial LLMs present a strong ability of acting as agents in complex environments, there is a significant disparity in performance between them and many OSS competitors that are no larger than 70B. We identify the typical reasons of failures in environments and LLMs, showing that poor long-term reasoning, decision-making, and instruction following abilities are the main obstacles for developing usable LLM agents. Improving instruction following and training on high quality multi-round alignment data could improve agent performance. And different from existing assumptions, training on code present ambivalent impacts on different agent tasks. Datasets, environments, and an integrated evaluation package for AgentBench are released at https://github.com/THUDM/AgentBench.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 38 dataset 5 baseline 1

citation-polarity summary

background 38 use dataset 4 baseline 1 unclear 1

claims ledger

abstract The potential of Large Language Model (LLM) as agents has been widely acknowledged recently. Thus, there is an urgent need to quantitatively \textit{evaluate LLMs as agents} on challenging tasks in interactive environments. We present AgentBench, a multi-dimensional benchmark that consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities. Our extensive test over \num API-based and open-sourced (OSS) LLMs shows that, while top commercial LLMs present a strong ability of acting as agents in complex environments, there is a significant disparity in perfo

authors

Hanchen Zhang Hanyu Lai Hao Yu Xiao Liu Xuanyu Lei Yifan Xu

co-cited works

representative citing papers

EnergyAgentBench: Benchmarking LLM Agents on Live Energy Infrastructure Data

econ.EM · 2026-05-13 · accept · novelty 8.0

EnergyAgentBench is a new benchmark with 70 task variants that evaluates LLM agents on live energy data for datacenter siting, long-horizon optimization, and causal grid diagnosis.

WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

cs.CL · 2026-05-11 · unverdicted · novelty 8.0

A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.

SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning

cs.AI · 2026-05-10 · accept · novelty 8.0 · 2 refs

SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.

PhysicianBench: Evaluating LLM Agents in Real-World EHR Environments

cs.AI · 2026-05-04 · conditional · novelty 8.0

PhysicianBench is a new benchmark of 100 physician-reviewed, execution-grounded tasks in live EHR environments where the best LLM agent reaches only 46% success and open-source models reach 19%.

AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

cs.CR · 2024-06-19 · unverdicted · novelty 8.0

AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

cs.AI · 2024-04-11 · accept · novelty 8.0

OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.

LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

cs.CL · 2023-08-28 · unverdicted · novelty 8.0

LongBench is the first bilingual multi-task benchmark for long context understanding in LLMs, containing 21 datasets in 6 categories with average lengths of 6711 words (English) and 13386 characters (Chinese).

A$^{2}$utoLPBench: An Auto-Generated, Agent-Friendly LP Benchmark via Inverse-KKT Construction

cs.AI · 2026-07-02 · conditional · novelty 7.0

A²utoLPBench is a generator that produces unlimited LP word problems with ground-truth answers known by construction via inverse-KKT, bundled with a Docker environment for agent evaluation.

Self-GC: Self-Governing Context for Long-Horizon LLM Agents

cs.AI · 2026-07-01 · unverdicted · novelty 7.0

Self-GC governs agent context as indexed objects with planner-proposed actions, achieving 84.85% no-impact on future continuations on a hard set versus 54-70% for baselines.

MultiUAV-Plat: An LLM-Oriented Platform, Benchmark and Framework for Multi-UAV Collaborative Task Planning

cs.AI · 2026-06-30 · unverdicted · novelty 7.0

MultiUAV-Plat supplies a new RESTful simulation platform and 1500-task benchmark where Agent4Drone reaches 57.9% task pass rate versus 30.6% for ReAct baseline across 75 multi-UAV missions.

Whose Side Is Your Agent On? Multi-Party Principal Loyalty in LLM Agents

cs.AI · 2026-06-29 · unverdicted · novelty 7.0

PrincipalBench exposes a sharp split in frontier LLMs between selective and over-refusing behavior on multi-party loyalty, with prompt scaffolding and KL distillation reducing harm rates but only along an existing leak/over-refusal trade-off.

SpreadsheetBench 2: Evaluating Agents on End-to-End Business Spreadsheet Workflows

cs.SE · 2026-06-29 · unverdicted · novelty 7.0

SpreadsheetBench 2 provides 321 expert-validated tasks from authentic business data showing frontier LLMs reach only 34.89% overall accuracy on end-to-end spreadsheet workflows.

CLQT: A Closed-Loop, Cost-Aware, Strategy-Consistent Benchmark for Diagnostic Evaluation of LLM Portfolio-Management Agents

cs.AI · 2026-06-29 · unverdicted · novelty 7.0

CLQT is a new closed-loop, cost-aware benchmark that diagnoses LLM trading agent capabilities through strategy-consistent metrics and hash-verifiable trails rather than outcome rankings.

Agentic Abstention: Do Agents Know When to Stop Instead of Act?

cs.AI · 2026-06-27 · unverdicted · novelty 7.0

LLM agents often fail to abstain at the right time in uncertain multi-turn tasks, and the CONVOLVE context engineering method raises timely abstention rates on WebShop from 26.7 to 57.4 without parameter updates.

Glite ARF: Verifier-Driven Research with Parallel LLM Coding Agents

cs.MA · 2026-06-25 · accept · novelty 7.0

Glite ARF introduces a verifier-driven three-role framework for parallel LLM coding agents, demonstrated by first- and second-place finishes in the BEA 2026 vocabulary-difficulty shared task across three languages with 29.9-35.9% RMSE reduction at ~$450 API cost.

RigorBench: Benchmarking Engineering Process Discipline in Autonomous AI Coding Agents

cs.SE · 2026-06-21 · unverdicted · novelty 7.0

RigorBench evaluates AI coding agents on process discipline via five pillars and reports 41% higher process scores and 17% better outcome correctness with structured approaches on 30 tasks.

Power Systems Agent Benchmark: Executable Evaluation of AI Agents in Electric Power Engineering

cs.AI · 2026-06-18 · unverdicted · novelty 7.0 · 2 refs

Introduces the Power Systems Agent Benchmark with 41 task families across eight power engineering areas for executable evaluation of AI agents using deterministic feasibility checks.

StaminaBench: Stress-Testing Coding Agents over 100 Interaction Turns

cs.SE · 2026-06-17 · unverdicted · novelty 7.0

StaminaBench evaluates coding agents over 100 procedurally generated change requests to a REST API, finding that tested models fail within 5-6 turns without feedback but improve up to 12x with test feedback and good harnesses.

The Gate Is Only as Honest as Its Contracts: ContractGuard for the Contract Layer of Risk-Aware Causal Gating

cs.CR · 2026-06-17 · unverdicted · novelty 7.0

ContractGuard verifies tool contracts in RACG systems to prevent effect forgery, restoring zero injection success on benchmarks and six hosted models against adaptive attackers.

Layer-Isolated Evaluation: Gating the Deterministic Scaffold of a Production LLM Agent with a No-LLM, Regression-Locked Test Harness

cs.CL · 2026-06-10 · accept · novelty 7.0

Layer-isolated evaluation decomposes LLM agents into per-layer deterministic no-LLM test slices whose locked baselines localize regressions that aggregate pass rates mask.

ISE: An Execution-Grounded Recipe for Multi-Turn OS-Agent Trajectories

cs.CL · 2026-06-09 · conditional · novelty 7.0

ISE creates 23,132 execution-grounded multi-turn OS agent trajectories via intent simulation and live execution, improving agent performance on ClawEval from 19.3 to 37.7 pass@1 with Qwen3-8B.

Beyond Goodhart's Law: A Dynamic Benchmark for Evaluating Compliance in Multi-Agent Systems

cs.AI · 2026-06-05 · unverdicted · novelty 7.0

MAC-Bench is a new adversarial benchmark that converts legal texts into executable scenarios via the SERV pipeline to measure procedural compliance in multi-agent LLM systems using CSR and MG metrics.

ADK Arena: Evaluating Agent Development Kits via LLM-as-a-Developer

cs.SE · 2026-06-04 · unverdicted · novelty 7.0

ADK Arena evaluates 51 Python ADKs by having an LLM learn each framework's API, write and repair agent code, and run on benchmarks, finding 57% success rate, 5.6x cost variation, no dominant framework, and substitutable information sources.

AIP: A Graph Representation for Learning and Governing Agent Skills

cs.AI · 2026-06-03 · unverdicted · novelty 7.0

AIP models skills as graphs of discrete steps connected by typed I/O edges under a validated schema, raising agent mean reward from 0.60 to 0.71 and pass rate from 53% to 67% on 27 SkillsBench tasks while enabling node-level fixes.

citing papers explorer

Showing 13 of 13 citing papers after filters.

AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents cs.CR · 2024-06-19 · unverdicted · none · ref 32 · internal anchor
AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments cs.AI · 2024-04-11 · accept · none · ref 32 · internal anchor
OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.
MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering cs.CL · 2024-10-09 · unverdicted · none · ref 22 · internal anchor
MLE-bench evaluates frontier language models as ML engineering agents on 75 Kaggle competitions, with the top setup (o1-preview + AIDE) reaching bronze medal level in 16.9% of tasks.
Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark cs.AI · 2024-10-06 · unverdicted · none · ref 28 · internal anchor
PolyMATH is a new 5,000-image benchmark where top MLLMs reach at most 41 percent accuracy on multi-modal mathematical reasoning, with ablation showing minimal gain from text over images.
$\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains cs.AI · 2024-06-17 · unverdicted · none · ref 14 · internal anchor
τ-bench shows state-of-the-art agents like GPT-4o succeed on under 50% of tool-using, rule-following tasks and are inconsistent across repeated trials.
WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks? cs.LG · 2024-03-12 · unverdicted · none · ref 14 · internal anchor
WorkArena benchmark shows LLM web agents achieve partial success on enterprise tasks but have a substantial gap to full automation and perform worse with open-source models.
Training Language Models to Self-Correct via Reinforcement Learning cs.LG · 2024-09-19 · unverdicted · none · ref 136 · internal anchor
SCoRe uses multi-turn online RL with regularization on self-generated traces to improve LLM self-correction, achieving 15.6% and 9.1% gains on MATH and HumanEval for Gemini models.
WebCanvas: Benchmarking Web Agents in Online Environments cs.CL · 2024-06-18 · unverdicted · none · ref 17 · internal anchor
WebCanvas creates a dynamic benchmark for web agents with a noise-resistant evaluation metric, the Mind2Web-Live dataset of 542 tasks, and open-source tools and agent framework for ongoing online testing.
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code cs.SE · 2024-03-12 · unverdicted · none · ref 148 · internal anchor
LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
Understanding the planning of LLM agents: A survey cs.AI · 2024-02-05 · accept · none · ref 27 · internal anchor
A survey that provides a taxonomy of methods for improving planning in LLM-based agents across task decomposition, plan selection, external modules, reflection, and memory.
Agent AI: Surveying the Horizons of Multimodal Interaction cs.AI · 2024-01-07 · unverdicted · none · ref 244 · internal anchor
The paper defines Agent AI as interactive multimodal systems that perceive grounded data and generate embodied actions, arguing this approach can mitigate hallucinations in foundation models.
LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods cs.CL · 2024-12-07 · accept · none · ref 149 · internal anchor
A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.
The Landscape of Emerging AI Agent Architectures for Reasoning, Planning, and Tool Calling: A Survey cs.AI · 2024-04-17 · unverdicted · none · ref 18 · internal anchor
A survey of emerging AI agent architectures that organizes single and multi-agent designs around reasoning, planning, tool use, communication, and reflection phases.

AgentBench: Evaluating LLMs as Agents

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer