EnergyAgentBench is a new benchmark with 70 task variants that evaluates LLM agents on live energy data for datacenter siting, long-horizon optimization, and causal grid diagnosis.
super hub Canonical reference
AgentBench: Evaluating LLMs as Agents
Canonical reference. 86% of citing Pith papers cite this work as background.
abstract
The potential of Large Language Model (LLM) as agents has been widely acknowledged recently. Thus, there is an urgent need to quantitatively \textit{evaluate LLMs as agents} on challenging tasks in interactive environments. We present AgentBench, a multi-dimensional benchmark that consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities. Our extensive test over \num API-based and open-sourced (OSS) LLMs shows that, while top commercial LLMs present a strong ability of acting as agents in complex environments, there is a significant disparity in performance between them and many OSS competitors that are no larger than 70B. We identify the typical reasons of failures in environments and LLMs, showing that poor long-term reasoning, decision-making, and instruction following abilities are the main obstacles for developing usable LLM agents. Improving instruction following and training on high quality multi-round alignment data could improve agent performance. And different from existing assumptions, training on code present ambivalent impacts on different agent tasks. Datasets, environments, and an integrated evaluation package for AgentBench are released at https://github.com/THUDM/AgentBench.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract The potential of Large Language Model (LLM) as agents has been widely acknowledged recently. Thus, there is an urgent need to quantitatively \textit{evaluate LLMs as agents} on challenging tasks in interactive environments. We present AgentBench, a multi-dimensional benchmark that consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities. Our extensive test over \num API-based and open-sourced (OSS) LLMs shows that, while top commercial LLMs present a strong ability of acting as agents in complex environments, there is a significant disparity in perfo
authors
co-cited works
representative citing papers
A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.
SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.
PhysicianBench is a new benchmark of 100 physician-reviewed, execution-grounded tasks in live EHR environments where the best LLM agent reaches only 46% success and open-source models reach 19%.
AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.
OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.
LongBench is the first bilingual multi-task benchmark for long context understanding in LLMs, containing 21 datasets in 6 categories with average lengths of 6711 words (English) and 13386 characters (Chinese).
PrincipalBench exposes a sharp split in frontier LLMs between selective and over-refusing behavior on multi-party loyalty, with prompt scaffolding and KL distillation reducing harm rates but only along an existing leak/over-refusal trade-off.
SpreadsheetBench 2 provides 321 expert-validated tasks from authentic business data showing frontier LLMs reach only 34.89% overall accuracy on end-to-end spreadsheet workflows.
CLQT is a new closed-loop, cost-aware benchmark that diagnoses LLM trading agent capabilities through strategy-consistent metrics and hash-verifiable trails rather than outcome rankings.
LLM agents often fail to abstain at the right time in uncertain multi-turn tasks, and the CONVOLVE context engineering method raises timely abstention rates on WebShop from 26.7 to 57.4 without parameter updates.
OR-Space is a benchmark for LLM agents performing full-lifecycle optimization tasks across Build, Revise, and Explain modes in executable multi-artifact workspaces.
JobBench is a new benchmark with 130 occupational tasks where the best of 36 tested AI models achieves only 45.9% success.
ScaleWoB generates 100+ synthetic interactive GUI environments and 1000+ verifiable tasks as web pages, releasing a 120-task mobile benchmark where state-of-the-art agents achieve 27.92% success (17.82% on long-horizon tasks) versus 92.08% for humans, with synthetic results generalizing to real apps
Introduces QGP and PushBench to evaluate LLM agent persistence on quantitative goals, showing specialized controllers outperform baselines on verifier-checked artifact collection tasks.
Introduces MCP-TDP benchmark showing near-100% attack success on models like GPT-4o for tool description poisoning and proposes reactive self-correction defense.
DART is a modular runtime that certifies semantically recoverable boundaries for failed tool-agent instances and selects admissible restore points that preserve downstream commitments or blocks recovery.
Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.
PrefBench benchmark shows zero-shot LLMs achieve deal rates above 0.99 but seller profits only slightly above random and far below a simple concession heuristic across 7,500 episodes.
SCICONVBENCH is a new benchmark evaluating LLMs on multi-turn disambiguation and inconsistency resolution for task formulation in computational science, with frontier models reaching only 52.7% success on fluid mechanics disambiguation cases.
BioXArena benchmarks LLM agents on generating end-to-end ML pipelines for 76 multi-modal biomedical tasks, with MLEvolve plus Gemini-3.1-Pro scoring highest at 0.666.
π-Bench is a new benchmark for evaluating proactive personal assistant agents on 100 multi-turn tasks that include hidden intents, inter-task dependencies, and cross-session continuity.
ClawForge is a generator framework that creates reproducible executable benchmarks for command-line agents under state conflict, with ClawForge-Bench showing frontier models reach at most 45.3% strict accuracy and that state inspection drives most performance gaps.
RS-Claw enables remote sensing agents to actively explore tools via hierarchical skill trees, achieving up to 86% token compression and outperforming flat registration and RAG baselines on Earth-Bench.
citing papers explorer
-
SpreadsheetBench 2: Evaluating Agents on End-to-End Business Spreadsheet Workflows
SpreadsheetBench 2 provides 321 expert-validated tasks from authentic business data showing frontier LLMs reach only 34.89% overall accuracy on end-to-end spreadsheet workflows.
-
Skill Drift Is Contract Violation: Proactive Maintenance for LLM Agent Skill Libraries
SkillGuard extracts executable environment contracts from LLM skill documents to detect only relevant drifts, reporting zero false positives on 599 cases, 100% precision in known-drift tests, and raising one-round repair success from 10% to 78%.
-
An Empirical Study of Testing Practices in Open Source AI Agent Frameworks and Agentic Applications
Empirical study of open-source AI agents shows testing effort concentrates on deterministic tools and workflows (over 70%) while the FM-based plan body gets under 5% and prompts appear in only 1% of tests.
-
EvidenT: An Evidence-Preserving Framework for Iterative System-Level Package Repair
EvidenT repairs 53.88% of real-world RISC-V system-level package build failures by preserving repair history and build artifacts in a closed-loop validation system, outperforming baselines by a wide margin.
-
Characterizing Faults in Agentic AI: A Taxonomy of Types, Symptoms, and Root Causes
An empirical study of real-world issues yields a taxonomy of 34 fault types, symptoms, and root causes in agentic AI systems, validated by 145 practitioners.
-
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
-
RESOURCE2SKILL: Distilling Executable Agent Skills from Human-Created Multimodal Resources
RESOURCE2SKILL converts multimodal human resources into a hierarchical Skill Wiki of executable agent skills, reporting +11.9 percentage point average gains over no-skill baselines across seven authoring domains.
-
Testing Agentic Workflows with Structural Coverage Criteria
Introduces coordination-graph structural coverage for multi-agent workflows with DSPy-generated tests, evaluated on ten SDK benchmarks covering 49 agents and 403 obligations.
-
Runtime-Structured Task Decomposition for Agentic Coding Systems
Runtime-structured task decomposition reduces retry costs in agentic coding systems by up to 51.7% versus monolithic prompts by rerunning only failed subtasks on two software engineering workloads.
-
Spec Kit Agents: Context-Grounded Agentic Workflows
A multi-agent SDD framework with phase-level context-grounding hooks improves LLM-judged quality by 0.15 points and SWE-bench Lite Pass@1 by 1.7 percent while preserving near-perfect test compatibility.
-
What makes a harness a harness: necessary and sufficient conditions for an agent harness
Proposes and tests a constitutive definition of 'agent harness' via conceptual analysis of literature and six real systems.
-
LLM-Based Multi-Agent Systems for Code Generation: A Multi-Vocal Literature Review
A review of 114 studies classifies motivations into nine categories, analyzes common models and benchmarks, synthesizes challenges into six categories with 26 subcategories and solutions, and identifies six future research directions with 18 subcategories.
- Rethinking Code Review in the Age of AI: A Vision for Agentic Code Review
- AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation