super hub Canonical reference

AgentBench: Evaluating LLMs as Agents

Hanchen Zhang, Hanyu Lai, Hao Yu, Xiao Liu, Xuanyu Lei, Yifan Xu · 2023 · cs.AI · arXiv 2308.03688

Canonical reference. 86% of citing Pith papers cite this work as background.

158 Pith papers citing it

Background 86% of classified citations

open full Pith review browse 158 citing papers more from Hanchen Zhang arXiv PDF

abstract

The potential of Large Language Model (LLM) as agents has been widely acknowledged recently. Thus, there is an urgent need to quantitatively \textit{evaluate LLMs as agents} on challenging tasks in interactive environments. We present AgentBench, a multi-dimensional benchmark that consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities. Our extensive test over \num API-based and open-sourced (OSS) LLMs shows that, while top commercial LLMs present a strong ability of acting as agents in complex environments, there is a significant disparity in performance between them and many OSS competitors that are no larger than 70B. We identify the typical reasons of failures in environments and LLMs, showing that poor long-term reasoning, decision-making, and instruction following abilities are the main obstacles for developing usable LLM agents. Improving instruction following and training on high quality multi-round alignment data could improve agent performance. And different from existing assumptions, training on code present ambivalent impacts on different agent tasks. Datasets, environments, and an integrated evaluation package for AgentBench are released at https://github.com/THUDM/AgentBench.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 38 dataset 5 baseline 1

citation-polarity summary

background 38 use dataset 4 baseline 1 unclear 1

claims ledger

abstract The potential of Large Language Model (LLM) as agents has been widely acknowledged recently. Thus, there is an urgent need to quantitatively \textit{evaluate LLMs as agents} on challenging tasks in interactive environments. We present AgentBench, a multi-dimensional benchmark that consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities. Our extensive test over \num API-based and open-sourced (OSS) LLMs shows that, while top commercial LLMs present a strong ability of acting as agents in complex environments, there is a significant disparity in perfo

authors

Hanchen Zhang Hanyu Lai Hao Yu Xiao Liu Xuanyu Lei Yifan Xu

co-cited works

representative citing papers

EnergyAgentBench: Benchmarking LLM Agents on Live Energy Infrastructure Data

econ.EM · 2026-05-13 · accept · novelty 8.0

EnergyAgentBench is a new benchmark with 70 task variants that evaluates LLM agents on live energy data for datacenter siting, long-horizon optimization, and causal grid diagnosis.

WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

cs.CL · 2026-05-11 · unverdicted · novelty 8.0

A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.

SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning

cs.AI · 2026-05-10 · accept · novelty 8.0 · 2 refs

SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.

PhysicianBench: Evaluating LLM Agents in Real-World EHR Environments

cs.AI · 2026-05-04 · conditional · novelty 8.0

PhysicianBench is a new benchmark of 100 physician-reviewed, execution-grounded tasks in live EHR environments where the best LLM agent reaches only 46% success and open-source models reach 19%.

AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

cs.CR · 2024-06-19 · unverdicted · novelty 8.0

AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

cs.AI · 2024-04-11 · accept · novelty 8.0

OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.

LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

cs.CL · 2023-08-28 · unverdicted · novelty 8.0

LongBench is the first bilingual multi-task benchmark for long context understanding in LLMs, containing 21 datasets in 6 categories with average lengths of 6711 words (English) and 13386 characters (Chinese).

CLQT: A Closed-Loop, Cost-Aware, Strategy-Consistent Benchmark for Diagnostic Evaluation of LLM Portfolio-Management Agents

cs.AI · 2026-06-29 · unverdicted · novelty 7.0

CLQT is a new closed-loop, cost-aware benchmark that diagnoses LLM trading agent capabilities through strategy-consistent metrics and hash-verifiable trails rather than outcome rankings.

Agentic Abstention: Do Agents Know When to Stop Instead of Act?

cs.AI · 2026-06-27 · unverdicted · novelty 7.0

LLM agents often fail to abstain at the right time in uncertain multi-turn tasks, and the CONVOLVE context engineering method raises timely abstention rates on WebShop from 26.7 to 57.4 without parameter updates.

OR-Space: A Full-Lifecycle Workspace Benchmark for Industrial Optimization Agents

cs.AI · 2026-05-27 · unverdicted · novelty 7.0

OR-Space is a benchmark for LLM agents performing full-lifecycle optimization tasks across Build, Revise, and Explain modes in executable multi-artifact workspaces.

JobBench: Aligning Agent Work With Human Will

cs.AI · 2026-05-25 · unverdicted · novelty 7.0

JobBench is a new benchmark with 130 occupational tasks where the best of 36 tested AI models achieves only 45.9% success.

Push Your Agent: Measuring and Enforcing Quantitative Goal Persistence in Long-Horizon LLM Agents

cs.LG · 2026-05-22 · unverdicted · novelty 7.0

Introduces QGP and PushBench to evaluate LLM agent persistence on quantitative goals, showing specialized controllers outperform baselines on verifier-checked artifact collection tasks.

DART: Semantic Recoverability for Structured Tool Agents

cs.AI · 2026-05-22 · unverdicted · novelty 7.0

DART is a modular runtime that certifies semantically recoverable boundaries for failed tool-agent instances and selects admissible restore points that preserve downstream commitments or blocks recovery.

Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety

cs.CL · 2026-05-21 · unverdicted · novelty 7.0 · 2 refs

Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.

PrefBench: Evaluating Zero-Shot LLM Agents in Hidden-Preference Personalized Pricing Negotiations

cs.GT · 2026-05-19 · accept · novelty 7.0

PrefBench benchmark shows zero-shot LLMs achieve deal rates above 0.99 but seller profits only slightly above random and far below a simple concession heuristic across 7,500 episodes.

SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science

cs.AI · 2026-05-18 · unverdicted · novelty 7.0

SCICONVBENCH is a new benchmark evaluating LLMs on multi-turn disambiguation and inconsistency resolution for task formulation in computational science, with frontier models reaching only 52.7% success on fluid mechanics disambiguation cases.

BioXArena: Benchmarking LLM Agents on Multi-Modal Biomedical Machine Learning Tasks

cs.CE · 2026-05-15 · unverdicted · novelty 7.0

BioXArena benchmarks LLM agents on generating end-to-end ML pipelines for 76 multi-modal biomedical tasks, with MLEvolve plus Gemini-3.1-Pro scoring highest at 0.666.

$\pi$-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows

cs.AI · 2026-05-14 · unverdicted · novelty 7.0 · 2 refs

π-Bench is a new benchmark for evaluating proactive personal assistant agents on 100 multi-turn tasks that include hidden intents, inter-task dependencies, and cross-session continuity.

ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents

cs.AI · 2026-05-13 · unverdicted · novelty 7.0 · 2 refs

ClawForge is a generator framework that creates reproducible executable benchmarks for command-line agents under state conflict, with ClawForge-Bench showing frontier models reach at most 45.3% strict accuracy and that state inspection drives most performance gaps.

RS-Claw: Progressive Active Tool Exploration via Hierarchical Skill Trees for Remote Sensing Agents

cs.AI · 2026-05-13 · unverdicted · novelty 7.0

RS-Claw enables remote sensing agents to actively explore tools via hierarchical skill trees, achieving up to 86% token compression and outperforming flat registration and RAG baselines on Earth-Bench.

Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack

cs.AI · 2026-05-12 · conditional · novelty 7.0

BenchJack audits 10 AI agent benchmarks, synthesizes exploits achieving near-perfect scores without task completion, surfaces 219 flaws, and reduces hackable-task ratios to under 10% on four benchmarks via iterative patching.

RecoAtlas: From Semantic Plausibility to Set-Level Utility in LLM Recommendation Agents

cs.IR · 2026-05-11 · unverdicted · novelty 7.0

RecoAtlas is a benchmark that evaluates LLM recommendation agents on behavior-grounded metrics for relevance, complementarity, and diversity in addition to semantic coherence.

Skill Drift Is Contract Violation: Proactive Maintenance for LLM Agent Skill Libraries

cs.SE · 2026-05-09 · conditional · novelty 7.0

SkillGuard extracts executable environment contracts from LLM skill documents to detect only relevant drifts, reporting zero false positives on 599 cases, 100% precision in known-drift tests, and raising one-round repair success from 10% to 78%.

Beyond the All-in-One Agent: Benchmarking Role-Specialized Multi-Agent Collaboration in Enterprise Workflows

cs.MA · 2026-05-09 · unverdicted · novelty 7.0

EntCollabBench shows that today's LLM agents still struggle with delegation, context transfer, parameter grounding, workflow closure, and decision commitment when tested in a simulated enterprise with 11 role-specialized agents.

citing papers explorer

Showing 50 of 158 citing papers.

From Standalone LLMs to Integrated Intelligence: A Survey of Compound Al Systems cs.MA · 2025-06-05 · accept · none · ref 104 · internal anchor
A survey that defines Compound AI Systems, proposes a multi-dimensional taxonomy based on component roles and orchestration strategies, reviews four foundational paradigms, and identifies key challenges for future research.
Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games cs.AI · 2025-06-04 · unverdicted · none · ref 5 · internal anchor
Orak is a foundational benchmark providing training data, interfaces, and evaluation tools for LLM agents across diverse video game genres.
MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering cs.CL · 2024-10-09 · unverdicted · none · ref 22 · internal anchor
MLE-bench evaluates frontier language models as ML engineering agents on 75 Kaggle competitions, with the top setup (o1-preview + AIDE) reaching bronze medal level in 16.9% of tasks.
Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark cs.AI · 2024-10-06 · unverdicted · none · ref 28 · internal anchor
PolyMATH is a new 5,000-image benchmark where top MLLMs reach at most 41 percent accuracy on multi-modal mathematical reasoning, with ablation showing minimal gain from text over images.
$\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains cs.AI · 2024-06-17 · unverdicted · none · ref 14 · internal anchor
τ-bench shows state-of-the-art agents like GPT-4o succeed on under 50% of tool-using, rule-following tasks and are inconsistent across repeated trials.
WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks? cs.LG · 2024-03-12 · unverdicted · none · ref 14 · internal anchor
WorkArena benchmark shows LLM web agents achieve partial success on enterprise tasks but have a substantial gap to full automation and perform worse with open-source models.
GAIA: a benchmark for General AI Assistants cs.CL · 2023-11-21 · unverdicted · none · ref 55 · 2 links · internal anchor
GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.
OSWorld2.0: Benchmarking Computer Use Agents on Long-Horizon Real-World Tasks cs.AI · 2026-06-28 · unverdicted · none · ref 54 · internal anchor
OSWorld 2.0 is a benchmark of 108 realistic long-horizon computer-use tasks where current agents achieve only 20.6% binary completion, struggling with state inference and constraint tracking.
SciVisAgentSkills: Design and Evaluation of Agent Skills for Scientific Data Analysis and Visualization cs.AI · 2026-06-04 · unverdicted · none · ref 30 · internal anchor
SciVisAgentSkills provides reusable agent skills that raise mean task scores on a 108-task SciVis benchmark when paired with Codex and Claude Code agents.
unix-ctf: Procedural Environments for Unix-Competence Reinforcement Learning cs.CR · 2026-05-27 · unverdicted · none · ref 6 · internal anchor
unix-ctf procedurally generates 656 Unix CTF tasks across 155 techniques; fine-tuning Qwen3-8B on them raises solve rate from 11.6% to 43.6% on a 15-skill holdout and yields +33 pp in Forensics on InterCode-CTF.
Got a Secret? LLM Agents Can't Keep It: Evaluating Privacy in Multi-Agent Systems cs.AI · 2026-05-26 · unverdicted · none · ref 18 · internal anchor
Multi-agent social simulations show LLM privacy violations rising from 19.95% to 45.30%, with leakage spreading contagiously (8x after peer disclosure) and explicit instructions leaving rates above 37.8%.
ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence cs.AI · 2026-05-25 · unverdicted · none · ref 36 · internal anchor
ScientistOne introduces Chain-of-Evidence and an audit system that achieves zero hallucinated references, perfect score verification, and top method-code alignment while matching or beating human experts on five frontier tasks and generalizing to six more.
When Planning Fails Despite Correct Execution: On Epistemic Calibration for LLM-Based Multi-Agent Systems cs.AI · 2026-05-22 · unverdicted · none · ref 82 · internal anchor
Introduces EPC-AW to mitigate epistemic miscalibration in LLM multi-agent planning via consistency-based selection and refinement, reporting 9.75% average success improvement.
Design and Report Benchmarks for Knowledge Work cs.AI · 2026-05-22 · unverdicted · none · ref 64 · internal anchor
Proposes a three-step benchmark design method (define work activity, specify tested setting, score work product) derived from work studies and O*NET, demonstrated via three case analyses.
SynAE: A Framework for Measuring the Quality of Synthetic Data for Tool-Calling Agent Evaluations cs.CL · 2026-05-21 · unverdicted · none · ref 16 · internal anchor
SynAE is a multi-metric framework that evaluates how well synthetic benchmarks replicate real data characteristics for multi-turn tool-calling agent testing.
The Scaling Laws of Skills in LLM Agent Systems cs.CL · 2026-05-15 · unverdicted · none · ref 17 · internal anchor
Empirical analysis across 15 LLMs and 1,141 skills identifies a logarithmic routing decay law and a multiplicative execution law coupled by a single fitted slope parameter b that enables targeted library optimizations improving routing accuracy and downstream task pass rates.
Exploiting LLM Agent Supply Chains via Payload-less Skills cs.CR · 2026-05-14 · conditional · none · ref 17 · internal anchor
Semantic Compliance Hijacking lets attackers hijack LLM agents by disguising malicious instructions as compliance rules in skills, reaching up to 77.67% success on confidentiality breaches and 67.33% on RCE while evading all tested scanners.
Domain Restriction via Multi SAE Layer Transitions cs.AI · 2026-05-12 · unverdicted · none · ref 18 · internal anchor
Multi-layer SAE transitions capture domain-specific signatures that distinguish OOD texts in Gemma-2 models.
On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment cs.AI · 2026-05-12 · unverdicted · none · ref 23 · internal anchor
FATE lets LLM agents self-evolve safer behaviors by generating and filtering repairs from their own failure trajectories using verifiers and Pareto optimization.
HAGE: Harnessing Agentic Memory via RL-Driven Weighted Graph Evolution cs.AI · 2026-05-11 · unverdicted · none · ref 58 · internal anchor
HAGE proposes a trainable weighted graph memory framework with LLM intent classification, dynamic edge modulation, and RL optimization that improves long-horizon reasoning accuracy in agentic LLMs over static baselines.
TIDE-Bench: Task-Aware and Diagnostic Evaluation of Tool-Integrated Reasoning cs.AI · 2026-05-10 · unverdicted · none · ref 20 · internal anchor
TIDE-Bench is a new benchmark for tool-integrated reasoning that combines diverse tasks, multi-aspect metrics covering answer quality, process reliability, efficiency and cost, plus filtered challenging test sets.
OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces cs.AI · 2026-05-09 · unverdicted · none · ref 136 · internal anchor
OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.
AgentCollabBench: Diagnosing When Good Agents Make Bad Collaborators cs.CL · 2026-05-09 · unverdicted · none · ref 26 · internal anchor
AgentCollabBench shows that multi-agent reliability is limited by communication topology, with converging-DAG nodes causing synthesis bottlenecks that discard constraints and explain 7-40% of information loss variance.
EvidenT: An Evidence-Preserving Framework for Iterative System-Level Package Repair cs.SE · 2026-05-09 · unverdicted · none · ref 26 · internal anchor
EvidenT repairs 53.88% of real-world RISC-V system-level package build failures by preserving repair history and build artifacts in a closed-loop validation system, outperforming baselines by a wide margin.
Why Does Agentic Safety Fail to Generalize Across Tasks? cs.LG · 2026-05-07 · conditional · none · ref 67 · internal anchor
Agentic safety fails to generalize across tasks because the task-to-safe-controller mapping has a higher Lipschitz constant than the task-to-controller mapping alone, as proven in linear-quadratic control and demonstrated in quadcopter and LLM experiments.
Beyond Task Success: Measuring Workflow Fidelity in LLM-Based Agentic Payment Systems cs.AI · 2026-05-07 · unverdicted · none · ref 10 · internal anchor
ASR, a new trajectory-fidelity metric, detects that 10 of 18 LLMs skip confirmation steps in payment agents despite perfect scores on prior metrics, and ASR-guided refinements improve task success by up to 93.8 percentage points.
Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges cs.AI · 2026-05-07 · unverdicted · none · ref 28 · internal anchor
LLM safety judges flip verdicts on equivalent policy rewrites up to 9.1% of the time and cannot distinguish meaningful from meaningless changes, requiring new invariance-based reliability metrics.
NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles cs.AI · 2026-05-03 · unverdicted · none · ref 17 · 2 links · internal anchor
NeuroState-Bench supplies human-calibrated tasks and probes that measure commitment integrity in LLM agents and shows this measure diverges from ordinary task success.
Evaluating Agentic AI in the Wild: Failure Modes, Drift Patterns, and a Production Evaluation Framework cs.AI · 2026-05-02 · unverdicted · none · ref 8 · internal anchor
The paper presents a taxonomy of seven production-specific failure modes for agentic AI, demonstrates that existing metrics fail to detect four of them entirely, and proposes the PAEF five-dimension framework for continuous production evaluation with an open-source implementation.
AgentFloor: How Far Up the tool use Ladder Can Small Open-Weight Models Go? cs.AI · 2026-05-01 · unverdicted · none · ref 25 · internal anchor
Small open-weight models match GPT-5 on routine agent tool-use tasks but lag on long-horizon planning, supporting tiered routing to reduce costs in agentic systems.
Trace-Level Analysis of Information Contamination in Multi-Agent Systems cs.AI · 2026-04-30 · unverdicted · none · ref 20 · internal anchor
Agent workflows can diverge substantially from contaminated inputs yet recover correct answers, or stay similar while failing, as measured by trace divergence on GAIA tasks.
LATTICE: Evaluating Decision Support Utility of Crypto Agents cs.CR · 2026-04-29 · unverdicted · none · ref 18 · internal anchor
LATTICE is a scalable LLM-judge benchmark for crypto agent decision support that reveals performance trade-offs among real-world copilots across dimensions and tasks.
CHORUS: An Agentic Framework for Generating Realistic Deliberation Data cs.AI · 2026-04-22 · unverdicted · none · ref 9 · internal anchor
Chorus generates realistic deliberation discussions via LLM agents with memory and Poisson-timed participation, validated by 30 experts on realism, coherence, and utility.
An AI Agent Execution Environment to Safeguard User Data cs.CR · 2026-04-21 · unverdicted · none · ref 36 · internal anchor
GAAP guarantees confidentiality of private user data for AI agents by enforcing user-specified permissions deterministically through persistent information flow tracking, without trusting the agent or requiring attack-free models.
How Adversarial Environments Mislead Agentic AI? cs.AI · 2026-04-20 · unverdicted · none · ref 7 · internal anchor
Adversarial compromise of tool outputs misleads agentic AI via breadth and depth attacks, revealing that epistemic and navigational robustness are distinct and often trade off against each other.
QRAFTI: An Agentic Framework for Empirical Research in Quantitative Finance cs.MA · 2026-04-20 · unverdicted · none · ref 25 · 2 links · internal anchor
QRAFTI is a multi-agent framework using tool-calling and reflection-based planning to emulate quant research tasks like factor replication and signal testing on financial data.
AIT Academy: Cultivating the Complete Agent with a Confucian Three-Domain Curriculum cs.AI · 2026-04-20 · unverdicted · none · ref 1 · internal anchor
AIT Academy introduces a tripartite curriculum for AI agents across natural science, humanities, and social science domains, with reported gains of 15.9 points in security and 7 points in social reasoning under specific scheduling.
SLALOM: Simulation Lifecycle Analysis via Longitudinal Observation Metrics for Social Simulation cs.MA · 2026-04-13 · unverdicted · none · ref 16 · internal anchor
SLALOM uses phase constraints called gates and dynamic time warping to quantitatively measure whether simulated social trajectories follow empirically plausible paths instead of just reaching correct end states.
ForkKV: Scaling Multi-LoRA Agent Serving via Copy-on-Write Disaggregated KV Cache cs.DC · 2026-04-07 · unverdicted · none · ref 35 · internal anchor
ForkKV uses copy-on-write disaggregated KV cache with DualRadixTree and ResidualAttention kernels to deliver up to 3x throughput over prior multi-LoRA serving systems with negligible quality loss.
Do Agent Societies Develop Intellectual Elites? The Hidden Power Laws of Collective Cognition in LLM Multi-Agent Systems cs.MA · 2026-04-03 · unverdicted · none · ref 35 · internal anchor
LLM agent societies develop power-law coordination cascades and intellectual elites through an integration bottleneck that grows with system size.
Beyond Benchmark Islands: Toward Representative Trustworthiness Evaluation for Agentic AI cs.CL · 2026-03-16 · unverdicted · none · ref 12 · internal anchor
Defines agentic trustworthiness via five properties and proposes HAAF, a scenario-distribution framework with a Trustworthy Optimization Factory that transfers interventions across 13 models from seven families on a 100-scenario suite.
Characterizing Faults in Agentic AI: A Taxonomy of Types, Symptoms, and Root Causes cs.SE · 2026-03-06 · unverdicted · none · ref 24 · internal anchor
An empirical study of real-world issues yields a taxonomy of 34 fault types, symptoms, and root causes in agentic AI systems, validated by 145 practitioners.
SoK: Agentic Skills -- Beyond Tool Use in LLM Agents cs.CR · 2026-02-24 · unverdicted · none · ref 75 · internal anchor
The paper systematizes agentic skills beyond tool use, providing design pattern and representation-scope taxonomies plus security analysis of malicious skill infiltration in agent marketplaces.
NeuroSymActive: Differentiable Neural-Symbolic Reasoning with Active Exploration for Knowledge Graph Question Answering cs.CL · 2026-02-17 · unverdicted · none · ref 72 · internal anchor
NeuroSymActive combines soft-unification symbolic modules, a neural path evaluator, and Monte-Carlo-style active exploration to reach strong answer accuracy on KGQA benchmarks while cutting graph lookups and model calls versus standard retrieval baselines.
Learning to Configure Agentic AI Systems cs.AI · 2026-02-12 · unverdicted · none · ref 10 · 2 links · internal anchor
ARC learns per-query agent configurations via a lightweight hierarchical SMDP policy, delivering 31.3% higher reasoning accuracy, 13.95% higher tool-use accuracy, and doubled success on an agent benchmark compared to budget-matched baselines.
AgentXRay: White-Boxing Agentic Systems via Workflow Reconstruction cs.AI · 2026-02-05 · unverdicted · none · ref 11 · internal anchor
AgentXRay formulates workflow reconstruction as combinatorial optimization and uses Monte Carlo Tree Search with Red-Black Pruning to approximate black-box agent behaviors via output-based proxy metrics.
MirrorBench: A Benchmark to Evaluate Conversational User-Proxy Agents for Human-Likeness cs.AI · 2026-01-13 · unverdicted · none · ref 22 · internal anchor
MirrorBench defines a reproducible benchmark combining lexical metrics (MATTR, Yule's K, HD-D) and LLM-judge metrics with calibration controls to measure human-likeness of user-proxy agents across four datasets.
CryptoBench: A Dynamic Benchmark for Expert-Level Evaluation of LLM Agents in Cryptocurrency cs.CL · 2025-11-29 · unverdicted · none · ref 6 · internal anchor
CryptoBench is a new dynamic benchmark for LLM agents in cryptocurrency that reveals a retrieval-prediction imbalance in model performance.
Exploring the Secondary Risks of Large Language Models cs.LG · 2025-06-14 · unverdicted · none · ref 31 · internal anchor
Introduces secondary risks as a new class of LLM failures from benign prompts, defines two primitives, proposes SecLens search framework, and releases SecRiskBench showing risks are widespread across 16 models.
DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents cs.CL · 2025-06-13 · conditional · none · ref 14 · internal anchor
DeepResearch Bench supplies 100 expert-crafted PhD-level tasks and two human-aligned evaluation frameworks to measure deep research agents on report quality and citation accuracy.

AgentBench: Evaluating LLMs as Agents

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer