LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries

arxiv: 2508.15760 · v2 · pith:IS7AL2KEnew · submitted 2025-08-21 · 💻 cs.CL · cs.AI

Ming Yin, Dinghan Shen, Silei Xu, Sixun Dong, Mian Zhang, Yebowen Hu, Shujian Liu, Jianbing Han, Simin Ma, Song Wang, Sathish Reddy Indurthi, Xun Wang, Yiran Chen, Kaiqiang Song

keywords tool agent livemcp-101 real-world tools agents calling multi-step

Tool calling has emerged as a critical capability for AI agents. In contrast to conventional tool calling frameworks that rely on static, provider-specific tool definitions, the Model Context Protocol (MCP) offers a unified interface to discover and invoke tools dynamically. However, there is a significant gap in benchmarking multi-step tasks using diverse MCP tools in realistic, dynamic scenarios. In this work, we present LiveMCP-101, a benchmark of 101 real-world queries that require coordinated use of multiple MCP tools. To address temporal variability in real-world tool responses, we introduce a parallel evaluation framework where a reference agent executes a validated plan simultaneously to produce real-time reference outputs. Experiments show that even frontier LLMs achieve a success rate below 60%, highlighting challenges in multi-step tool use. Comprehensive error analysis identifies seven failure modes spanning tool planning, parameterization, and output handling, pointing to concrete directions for improving current models. LiveMCP-101 sets a rigorous standard for evaluating real-world agent capabilities, advancing toward autonomous agent systems that reliably execute complex tasks through MCP tool orchestration.
