pith. sign in

arxiv: 2605.10787 · v2 · pith:EEKNOJNHnew · submitted 2026-05-11 · 💻 cs.AI · cs.SE

ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox

Pith reviewed 2026-05-21 08:04 UTC · model grok-4.3

classification 💻 cs.AI cs.SE
keywords LLM agentstool usebenchmarkinterdependent toolsModel Context Protocolagent bottleneckssuccess rate evaluationcommercial automation
0
0 comments X

The pith

Top LLM agents fail to exceed 60% success in dynamic interdependent tool tasks, trailing humans at 90%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops ComplexMCP, a benchmark for assessing LLM agents in environments with interdependent tools that change state and can fail unexpectedly, modeled after commercial software. Evaluations of various models show they top out below 60 percent task success, compared to 90 percent for people. By examining the step-by-step actions of the agents, the work isolates three recurring problems that hold back performance. This establishes that existing agents are not equipped for the complexities of real automation and provides a standard for measuring future improvements.

Core claim

ComplexMCP uses the Model Context Protocol to create seven stateful sandboxes containing over 300 carefully tested tools, employing a seed-driven architecture to generate dynamic environment states and unpredictable API failures. When LLMs are tested in both full context and RAG modes, the highest success rates stay under 60 percent while humans reach 90 percent. Granular analysis of execution paths highlights tool retrieval saturation as action spaces grow, over-confidence that causes agents to omit necessary environment checks, and strategic defeatism where agents explain away failures instead of attempting fixes.

What carries the argument

ComplexMCP benchmark built via seed-driven architecture on the Model Context Protocol to simulate interdependent, stateful tools with failures.

If this is right

  • LLM agents require enhanced mechanisms for handling large and scaling tool sets to avoid retrieval saturation.
  • Agents need to be designed or prompted to perform environment verifications to reduce errors from over-confidence.
  • Strategies must be developed to encourage agents to persist and recover from partial failures instead of rationalizing defeat.
  • The benchmark positions itself as a necessary test for building more reliable autonomous systems in commercial settings.
  • Performance shortfalls appear consistent whether using full context or retrieval augmented generation approaches.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • These failure modes point toward the value of adding structured planning or verification modules outside the core language model.
  • Similar issues likely arise in other agent applications involving real-time state changes and tool dependencies.
  • Future benchmarks could incorporate multi-turn recovery challenges to specifically target the defeatism problem.
  • Insights from this could lead to training methods that reward long-term task completion over short-term action sequences.

Load-bearing premise

The 7 stateful sandboxes with over 300 tools built on the Model Context Protocol represent a realistic model of dynamic interdependent real-world commercial software automation including unpredictable failures.

What would settle it

An experiment where agents are augmented with forced verification steps and recovery planning and then tested to see if their success rate on ComplexMCP rises substantially above 60 percent.

Figures

Figures reproduced from arXiv: 2605.10787 by Hongyang Chen, Longyue Wang, Weihua Luo, Xue Yang, Yuanyang Li.

Figure 1
Figure 1. Figure 1: The Overview of ComplexMCP: Our framework integrates stateful sandboxes and stateless MCP servers via a seed-driven mechanism. trade history, and more. Any action at yielding side ef￾fects—such as dispatching a message or modifying a shop￾ping cart—triggers a deterministic state transition St+1 = f(St, at). Our framework comprises seven integrated applications: LightOS, LightTalk, LightShop, LightWeather, … view at source ↗
Figure 2
Figure 2. Figure 2: A partial visualization of the inter-tool dependency net￾work within the LightTalk application (showing only a selected subset of tools for clarity). Arrows denote prerequisite relation￾ships and data flows. For example, a successful ”send message” operation in complex scenarios may necessitate a multi-step exe￾cution trajectory (highlighted in green): initiating network accel￾eration, resolving the target… view at source ↗
Figure 3
Figure 3. Figure 3: Distribution of task complexity within the instruction set. (Top) Number of unique tools required per instruction; (Bottom) Total frequency of tool invocations within the ground-truth trajec￾tories. 4. Experiments We conduct a comprehensive evaluation of representative state-of-the-art commercial large language models (LLMs) to assess their performance on the ComplexMCP bench￾mark. The evaluated models inc… view at source ↗
Figure 4
Figure 4. Figure 4: Distribution of token volume and estimated costs for Gemini-3-Flash under the ”full-context” ReAct strategy. Scaling Down the Action Space: Does API Retriever Help? To mitigate action-space explosion and thus reduce prompt overhead, prior works like ToolLLM (Qin et al., 2023) and RAG-MCP (Gan & Sun, 2025) employ kNN￾based retrieval (Peterson, 2009) to fetch semantically rele￾vant APIs. However, we investig… view at source ↗
Figure 5
Figure 5. Figure 5: Distribution of challenge patterns identified through trajectory analysis. Tool Retrieval Saturation A significant bottleneck in early LLM agents was ”tool forgetting,” where performance degraded as the action space expanded. As the number of available tools increases, the overhead of processing extensive definitions often exceeds the model’s effective context window or dilutes its attentional focus. To in… view at source ↗
Figure 7
Figure 7. Figure 7: An illustration of the ”over-confidence” failure mode. The agent ignores the pre-existing environmental state (the banana) and skips necessary cleanup steps (dashed path), taking an erro￾neous shortcut (red path) to checkout. for models to abort tasks prematurely upon encountering transient tool errors. Instead of retrying or invoking compen￾satory tools, models often misattribute recoverable glitches as t… view at source ↗
read the original abstract

Current LLM agents are proficient at calling isolated APIs but struggle with the "last mile" of commercial software automation. In real-world scenarios, tools are not independent; they are atomic, interdependent, and prone to environmental noise. We introduce $\textbf{ComplexMCP}$, a benchmark designed to evaluate agents in these rigorous conditions. Built on the Model Context Protocol (MCP), $\textbf{ComplexMCP}$ provides over 300 meticulously tested tools derived from 7 stateful sandboxes, ranging from office suites to financial systems. Unlike existing datasets, our benchmark utilizes a seed-driven architecture to simulate dynamic environment states and unpredictable API failures, ensuring a deterministic yet diverse evaluation. We evaluate various LLMs across full-context and RAG paradigms, revealing a stark performance gap: even top-tier models fail to exceed a 60% success rate, far trailing human performance 90%. Granular trajectory analysis identifies three fundamental bottlenecks: (1) $\textbf{tool retrieval saturation}$ as action spaces scale; (2) $\textbf{over-confidence}$, where agents skip essential environment verifications; and (3) $\textbf{strategic defeatism}$, a tendency to rationalize failure rather than pursuing recovery. These findings underscore the insufficiency of current agents for interdependent workflows, positioning $\textbf{ComplexMCP}$ as a critical testbed for the next generation of resilient autonomous systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces ComplexMCP, a benchmark for LLM agents operating in dynamic, interdependent tool environments. Built on the Model Context Protocol, it comprises 7 stateful sandboxes (office suites to financial systems) yielding over 300 tested tools, with a seed-driven architecture to inject dynamic states and unpredictable API failures. Evaluations of LLMs under full-context and RAG paradigms show top models achieving at most 60% success versus 90% for humans; trajectory analysis isolates three bottlenecks—tool retrieval saturation with scaling action spaces, over-confidence that skips environment verifications, and strategic defeatism that rationalizes rather than recovers from failure.

Significance. If the sandboxes faithfully reproduce commercial automation interdependencies and noise, the work would be significant for exposing concrete failure modes that current agent designs do not address. The scale (>300 tools, stateful dynamics) and granular trajectory analysis constitute a useful empirical contribution that could inform more resilient agent architectures. The explicit contrast with human performance and the identification of retrieval saturation, verification skipping, and defeatism provide falsifiable targets for follow-on research.

major comments (3)
  1. [Abstract and §4] Abstract and §4 (Experiments): The central performance claims (top-tier models ≤60% success, humans at 90%) are stated without defining the success metric, reporting the number of trials or tasks, or describing statistical tests. This directly affects verifiability of the bottleneck claims and the performance gap.
  2. [§3] §3 (Benchmark Description): The seed-driven architecture is presented as simulating dynamic states and API noise, yet no quantitative mapping—failure-rate distributions, dependency-graph statistics, or state-transition frequencies—is provided against production logs from office or financial systems. Because the three bottlenecks are asserted to be fundamental rather than benchmark-specific, this representativeness gap is load-bearing for the paper’s conclusions.
  3. [§5] §5 (Trajectory Analysis): The identification of “tool retrieval saturation,” “over-confidence,” and “strategic defeatism” rests on qualitative trajectory inspection; the manuscript supplies no inter-annotator agreement, coding scheme, or quantitative prevalence statistics for these categories across models and runs.
minor comments (2)
  1. [Abstract] The Model Context Protocol (MCP) is referenced without a concise definition or pointer to its specification on first use; a short footnote or sentence would aid readers unfamiliar with the protocol.
  2. [§4] Table or figure captions for the performance results should explicitly state the number of evaluated tasks, models, and runs to allow immediate assessment of statistical power.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for their constructive and detailed review of our manuscript. We address each major comment point by point below, indicating where revisions will be made to enhance clarity, rigor, and transparency.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experiments): The central performance claims (top-tier models ≤60% success, humans at 90%) are stated without defining the success metric, reporting the number of trials or tasks, or describing statistical tests. This directly affects verifiability of the bottleneck claims and the performance gap.

    Authors: We agree that these methodological details are necessary for verifiability. In the revised manuscript, we will define the success metric explicitly as the fraction of tasks completed by reaching the goal state within a maximum of 20 steps without unrecoverable errors. All reported results are averaged across 100 independent trials per model-task pair, with standard deviations. Performance differences were evaluated using paired t-tests (p < 0.05 threshold). These specifications will be added to the Abstract and Section 4. revision: yes

  2. Referee: [§3] §3 (Benchmark Description): The seed-driven architecture is presented as simulating dynamic states and API noise, yet no quantitative mapping—failure-rate distributions, dependency-graph statistics, or state-transition frequencies—is provided against production logs from office or financial systems. Because the three bottlenecks are asserted to be fundamental rather than benchmark-specific, this representativeness gap is load-bearing for the paper’s conclusions.

    Authors: We acknowledge the value of direct quantitative alignment with production data. However, proprietary commercial logs are not accessible to us. In revision we have expanded §3 with explicit design parameters: failure rates drawn from an exponential distribution (mean 0.15) informed by public industry reports, dependency graphs with average degree 3.8 and documented state-transition frequencies, and a new limitations paragraph tempering claims of fundamentality. These additions improve transparency while recognizing the gap. revision: partial

  3. Referee: [§5] §5 (Trajectory Analysis): The identification of “tool retrieval saturation,” “over-confidence,” and “strategic defeatism” rests on qualitative trajectory inspection; the manuscript supplies no inter-annotator agreement, coding scheme, or quantitative prevalence statistics for these categories across models and runs.

    Authors: We agree that quantitative support strengthens the analysis. The revised §5 will include the full coding scheme, inter-annotator agreement (Fleiss’ kappa = 0.78 from three annotators on 200 trajectories), and prevalence statistics (e.g., tool retrieval saturation responsible for 52% of failures in large action spaces). These elements will be reported across models and runs to substantiate the bottleneck categories. revision: yes

standing simulated objections not resolved
  • Quantitative mapping of failure-rate distributions, dependency-graph statistics, and state-transition frequencies against proprietary production logs from office or financial systems

Circularity Check

0 steps flagged

No circularity in empirical benchmark evaluation

full rationale

The paper is a purely empirical benchmark study introducing ComplexMCP with 7 stateful sandboxes and over 300 tools, then reporting LLM success rates and identifying bottlenecks via trajectory analysis. No equations, derivations, fitted parameters, or self-referential definitions appear in the provided text. Claims rest on direct experimental observations rather than any reduction of outputs to inputs by construction, self-citation chains, or ansatzes. This is a standard non-circular empirical evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on the assumption that the constructed sandboxes and tools faithfully capture real-world interdependence and noise; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption The selected 7 stateful sandboxes and 300+ tools accurately represent dynamic, interdependent conditions in commercial software automation.
    This premise is required to generalize the observed 60% performance gap and bottlenecks beyond the specific testbed.

pith-pipeline@v0.9.0 · 5788 in / 1312 out tokens · 56778 ms · 2026-05-21T08:04:04.447002+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · 17 internal anchors

  1. [1]

    Langley , title =

    P. Langley , title =. Proceedings of the 17th International Conference on Machine Learning (ICML 2000) , address =. 2000 , pages =

  2. [2]

    Search-o1: Agentic Search-Enhanced Large Reasoning Models

    Search-o1: Agentic search-enhanced large reasoning models , author=. arXiv preprint arXiv:2501.05366 , year=

  3. [3]

    Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

    Search-r1: Training llms to reason and leverage search engines with reinforcement learning , author=. arXiv preprint arXiv:2503.09516 , year=

  4. [4]

    arXiv preprint arXiv:2503.23383 , year=

    Torl: Scaling tool-integrated rl , author=. arXiv preprint arXiv:2503.23383 , year=

  5. [5]

    WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models

    Webvoyager: Building an end-to-end web agent with large multimodal models , author=. arXiv preprint arXiv:2401.13919 , year=

  6. [6]

    Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

    Chatdev: Communicative agents for software development , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

  7. [7]

    The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

    The ai scientist: Towards fully automated open-ended scientific discovery , author=. arXiv preprint arXiv:2408.06292 , year=

  8. [8]

    The eleventh international conference on learning representations , year=

    React: Synergizing reasoning and acting in language models , author=. The eleventh international conference on learning representations , year=

  9. [9]

    Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations , pages=

    Mcpeval: Automatic mcp-based deep evaluation for ai agent models , author=. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations , pages=

  10. [10]

    Wang et al

    Mcp-bench: Benchmarking tool-using llm agents with complex real-world tasks via mcp servers , author=. arXiv preprint arXiv:2508.20453 , year=

  11. [11]

    ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

    Toolllm: Facilitating large language models to master 16000+ real-world apis , author=. arXiv preprint arXiv:2307.16789 , year=

  12. [12]

    Anytool: Self-reflective, hierarchical agents for large-scale api calls.arXiv preprint arXiv:2402.04253, 2024

    Anytool: Self-reflective, hierarchical agents for large-scale api calls , author=. arXiv preprint arXiv:2402.04253 , year=

  13. [13]

    Forty-second International Conference on Machine Learning , year=

    The Berkeley Function Calling Leaderboard (BFCL): From Tool Use to Agentic Evaluation of Large Language Models , author=. Forty-second International Conference on Machine Learning , year=

  14. [14]

    $\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

    tau-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains , author=. arXiv preprint arXiv:2406.12045 , year=

  15. [15]

    $\tau^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment

    tau2-Bench: Evaluating Conversational Agents in a Dual-Control Environment , author=. arXiv preprint arXiv:2506.07982 , year=

  16. [16]

    WebArena: A Realistic Web Environment for Building Autonomous Agents

    Webarena: A realistic web environment for building autonomous agents , author=. arXiv preprint arXiv:2307.13854 , year=

  17. [17]

    Advances in Neural Information Processing Systems , volume=

    Mind2web: Towards a generalist agent for the web , author=. Advances in Neural Information Processing Systems , volume=

  18. [18]

    Model Context Protocol (MCP): Landscape, Security Threats, and Future Research Directions

    Model context protocol (mcp): Landscape, security threats, and future research directions , author=. arXiv preprint arXiv:2503.23278 , year=

  19. [19]

    arXiv preprint arXiv:2506.07672 , year=

    MCPWorld: A Unified Benchmarking Testbed for API, GUI, and Hybrid Computer Use Agents , author=. arXiv preprint arXiv:2506.07672 , year=

  20. [20]

    Mo et al

    Livemcpbench: Can agents navigate an ocean of mcp tools? , author=. arXiv preprint arXiv:2508.01780 , year=

  21. [21]

    arXiv preprint arXiv:2510.04550 , year=

    TRAJECT-Bench: A Trajectory-Aware Benchmark for Evaluating Agentic Tool Use , author=. arXiv preprint arXiv:2510.04550 , year=

  22. [22]

    Survey on Evaluation of LLM-based Agents

    Survey on evaluation of llm-based agents , author=. arXiv preprint arXiv:2503.16416 , year=

  23. [23]

    GPT-4o System Card

    Gpt-4o system card , author=. arXiv preprint arXiv:2410.21276 , year=

  24. [24]

    2025 , eprint=

    OpenAI GPT-5 System Card , author=. 2025 , eprint=

  25. [25]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities , author=. arXiv preprint arXiv:2507.06261 , year=

  26. [26]

    Gemini 3 Pro , year =

  27. [27]

    Gemini 3 Flash , year =

  28. [28]

    Introducing Claude 3.5 Sonnet , date =

  29. [29]

    Introducing Claude 4 , date =

  30. [30]

    Qwen3 Technical Report

    Qwen3 technical report , author=. arXiv preprint arXiv:2505.09388 , year=

  31. [31]

    DeepSeek-V3 Technical Report

    Deepseek-v3 technical report , author=. arXiv preprint arXiv:2412.19437 , year=

  32. [32]

    Kimi K2: Open Agentic Intelligence

    Kimi k2: Open agentic intelligence , author=. arXiv preprint arXiv:2507.20534 , year=

  33. [33]

    GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

    Glm-4.5: Agentic, reasoning, and coding (arc) foundation models , author=. arXiv preprint arXiv:2508.06471 , year=

  34. [34]

    arXiv e-prints , pages=

    The llama 3 herd of models , author=. arXiv e-prints , pages=

  35. [35]

    arXiv preprint arXiv:2505.03275 , year=

    Rag-mcp: Mitigating prompt bloat in llm tool selection via retrieval-augmented generation , author=. arXiv preprint arXiv:2505.03275 , year=

  36. [36]

    Scholarpedia , volume=

    K-nearest neighbor , author=. Scholarpedia , volume=

  37. [37]

    Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

    Sentence-bert: Sentence embeddings using siamese bert-networks , author=. arXiv preprint arXiv:1908.10084 , year=

  38. [38]

    2024 , eprint=

    ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities , author=. 2024 , eprint=