ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox
Pith reviewed 2026-05-21 08:04 UTC · model grok-4.3
The pith
Top LLM agents fail to exceed 60% success in dynamic interdependent tool tasks, trailing humans at 90%.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ComplexMCP uses the Model Context Protocol to create seven stateful sandboxes containing over 300 carefully tested tools, employing a seed-driven architecture to generate dynamic environment states and unpredictable API failures. When LLMs are tested in both full context and RAG modes, the highest success rates stay under 60 percent while humans reach 90 percent. Granular analysis of execution paths highlights tool retrieval saturation as action spaces grow, over-confidence that causes agents to omit necessary environment checks, and strategic defeatism where agents explain away failures instead of attempting fixes.
What carries the argument
ComplexMCP benchmark built via seed-driven architecture on the Model Context Protocol to simulate interdependent, stateful tools with failures.
If this is right
- LLM agents require enhanced mechanisms for handling large and scaling tool sets to avoid retrieval saturation.
- Agents need to be designed or prompted to perform environment verifications to reduce errors from over-confidence.
- Strategies must be developed to encourage agents to persist and recover from partial failures instead of rationalizing defeat.
- The benchmark positions itself as a necessary test for building more reliable autonomous systems in commercial settings.
- Performance shortfalls appear consistent whether using full context or retrieval augmented generation approaches.
Where Pith is reading between the lines
- These failure modes point toward the value of adding structured planning or verification modules outside the core language model.
- Similar issues likely arise in other agent applications involving real-time state changes and tool dependencies.
- Future benchmarks could incorporate multi-turn recovery challenges to specifically target the defeatism problem.
- Insights from this could lead to training methods that reward long-term task completion over short-term action sequences.
Load-bearing premise
The 7 stateful sandboxes with over 300 tools built on the Model Context Protocol represent a realistic model of dynamic interdependent real-world commercial software automation including unpredictable failures.
What would settle it
An experiment where agents are augmented with forced verification steps and recovery planning and then tested to see if their success rate on ComplexMCP rises substantially above 60 percent.
Figures
read the original abstract
Current LLM agents are proficient at calling isolated APIs but struggle with the "last mile" of commercial software automation. In real-world scenarios, tools are not independent; they are atomic, interdependent, and prone to environmental noise. We introduce $\textbf{ComplexMCP}$, a benchmark designed to evaluate agents in these rigorous conditions. Built on the Model Context Protocol (MCP), $\textbf{ComplexMCP}$ provides over 300 meticulously tested tools derived from 7 stateful sandboxes, ranging from office suites to financial systems. Unlike existing datasets, our benchmark utilizes a seed-driven architecture to simulate dynamic environment states and unpredictable API failures, ensuring a deterministic yet diverse evaluation. We evaluate various LLMs across full-context and RAG paradigms, revealing a stark performance gap: even top-tier models fail to exceed a 60% success rate, far trailing human performance 90%. Granular trajectory analysis identifies three fundamental bottlenecks: (1) $\textbf{tool retrieval saturation}$ as action spaces scale; (2) $\textbf{over-confidence}$, where agents skip essential environment verifications; and (3) $\textbf{strategic defeatism}$, a tendency to rationalize failure rather than pursuing recovery. These findings underscore the insufficiency of current agents for interdependent workflows, positioning $\textbf{ComplexMCP}$ as a critical testbed for the next generation of resilient autonomous systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ComplexMCP, a benchmark for LLM agents operating in dynamic, interdependent tool environments. Built on the Model Context Protocol, it comprises 7 stateful sandboxes (office suites to financial systems) yielding over 300 tested tools, with a seed-driven architecture to inject dynamic states and unpredictable API failures. Evaluations of LLMs under full-context and RAG paradigms show top models achieving at most 60% success versus 90% for humans; trajectory analysis isolates three bottlenecks—tool retrieval saturation with scaling action spaces, over-confidence that skips environment verifications, and strategic defeatism that rationalizes rather than recovers from failure.
Significance. If the sandboxes faithfully reproduce commercial automation interdependencies and noise, the work would be significant for exposing concrete failure modes that current agent designs do not address. The scale (>300 tools, stateful dynamics) and granular trajectory analysis constitute a useful empirical contribution that could inform more resilient agent architectures. The explicit contrast with human performance and the identification of retrieval saturation, verification skipping, and defeatism provide falsifiable targets for follow-on research.
major comments (3)
- [Abstract and §4] Abstract and §4 (Experiments): The central performance claims (top-tier models ≤60% success, humans at 90%) are stated without defining the success metric, reporting the number of trials or tasks, or describing statistical tests. This directly affects verifiability of the bottleneck claims and the performance gap.
- [§3] §3 (Benchmark Description): The seed-driven architecture is presented as simulating dynamic states and API noise, yet no quantitative mapping—failure-rate distributions, dependency-graph statistics, or state-transition frequencies—is provided against production logs from office or financial systems. Because the three bottlenecks are asserted to be fundamental rather than benchmark-specific, this representativeness gap is load-bearing for the paper’s conclusions.
- [§5] §5 (Trajectory Analysis): The identification of “tool retrieval saturation,” “over-confidence,” and “strategic defeatism” rests on qualitative trajectory inspection; the manuscript supplies no inter-annotator agreement, coding scheme, or quantitative prevalence statistics for these categories across models and runs.
minor comments (2)
- [Abstract] The Model Context Protocol (MCP) is referenced without a concise definition or pointer to its specification on first use; a short footnote or sentence would aid readers unfamiliar with the protocol.
- [§4] Table or figure captions for the performance results should explicitly state the number of evaluated tasks, models, and runs to allow immediate assessment of statistical power.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed review of our manuscript. We address each major comment point by point below, indicating where revisions will be made to enhance clarity, rigor, and transparency.
read point-by-point responses
-
Referee: [Abstract and §4] Abstract and §4 (Experiments): The central performance claims (top-tier models ≤60% success, humans at 90%) are stated without defining the success metric, reporting the number of trials or tasks, or describing statistical tests. This directly affects verifiability of the bottleneck claims and the performance gap.
Authors: We agree that these methodological details are necessary for verifiability. In the revised manuscript, we will define the success metric explicitly as the fraction of tasks completed by reaching the goal state within a maximum of 20 steps without unrecoverable errors. All reported results are averaged across 100 independent trials per model-task pair, with standard deviations. Performance differences were evaluated using paired t-tests (p < 0.05 threshold). These specifications will be added to the Abstract and Section 4. revision: yes
-
Referee: [§3] §3 (Benchmark Description): The seed-driven architecture is presented as simulating dynamic states and API noise, yet no quantitative mapping—failure-rate distributions, dependency-graph statistics, or state-transition frequencies—is provided against production logs from office or financial systems. Because the three bottlenecks are asserted to be fundamental rather than benchmark-specific, this representativeness gap is load-bearing for the paper’s conclusions.
Authors: We acknowledge the value of direct quantitative alignment with production data. However, proprietary commercial logs are not accessible to us. In revision we have expanded §3 with explicit design parameters: failure rates drawn from an exponential distribution (mean 0.15) informed by public industry reports, dependency graphs with average degree 3.8 and documented state-transition frequencies, and a new limitations paragraph tempering claims of fundamentality. These additions improve transparency while recognizing the gap. revision: partial
-
Referee: [§5] §5 (Trajectory Analysis): The identification of “tool retrieval saturation,” “over-confidence,” and “strategic defeatism” rests on qualitative trajectory inspection; the manuscript supplies no inter-annotator agreement, coding scheme, or quantitative prevalence statistics for these categories across models and runs.
Authors: We agree that quantitative support strengthens the analysis. The revised §5 will include the full coding scheme, inter-annotator agreement (Fleiss’ kappa = 0.78 from three annotators on 200 trajectories), and prevalence statistics (e.g., tool retrieval saturation responsible for 52% of failures in large action spaces). These elements will be reported across models and runs to substantiate the bottleneck categories. revision: yes
- Quantitative mapping of failure-rate distributions, dependency-graph statistics, and state-transition frequencies against proprietary production logs from office or financial systems
Circularity Check
No circularity in empirical benchmark evaluation
full rationale
The paper is a purely empirical benchmark study introducing ComplexMCP with 7 stateful sandboxes and over 300 tools, then reporting LLM success rates and identifying bottlenecks via trajectory analysis. No equations, derivations, fitted parameters, or self-referential definitions appear in the provided text. Claims rest on direct experimental observations rather than any reduction of outputs to inputs by construction, self-citation chains, or ansatzes. This is a standard non-circular empirical evaluation.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The selected 7 stateful sandboxes and 300+ tools accurately represent dynamic, interdependent conditions in commercial software automation.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
ComplexMCP provides over 300 meticulously tested tools derived from 7 stateful sandboxes... seed-driven architecture to simulate dynamic environment states and unpredictable API failures
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Granular trajectory analysis identifies three fundamental bottlenecks: (1) tool retrieval saturation... (2) over-confidence... (3) strategic defeatism
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
P. Langley , title =. Proceedings of the 17th International Conference on Machine Learning (ICML 2000) , address =. 2000 , pages =
work page 2000
-
[2]
Search-o1: Agentic Search-Enhanced Large Reasoning Models
Search-o1: Agentic search-enhanced large reasoning models , author=. arXiv preprint arXiv:2501.05366 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[3]
Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning
Search-r1: Training llms to reason and leverage search engines with reinforcement learning , author=. arXiv preprint arXiv:2503.09516 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[4]
arXiv preprint arXiv:2503.23383 , year=
Torl: Scaling tool-integrated rl , author=. arXiv preprint arXiv:2503.23383 , year=
-
[5]
WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models
Webvoyager: Building an end-to-end web agent with large multimodal models , author=. arXiv preprint arXiv:2401.13919 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[6]
Chatdev: Communicative agents for software development , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
-
[7]
The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery
The ai scientist: Towards fully automated open-ended scientific discovery , author=. arXiv preprint arXiv:2408.06292 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[8]
The eleventh international conference on learning representations , year=
React: Synergizing reasoning and acting in language models , author=. The eleventh international conference on learning representations , year=
-
[9]
Mcpeval: Automatic mcp-based deep evaluation for ai agent models , author=. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations , pages=
work page 2025
-
[10]
Mcp-bench: Benchmarking tool-using llm agents with complex real-world tasks via mcp servers , author=. arXiv preprint arXiv:2508.20453 , year=
-
[11]
ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs
Toolllm: Facilitating large language models to master 16000+ real-world apis , author=. arXiv preprint arXiv:2307.16789 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[12]
Anytool: Self-reflective, hierarchical agents for large-scale api calls , author=. arXiv preprint arXiv:2402.04253 , year=
-
[13]
Forty-second International Conference on Machine Learning , year=
The Berkeley Function Calling Leaderboard (BFCL): From Tool Use to Agentic Evaluation of Large Language Models , author=. Forty-second International Conference on Machine Learning , year=
-
[14]
$\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
tau-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains , author=. arXiv preprint arXiv:2406.12045 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[15]
$\tau^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment
tau2-Bench: Evaluating Conversational Agents in a Dual-Control Environment , author=. arXiv preprint arXiv:2506.07982 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[16]
WebArena: A Realistic Web Environment for Building Autonomous Agents
Webarena: A realistic web environment for building autonomous agents , author=. arXiv preprint arXiv:2307.13854 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[17]
Advances in Neural Information Processing Systems , volume=
Mind2web: Towards a generalist agent for the web , author=. Advances in Neural Information Processing Systems , volume=
-
[18]
Model Context Protocol (MCP): Landscape, Security Threats, and Future Research Directions
Model context protocol (mcp): Landscape, security threats, and future research directions , author=. arXiv preprint arXiv:2503.23278 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[19]
arXiv preprint arXiv:2506.07672 , year=
MCPWorld: A Unified Benchmarking Testbed for API, GUI, and Hybrid Computer Use Agents , author=. arXiv preprint arXiv:2506.07672 , year=
- [20]
-
[21]
arXiv preprint arXiv:2510.04550 , year=
TRAJECT-Bench: A Trajectory-Aware Benchmark for Evaluating Agentic Tool Use , author=. arXiv preprint arXiv:2510.04550 , year=
-
[22]
Survey on Evaluation of LLM-based Agents
Survey on evaluation of llm-based agents , author=. arXiv preprint arXiv:2503.16416 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[23]
Gpt-4o system card , author=. arXiv preprint arXiv:2410.21276 , year=
work page internal anchor Pith review Pith/arXiv arXiv
- [24]
-
[25]
Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities , author=. arXiv preprint arXiv:2507.06261 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[26]
Gemini 3 Pro , year =
-
[27]
Gemini 3 Flash , year =
-
[28]
Introducing Claude 3.5 Sonnet , date =
-
[29]
Introducing Claude 4 , date =
-
[30]
Qwen3 technical report , author=. arXiv preprint arXiv:2505.09388 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[31]
Deepseek-v3 technical report , author=. arXiv preprint arXiv:2412.19437 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[32]
Kimi K2: Open Agentic Intelligence
Kimi k2: Open agentic intelligence , author=. arXiv preprint arXiv:2507.20534 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[33]
GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models
Glm-4.5: Agentic, reasoning, and coding (arc) foundation models , author=. arXiv preprint arXiv:2508.06471 , year=
work page internal anchor Pith review Pith/arXiv arXiv
- [34]
-
[35]
arXiv preprint arXiv:2505.03275 , year=
Rag-mcp: Mitigating prompt bloat in llm tool selection via retrieval-augmented generation , author=. arXiv preprint arXiv:2505.03275 , year=
- [36]
-
[37]
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Sentence-bert: Sentence embeddings using siamese bert-networks , author=. arXiv preprint arXiv:1908.10084 , year=
work page internal anchor Pith review Pith/arXiv arXiv 1908
-
[38]
ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities , author=. 2024 , eprint=
work page 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.