Mcptoolbench++: A large scale ai agent model context protocol mcp tool use benchmark

Shiqing Fan, Xichen Ding, Liang Zhang, Linjian Mo · 2025 · arXiv 2508.07575

4 Pith papers cite this work. Polarity classification is still indexing.

4 Pith papers citing it

read on arXiv browse 4 citing papers

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

PAGER: Bridging the Semantic-Execution Gap in Point-Precise Geometric GUI Control

cs.AI · 2026-05-15 · unverdicted · novelty 7.0

PAGER achieves 4.1x higher task success in point-precise geometric GUI control by combining topology-aware planning with precision-aligned reinforcement learning on the new PAGE Bench dataset of 4,906 problems.

TOBench: A Task-Oriented Omni-Modal Benchmark for Real-World Tool-Using Agents

cs.AI · 2026-05-16 · unverdicted · novelty 5.0

MM-ToolBench introduces 100 closed-loop multimodal tasks across two domains with 27 MCP servers and 324 tools, where agents must execute, inspect artifacts, and revise before final output.

From Language to Action: Enhancing LLM Task Efficiency with Task-Aware MCP Server Recommendation

cs.SE · 2026-04-19 · unverdicted · novelty 5.0

Introduces Task2MCP dataset and T2MRec model for recommending MCP servers to LLM agents based on task semantics and engineering constraints.

Bounded Autonomy for Enterprise AI: Typed Action Contracts and Consumer-Side Execution

cs.SE · 2026-04-16 · conditional · novelty 5.0

Bounded autonomy using typed action contracts and consumer-side execution lets LLMs safely operate enterprise systems, achieving 23 of 25 tasks with zero unsafe executions versus 17 for unconstrained AI across 25 trials.

citing papers explorer

Showing 4 of 4 citing papers.

PAGER: Bridging the Semantic-Execution Gap in Point-Precise Geometric GUI Control cs.AI · 2026-05-15 · unverdicted · none · ref 7
PAGER achieves 4.1x higher task success in point-precise geometric GUI control by combining topology-aware planning with precision-aligned reinforcement learning on the new PAGE Bench dataset of 4,906 problems.
TOBench: A Task-Oriented Omni-Modal Benchmark for Real-World Tool-Using Agents cs.AI · 2026-05-16 · unverdicted · none · ref 21
MM-ToolBench introduces 100 closed-loop multimodal tasks across two domains with 27 MCP servers and 324 tools, where agents must execute, inspect artifacts, and revise before final output.
From Language to Action: Enhancing LLM Task Efficiency with Task-Aware MCP Server Recommendation cs.SE · 2026-04-19 · unverdicted · none · ref 70
Introduces Task2MCP dataset and T2MRec model for recommending MCP servers to LLM agents based on task semantics and engineering constraints.
Bounded Autonomy for Enterprise AI: Typed Action Contracts and Consumer-Side Execution cs.SE · 2026-04-16 · conditional · none · ref 7
Bounded autonomy using typed action contracts and consumer-side execution lets LLMs safely operate enterprise systems, achieving 23 of 25 tasks with zero unsafe executions versus 17 for unconstrained AI across 25 trials.

Mcptoolbench++: A large scale ai agent model context protocol mcp tool use benchmark

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer