MCPToolBench++: A Large Scale AI Agent Model Context Protocol MCP Tool Use Benchmark

MCPToolBench++: A Large Scale AI Agent Model Context Protocol MCP Tool Use Benchmark · 2025 · arXiv 2508.07575

6 Pith papers cite this work. Polarity classification is still indexing.


citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

PAGER: Bridging the Semantic-Execution Gap in Point-Precise Geometric GUI Control

cs.AI · 2026-05-15 · unverdicted · novelty 7.0

PAGER achieves 4.1x higher task success in point-precise geometric GUI control by combining topology-aware planning with precision-aligned reinforcement learning on the new PAGE Bench dataset of 4,906 problems.

Notation Matters: A Benchmark Study of Token-Optimized Formats in Agentic AI Systems

cs.AI · 2026-05-28 · conditional · novelty 6.0

TRON cuts tokens up to 27% with accuracy within 14pp of JSON on agentic benchmarks while TOON reaches 18% savings but triggers multi-turn parsing failures and parallel-call collapse on most models.

Agentic Environment Engineering for Large Language Models: A Survey of Environment Modeling, Synthesis, Evaluation, and Application

cs.CL · 2026-06-10 · unverdicted · novelty 5.0

This survey categorizes agentic environments for LLMs by eight attributes and domains, introduces symbolic and neural synthesis paradigms with evaluation, and outlines four agent evolution pathways plus three environment evolution paradigms.

TOBench: A Task-Oriented Omni-Modal Benchmark for Real-World Tool-Using Agents

cs.AI · 2026-05-16 · unverdicted · novelty 5.0

MM-ToolBench introduces 100 closed-loop multimodal tasks across two domains with 27 MCP servers and 324 tools, where agents must execute, inspect artifacts, and revise before final output.

From Language to Action: Enhancing LLM Task Efficiency with Task-Aware MCP Server Recommendation

cs.SE · 2026-04-19 · unverdicted · novelty 5.0

Introduces Task2MCP dataset and T2MRec model for recommending MCP servers to LLM agents based on task semantics and engineering constraints.

Bounded Autonomy for Enterprise AI: Typed Action Contracts and Consumer-Side Execution

cs.SE · 2026-04-16 · conditional · novelty 5.0

Bounded autonomy using typed action contracts and consumer-side execution lets LLMs safely operate enterprise systems, achieving 23 of 25 tasks with zero unsafe executions versus 17 for unconstrained AI across 25 trials.

citing papers explorer

Showing 2 of 2 citing papers after filters.

Notation Matters: A Benchmark Study of Token-Optimized Formats in Agentic AI Systems

cs.AI · 2026-05-28 · conditional · none · ref 3

TRON cuts tokens up to 27% with accuracy within 14pp of JSON on agentic benchmarks while TOON reaches 18% savings but triggers multi-turn parsing failures and parallel-call collapse on most models.

Bounded Autonomy for Enterprise AI: Typed Action Contracts and Consumer-Side Execution

cs.SE · 2026-04-16 · conditional · none · ref 7

Bounded autonomy using typed action contracts and consumer-side execution lets LLMs safely operate enterprise systems, achieving 23 of 25 tasks with zero unsafe executions versus 17 for unconstrained AI across 25 trials.
