PAGER achieves 4.1x higher task success in point-precise geometric GUI control by combining topology-aware planning with precision-aligned reinforcement learning on the new PAGE Bench dataset of 4,906 problems.
Mcptoolbench++: A large scale ai agent model context protocol mcp tool use benchmark
6 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 6roles
background 1polarities
background 1representative citing papers
TRON cuts tokens up to 27% with accuracy within 14pp of JSON on agentic benchmarks while TOON reaches 18% savings but triggers multi-turn parsing failures and parallel-call collapse on most models.
This survey categorizes agentic environments for LLMs by eight attributes and domains, introduces symbolic and neural synthesis paradigms with evaluation, and outlines four agent evolution pathways plus three environment evolution paradigms.
MM-ToolBench introduces 100 closed-loop multimodal tasks across two domains with 27 MCP servers and 324 tools, where agents must execute, inspect artifacts, and revise before final output.
Introduces Task2MCP dataset and T2MRec model for recommending MCP servers to LLM agents based on task semantics and engineering constraints.
Bounded autonomy using typed action contracts and consumer-side execution lets LLMs safely operate enterprise systems, achieving 23 of 25 tasks with zero unsafe executions versus 17 for unconstrained AI across 25 trials.
citing papers explorer
-
Notation Matters: A Benchmark Study of Token-Optimized Formats in Agentic AI Systems
TRON cuts tokens up to 27% with accuracy within 14pp of JSON on agentic benchmarks while TOON reaches 18% savings but triggers multi-turn parsing failures and parallel-call collapse on most models.
-
Bounded Autonomy for Enterprise AI: Typed Action Contracts and Consumer-Side Execution
Bounded autonomy using typed action contracts and consumer-side execution lets LLMs safely operate enterprise systems, achieving 23 of 25 tasks with zero unsafe executions versus 17 for unconstrained AI across 25 trials.