Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.
Toolsandbox: A stateful, conversational, interactive evaluation benchmark for llm tool use capabilities
9 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 2polarities
background 2representative citing papers
Cutscene Agent uses a multi-agent LLM system and a new toolkit for game engine control to automate end-to-end 3D cutscene generation, evaluated on the introduced CutsceneBench.
COMPASS benchmark shows LLM agents reach 70-90% feasibility but only 20-60% optimality on constrained travel planning tasks, attributing the gap to insufficient search space exploration rather than tool use.
τ²-bench provides a Dec-POMDP-based telecom domain with compositional task generation and a tool-constrained user simulator to measure agent performance drops in dual-control versus single-control settings.
MCP lifecycle is defined with four phases and 16 activities; a threat taxonomy of 16 scenarios is constructed, validated via case studies, and paired with phase-specific safeguards.
Defines agentic trustworthiness via five properties and proposes HAAF, a scenario-distribution framework with a Trustworthy Optimization Factory that transfers interventions across 13 models from seven families on a 100-scenario suite.
AgentHarm benchmark shows leading LLMs comply with malicious agent requests and simple jailbreaks enable coherent harmful multi-step execution while retaining capabilities.
The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.
citing papers explorer
-
Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety
Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.
-
Cutscene Agent: An LLM Agent Framework for Automated 3D Cutscene Generation
Cutscene Agent uses a multi-agent LLM system and a new toolkit for game engine control to automate end-to-end 3D cutscene generation, evaluated on the introduced CutsceneBench.
-
COMPASS: Benchmarking Constrained Optimization in LLM Agents
COMPASS benchmark shows LLM agents reach 70-90% feasibility but only 20-60% optimality on constrained travel planning tasks, attributing the gap to insufficient search space exploration rather than tool use.
-
$\tau^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment
τ²-bench provides a Dec-POMDP-based telecom domain with compositional task generation and a tool-constrained user simulator to measure agent performance drops in dual-control versus single-control settings.
-
Model Context Protocol (MCP): Landscape, Security Threats, and Future Research Directions
MCP lifecycle is defined with four phases and 16 activities; a threat taxonomy of 16 scenarios is constructed, validated via case studies, and paired with phase-specific safeguards.
-
Beyond Benchmark Islands: Toward Representative Trustworthiness Evaluation for Agentic AI
Defines agentic trustworthiness via five properties and proposes HAAF, a scenario-distribution framework with a Trustworthy Optimization Factory that transfers interventions across 13 models from seven families on a 100-scenario suite.
-
AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents
AgentHarm benchmark shows leading LLMs comply with malicious agent requests and simple jailbreaks enable coherent harmful multi-step execution while retaining capabilities.
-
A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence
The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.
- The A-R Behavioral Space: Execution-Level Profiling of Tool-Using Language Model Agents in Organizational Deployment