STE is a differentiable method to compute continuous analogues of the Top Cycle and Uncovered Set from pairwise comparison data for stable set-valued evaluation of cyclic agent interactions.
Evaluation and Benchmarking of LLM Agents: A Survey , url=
7 Pith papers cite this work. Polarity classification is still indexing.
years
2026 7representative citing papers
SocialGrid benchmark shows even top LLMs achieve below 60% in embodied planning and task completion, with deception detection near random chance regardless of model scale.
CivBench trains models on turn-level states in Civilization V to predict victory probabilities, providing a progress-based evaluation of LLM strategic capabilities across 307 games with 7 models.
Defines agentic trustworthiness via five properties and proposes HAAF, a scenario-distribution framework with a Trustworthy Optimization Factory that transfers interventions across 13 models from seven families on a 100-scenario suite.
A 3B model with few-shot prompting reaches 79.7% of GPT-5 tool-use performance while a hypernetwork adaptation adds zero measurable benefit across four benchmarks.