Evaluation and Benchmarking of LLM Agents: A Survey

Mahmoud Mohammadi, Yipeng Li, Jane Lo, Wendy Yip · 2025 · arXiv 1896.373657

7 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

cs.AI · 2026-04-06 · unverdicted · novelty 7.0

STE is a differentiable method to compute continuous analogues of the Top Cycle and Uncovered Set from pairwise comparison data for stable set-valued evaluation of cyclic agent interactions.

SocialGrid: A Benchmark for Planning and Social Reasoning in Embodied Multi-Agent Systems

cs.AI · 2026-04-17 · unverdicted · novelty 6.0

SocialGrid benchmark shows even top LLMs achieve below 60% in embodied planning and task completion, with deception detection near random chance regardless of model scale.

CivBench: Progress-Based Evaluation for LLMs' Strategic Decision-Making in Civilization V

cs.AI · 2026-04-09 · unverdicted · novelty 6.0

CivBench trains models on turn-level states in Civilization V to predict victory probabilities, providing a progress-based evaluation of LLM strategic capabilities across 307 games with 7 models.

Beyond Benchmark Islands: Toward Representative Trustworthiness Evaluation for Agentic AI

cs.CL · 2026-03-16 · unverdicted · novelty 6.0

Defines agentic trustworthiness via five properties and proposes HAAF, a scenario-distribution framework with a Trustworthy Optimization Factory that transfers interventions across 13 models from seven families on a 100-scenario suite.

Meta-Tool: Efficient Few-Shot Tool Adaptation for Small Language Models

cs.CL · 2026-04-22 · unverdicted · novelty 4.0

A 3B model with few-shot prompting reaches 79.7% of GPT-5 tool-use performance while a hypernetwork adaptation adds zero measurable benefit across four benchmarks.

EngiAI: A Multi-Agent Framework and Benchmark Suite for LLM-Driven Engineering Design

cs.AI · 2026-05-19

PQR: A Framework to Generate Diverse and Realistic User Queries that Elicit QA Agent Failures

cs.CL · 2026-05-15

citing papers explorer

Showing 2 of 2 citing papers after filters.

EngiAI: A Multi-Agent Framework and Benchmark Suite for LLM-Driven Engineering Design · cs.AI · 2026-05-19 · unreviewed · ref 30
PQR: A Framework to Generate Diverse and Realistic User Queries that Elicit QA Agent Failures · cs.CL · 2026-05-15 · unreviewed · ref 35
