Orak is a foundational benchmark providing training data, interfaces, and evaluation tools for LLM agents across diverse video game genres.
DSGBench: A Diverse Strategic Game Benchmark for Evaluating LLM-based Agents in Complex Decision-Making Environments
3 Pith papers cite this work. Polarity classification is still indexing.
abstract
Large language model (LLM)-based agents are increasingly applied to complex strategic environments that demand long-horizon reasoning, multi-agent interaction, and decision-making under uncertainty. However, common existing benchmarks either assess isolated skills, lack environmental diversity, or rely on broad overall metrics. To address these issues, we introduce DSGBench, a more rigorous evaluation platform for strategic decision-making tasks. Firstly, it incorporates six complex strategic games which serve as ideal testbeds due to their long-term and multi-dimensional decision-making demands and flexibility in customizing tasks with various difficulty levels and targets. Secondly, DSGBench employs a fine-grained evaluation scoring system which examines the decision-making capabilities by looking into the performance in five specific dimensions, offering a comprehensive assessment in a better-designed fashion. Furthermore, DSGBench also incorporates an automated decision-tracking mechanism which enables in-depth analysis of agent behaviour patterns and the turning points in their strategies. We evaluate six popular LLM agents, including open-source and closed-source models, and observe distinct strengths and limitations among various tasks. Through decision trajectory analysis, we further identify systemic limitations in different LLMs. These findings offer valuable insights for model selection and future LLM-based agent development.
citation-role summary
citation-polarity summary
roles
dataset 1polarities
background 1representative citing papers
LATTICE is a scalable LLM-judge benchmark for crypto agent decision support that reveals performance trade-offs among real-world copilots across dimensions and tasks.
A survey consolidating benchmarks, agent frameworks, real-world applications, and protocols for LLM-based autonomous agents into a proposed taxonomy with recommendations for future research.
citing papers explorer
-
Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games
Orak is a foundational benchmark providing training data, interfaces, and evaluation tools for LLM agents across diverse video game genres.
-
LATTICE: Evaluating Decision Support Utility of Crypto Agents
LATTICE is a scalable LLM-judge benchmark for crypto agent decision support that reveals performance trade-offs among real-world copilots across dimensions and tasks.
-
From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review
A survey consolidating benchmarks, agent frameworks, real-world applications, and protocols for LLM-based autonomous agents into a proposed taxonomy with recommendations for future research.