Orak is a foundational benchmark providing training data, interfaces, and evaluation tools for LLM agents across diverse video game genres.
How far are we on the decision-making of llms? evaluating llms’ gaming ability in multi-agent environments
8 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 8roles
background 3polarities
background 3representative citing papers
MALLM-GAN uses multi-agent LLMs to emulate GAN architecture for generating higher-quality synthetic tabular data from small samples than prior models, while preserving privacy.
CAGE uses common-agency games and an EPEC algorithm to compute equilibrium policies that balance multiple conflicting objectives for test-time LLM alignment.
A small set of sparse autoencoder features in LLMs drives shifts between generous and selfish allocations in dictator games, with causal patching and steering confirming their role and generalization to other social games.
Role-based personas in multi-agent LLM systems suppress payoff-aligned behavior, shifting equilibrium selection by up to 90 percentage points in Tragedy of the Commons versus Green Transition scenarios even with full payoff information.
VS-Bench is a new benchmark of ten visual multi-agent environments that measures VLMs on element recognition, next-action prediction, and normalized episode return, showing strong perception but large gaps in reasoning and decision-making with the best model at 46.6% prediction accuracy and 31.4% of
The paper organizes research on generalist game AI into Dataset, Model, Harness, and Benchmark pillars and charts a five-level progression from single-game mastery to agents that create and live inside game multiverses.
The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
citing papers explorer
-
Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games
Orak is a foundational benchmark providing training data, interfaces, and evaluation tools for LLM agents across diverse video game genres.
-
MALLM-GAN: Multi-Agent Large Language Model as Generative Adversarial Network for Synthesizing Tabular Data
MALLM-GAN uses multi-agent LLMs to emulate GAN architecture for generating higher-quality synthetic tabular data from small samples than prior models, while preserving privacy.
-
Common-agency Games for Multi-Objective Test-Time Alignment
CAGE uses common-agency games and an EPEC algorithm to compute equilibrium policies that balance multiple conflicting objectives for test-time LLM alignment.
-
Understanding the Mechanism of Altruism in Large Language Models
A small set of sparse autoencoder features in LLMs drives shifts between generous and selfish allocations in dictator games, with causal patching and steering confirming their role and generalization to other social games.
-
When Identity Overrides Incentives: Representational Choices as Governance Decisions in Multi-Agent LLM Systems
Role-based personas in multi-agent LLM systems suppress payoff-aligned behavior, shifting equilibrium selection by up to 90 percentage points in Tragedy of the Commons versus Green Transition scenarios even with full payoff information.
-
VS-Bench: Evaluating VLMs for Strategic Abilities in Multi-Agent Environments
VS-Bench is a new benchmark of ten visual multi-agent environments that measures VLMs on element recognition, next-action prediction, and normalized episode return, showing strong perception but large gaps in reasoning and decision-making with the best model at 46.6% prediction accuracy and 31.4% of
-
Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse
The paper organizes research on generalist game AI into Dataset, Model, Harness, and Benchmark pillars and charts a five-level progression from single-game mastery to agents that create and live inside game multiverses.
-
Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.