GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations

2024 · arXiv 2402.12348

10 Pith papers cite this work. Polarity classification is still indexing.


citation-role summary

background: 1 · dataset: 1

citation-polarity summary

background: 2

representative citing papers

BehaviorBench: Benchmarking Foundation Models for Behavioral Science Tasks

cs.CL · 2026-06-23 · unverdicted · novelty 7.0

BehaviorBench is a benchmark for foundation models on behavioral tasks that reveals fine-tuned behavioral models outperform general models on distributional alignment while general models lead on individual-level accuracy.

Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games

cs.CV · 2026-06-17 · unverdicted · novelty 7.0

RNG-Bench evaluates MLLMs on hidden-observation reconstruction in non-Markov games, finds forgetting as the dominant error source, and shows fine-tuning on optimal rollouts improves performance with transfer to other benchmarks.

Age of LLM: A Strategic 1v1 Benchmark for Reasoning, Diplomacy and Reliability of Large Language Models under Fog of War

cs.AI · 2026-06-23 · unverdicted · novelty 6.0

Introduces the Age of LLM benchmark, which pits LLMs against each other in a 13x7 grid game with fog of war, diplomacy, and JSON reliability constraints, and reports nuclear rush dominance in 54 matches and a weak reliability-win link.

DemoEvolve: Overcoming Sparse Feedback in Agentic Harness Evolution with Demonstrations

cs.AI · 2026-05-23 · unverdicted · novelty 6.0

DemoEvolve bootstraps harness evolution with demonstrations to achieve more stable and effective edits than self-rollout search in sparse-feedback environments like Balatro.

Common-agency Games for Multi-Objective Test-Time Alignment

cs.GT · 2026-05-08 · unverdicted · novelty 6.0

CAGE uses common-agency games and an EPEC algorithm to compute equilibrium policies that balance multiple conflicting objectives for test-time LLM alignment.

VS-Bench: Evaluating VLMs for Strategic Abilities in Multi-Agent Environments

cs.AI · 2025-06-03 · unverdicted · novelty 6.0

VS-Bench is a new benchmark of ten visual multi-agent environments that measures VLMs on element recognition, next-action prediction, and normalized episode return, showing strong perception but large gaps in reasoning and decision-making with the best model at 46.6% prediction accuracy and 31.4% of

Agentic Environment Engineering for Large Language Models: A Survey of Environment Modeling, Synthesis, Evaluation, and Application

cs.CL · 2026-06-10 · unverdicted · novelty 5.0

This survey categorizes agentic environments for LLMs by eight attributes and domains, introduces symbolic and neural synthesis paradigms with evaluation, and outlines four agent evolution pathways plus three environment evolution paradigms.

Robots Need More than VLA and World Models

cs.RO · 2026-06-04 · unverdicted · novelty 5.0

The paper identifies four missing interfaces (data autolabelling, embodiment retargeting, physics-grounded world models, and video-based reward inference) as the central bottleneck beyond VLA scaling for robot intelligence.

Strategic Heterogeneous Multi-Agent Architecture for Cost-Effective Code Vulnerability Detection

cs.CR · 2026-04-23 · unverdicted · novelty 5.0

A game-theoretic heterogeneous multi-agent architecture with three cloud LLMs and a local verifier achieves 77.2% F1, 100% recall, and 3x speedup for code vulnerability detection at $0.002 per sample on the NIST Juliet suite.

Towards Reliable and Robust LLM Planning: Symbolic Feedback-Driven Iterative Self-Refinement Framework

cs.AI · 2026-06-26 · unverdicted · novelty 4.0

Proposes a symbolic feedback-driven iterative self-refinement framework for LLM long-horizon planning that maps symbols to natural language, uses a verifier for error correction, and a plan recognizer for goal reachability, with abstract-level claims of improved feasibility and correctness.

citing papers explorer

BehaviorBench: Benchmarking Foundation Models for Behavioral Science Tasks · cs.CL · 2026-06-23 · unverdicted · none · ref 23
Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games · cs.CV · 2026-06-17 · unverdicted · none · ref 20
Age of LLM: A Strategic 1v1 Benchmark for Reasoning, Diplomacy and Reliability of Large Language Models under Fog of War · cs.AI · 2026-06-23 · unverdicted · none · ref 10
DemoEvolve: Overcoming Sparse Feedback in Agentic Harness Evolution with Demonstrations · cs.AI · 2026-05-23 · unverdicted · none · ref 5
Common-agency Games for Multi-Objective Test-Time Alignment · cs.GT · 2026-05-08 · unverdicted · none · ref 48
VS-Bench: Evaluating VLMs for Strategic Abilities in Multi-Agent Environments · cs.AI · 2025-06-03 · unverdicted · none · ref 18
Agentic Environment Engineering for Large Language Models: A Survey of Environment Modeling, Synthesis, Evaluation, and Application · cs.CL · 2026-06-10 · unverdicted · none · ref 120
Robots Need More than VLA and World Models · cs.RO · 2026-06-04 · unverdicted · none · ref 143
Strategic Heterogeneous Multi-Agent Architecture for Cost-Effective Code Vulnerability Detection · cs.CR · 2026-04-23 · unverdicted · none · ref 6
Towards Reliable and Robust LLM Planning: Symbolic Feedback-Driven Iterative Self-Refinement Framework · cs.AI · 2026-06-26 · unverdicted · none · ref 3
