VS-Bench: Evaluating VLMs for Strategic Abilities in Multi-Agent Environments

arxiv: 2506.02387 · v3 · submitted 2025-06-03 · 💻 cs.AI

Zelai Xu, Zhexuan Xu, Xiangmin Yi, Huining Yuan, Mo Guang, Kaiwen Long, Xinlei Chen, Yi Wu, Chao Yu, Yu Wang

Pith reviewed 2026-05-19 11:53 UTC · model grok-4.3

classification 💻 cs.AI

keywords vision language models · multi-agent environments · strategic reasoning · benchmark evaluation · decision making · multimodal agents · cooperative and competitive interactions

The pith

Vision-language models show strong perception yet lag significantly in strategic reasoning and decision-making across multi-agent visual environments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces VS-Bench, a multimodal benchmark designed to test VLMs on strategic abilities in settings that combine visual observations with interactions among multiple agents. It features ten vision-grounded environments covering cooperative, competitive, and mixed-motive scenarios. Experiments across fifteen leading models find solid performance on element recognition but clear shortfalls in next-action prediction and overall returns, with the strongest model reaching 46.6 percent prediction accuracy and 31.4 percent normalized return. The work standardizes evaluation and identifies limitations to guide development of better multimodal agents.

Core claim

VS-Bench measures VLM performance in multi-agent environments along three axes: perception via element recognition accuracy, strategic reasoning via next-action prediction accuracy, and decision-making via normalized episode return, establishing that current models retain a substantial gap to optimal levels in reasoning and decision-making despite capable perception.
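
As a rough illustration of how the three metrics could be computed, the sketch below assumes that both accuracy metrics are the fraction of exact matches and that normalized return rescales a raw episode return between a random-policy baseline and an optimal-policy reference. The function names and the normalization formula are assumptions made for this example; this page does not state the paper's exact definitions.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of predictions that exactly match their labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("need at least one example")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def normalized_return(raw: float, random_baseline: float, optimal: float) -> float:
    """Rescale a raw return so the random baseline maps to 0 and optimal to 1.

    This min-max style normalization is an assumption, not the paper's formula.
    """
    if optimal == random_baseline:
        raise ValueError("optimal and random baseline must differ")
    return (raw - random_baseline) / (optimal - random_baseline)
```

Under this assumed normalization, a model that matches the random baseline scores 0 and a model that matches the optimal reference scores 1, so a figure such as 31.4 percent would mean the model closes roughly a third of the distance between the two.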

What carries the argument

VS-Bench, a benchmark built from ten vision-grounded environments that evaluate cooperative, competitive, and mixed-motive interactions through the metrics of element recognition accuracy, next-action prediction accuracy, and normalized episode return.

If this is right

Improved strategic performance in these environments would support deploying VLMs as agents in interactive multi-agent applications such as simulations or games.
Documented failure modes can directly inform targeted enhancements to VLM reasoning components.
Human performance data collected in the same environments provides concrete targets for model iteration.
Standardized use of VS-Bench could accelerate systematic progress on multimodal strategic agents.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Adding environments with greater scale or partial observability could expose further limits in current VLM strategies.
Looping VS-Bench evaluations into VLM training might narrow the observed gaps over successive model versions.
The visual emphasis implies that purely text-based strategic benchmarks may miss key multimodal interaction challenges.

Load-bearing premise

The ten chosen vision-grounded environments and the three chosen metrics of element recognition, next-action prediction, and normalized return serve as valid proxies for strategic abilities in real-world multi-agent settings.

What would settle it

Demonstrating that one or more VLMs reach near-optimal normalized returns and substantially higher next-action prediction accuracy across all ten environments would indicate the reported gap is smaller than claimed.

Figures

Figures reproduced from arXiv: 2506.02387 by Chao Yu, Huining Yuan, Kaiwen Long, Mo Guang, Xiangmin Yi, Xinlei Chen, Yi Wu, Yu Wang, Zelai Xu, Zhexuan Xu.

**Figure 2.** Overview of VS-BENCH, a multimodal benchmark for evaluating VLMs in multi-agent environments. We evaluate fourteen state-of-the-art models in eight vision-grounded environments with two complementary dimensions, including offline evaluation of strategic reasoning by next-action prediction accuracy and online evaluation of decision-making by normalized episode return.

**Figure 3.** Comparison of reasoning VLMs on decision-making with multimodal and text-only

**Figure 4.** Comparison of reasoning VLMs and chat VLMs on decision-making with IO and CoT

**Figure 5.** Social behaviors of two reasoning models and the best-performing open-source models in

**Figure 6.** Social behaviors of all models in mixed-motive social dilemma games. Dimensions are

**Figure 7.** Failure case example of strategic reasoning in

**Figure 8.** Failure case example of reasoning in Overcooked. F Failure case examples. F.1 Strategic reasoning. We present three illustrative failure cases in strategic reasoning from different game environments. In Hanabi, VLM agents only observe the other agent’s hands but not their own hands, creating a distinct information asymmetry. An example with visual observation and the VLM’s response is shown in …

**Figure 9.** Failure case example of strategic reasoning in

**Figure 10.** Failure case example of decision-making in

**Figure 11.** Failure case example of decision-making in

**Figure 12.** Failure case example of decision-making in

**Figure 13.** Tiny Hanabi

**Figure 15.** Hanabi

**Figure 17.** Breakthrough

**Figure 20.** Coin Dilemma

**Figure 22.** Battle of the Colors.

Original abstract

Recent advancements in Vision Language Models (VLMs) have expanded their capabilities to interactive agent tasks, yet existing benchmarks remain limited to single-agent or text-only environments. In contrast, real-world scenarios often involve multiple agents interacting within rich visual and textual contexts, posing challenges with both multimodal observations and strategic interactions. To bridge this gap, we introduce Visual Strategic Bench (VS-Bench), a multimodal benchmark that evaluates VLMs for strategic abilities in multi-agent environments. VS-Bench comprises ten vision-grounded environments that cover cooperative, competitive, and mixed-motive interactions. The performance of VLM agents is evaluated across three dimensions: perception measured by element recognition accuracy; strategic reasoning measured by next-action prediction accuracy; and decision-making measured by normalized episode return. Extensive experiments on fifteen leading VLMs show that, although current models exhibit strong perception abilities, there remains a significant gap to optimal performance in reasoning and decision-making, with the best-performing model attaining 46.6% prediction accuracy and 31.4% normalized return. We further analyze the key factors influencing performance, conduct human experiments, and examine failure modes to provide a deeper understanding of VLMs' strategic abilities. By standardizing the evaluation and highlighting the limitations of existing models, we envision VS-Bench as a foundation for future research on strategic multimodal agents. Code and data are available at https://vs-bench.github.io.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

VS-Bench adds ten new vision-based multi-agent environments and shows a perception-reasoning gap in VLMs, but the metrics may not cleanly separate strategy from task artifacts.

The main thing to know is that this paper introduces VS-Bench with ten vision-grounded environments covering cooperative, competitive, and mixed-motive settings, then tests fifteen VLMs and finds solid element recognition but much weaker next-action prediction and normalized returns, with the best model at 46.6% and 31.4% respectively. They also run human comparisons and look at failure modes, which gives a practical baseline picture beyond single-agent or text-only tests. Code and data are released, which helps anyone who wants to extend the work. That combination of new environments plus the three-way split into perception, reasoning, and decision-making is the clearest addition here. The experiments are straightforward and cover a range of current models, so the numbers provide a starting point for tracking progress on multimodal agents. The environments themselves seem designed to require ongoing interactions rather than isolated visual questions.

On the softer side, the claim that next-action prediction isolates strategic reasoning rests on the assumption that the tasks induce genuine multi-agent dynamics and that the metrics do not pick up prompt sensitivity or environment-specific shortcuts. The abstract does not spell out whether agents receive raw visuals in closed-loop play or privileged summaries, nor does it show that prediction accuracy tracks actual returns under full rollouts. Statistical details like variance or significance tests are also light in the summary, which makes it harder to judge how stable the gap really is. If those checks are stronger in the full text, the results hold up better; otherwise the separation between abilities could be partly an artifact of the interface.

This is aimed at groups building interactive VLMs who need a multimodal multi-agent testbed to measure incremental gains. A reader working on agent benchmarks or strategic decision-making in visual settings would get usable environments and initial numbers from it.

I would send it for peer review. The benchmark is new enough and the experiments broad enough to warrant referee input, even if the metric validation could use tightening.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces VS-Bench, a multimodal benchmark consisting of ten vision-grounded multi-agent environments spanning cooperative, competitive, and mixed-motive settings. It evaluates fifteen VLMs using three metrics: element recognition accuracy for perception, next-action prediction accuracy for strategic reasoning, and normalized episode return for decision-making. Results show strong perception but substantial gaps in reasoning and decision-making, with the best model reaching 46.6% prediction accuracy and 31.4% normalized return. The authors additionally analyze influencing factors, compare to human performance, examine failure modes, and release code and data.

Significance. If the environments and metrics validly isolate strategic abilities from perception and task artifacts, the work identifies concrete limitations in current VLMs for multi-agent strategic tasks and offers a standardized testbed for future multimodal agent research. The public release of code and data is a clear strength supporting reproducibility.

major comments (2)

[Abstract] Abstract and Evaluation section: the central claim of a 'significant gap to optimal performance in reasoning and decision-making' rests on next-action prediction and normalized return cleanly separating strategic abilities from perception. The manuscript provides no evidence that next-action labels derive from full POMDP rollouts rather than expert policies or that agents receive only raw visual observations instead of privileged state summaries, leaving open the possibility that the reported gap reflects interface or labeling artifacts.
[Results] Results and Experiments: performance figures for the fifteen models are reported without statistical significance tests, error bars, data-split details, or validation that the three metrics capture strategic ability rather than environment-specific artifacts, weakening the cross-model and cross-environment comparisons.

minor comments (1)

[Abstract] The abstract could briefly indicate the specific VLMs tested or the distribution of environment types to give readers immediate context for the scale of the evaluation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript introducing VS-Bench. We address each major comment below with clarifications and planned revisions to strengthen the presentation of our metrics and results.

Referee: [Abstract] Abstract and Evaluation section: the central claim of a 'significant gap to optimal performance in reasoning and decision-making' rests on next-action prediction and normalized return cleanly separating strategic abilities from perception. The manuscript provides no evidence that next-action labels derive from full POMDP rollouts rather than expert policies or that agents receive only raw visual observations instead of privileged state summaries, leaving open the possibility that the reported gap reflects interface or labeling artifacts.

Authors: We agree that explicit documentation of label generation and observation interfaces is necessary to support the separation of perception from strategic abilities. Next-action labels are produced by executing optimal or near-optimal policies: where tractable we solve the underlying POMDP using standard dynamic programming methods to obtain action sequences that maximize expected return; otherwise we adopt the expert policies supplied with each environment's original implementation. All VLM agents are provided exclusively with raw RGB visual frames and the standard textual observations emitted by the environment APIs, with no privileged state vectors or internal summaries. We have added a dedicated paragraph plus a summary table in the revised Evaluation section that lists the exact source of labels and observations for each of the ten environments.

revision: yes
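
To make the dynamic-programming labeling in the response above concrete, here is a minimal sketch of value iteration followed by greedy action labeling. It is a simplification under stated assumptions: the toy problem is fully observed (an MDP, not a POMDP), and the state names, transition table, and discount factor are hypothetical and invented for this example; none of them come from the paper or its code.

```python
from typing import Dict, List, Tuple

# transitions[state][action] = list of (probability, next_state, reward)
Transitions = Dict[str, Dict[str, List[Tuple[float, str, float]]]]


def q_value(outcomes: List[Tuple[float, str, float]],
            values: Dict[str, float], gamma: float) -> float:
    """Expected one-step reward plus discounted value of the next state."""
    return sum(p * (r + gamma * values[s2]) for p, s2, r in outcomes)


def value_iteration(transitions: Transitions, gamma: float = 0.9,
                    tol: float = 1e-8) -> Dict[str, float]:
    """Apply the Bellman optimality update until values change by less than tol."""
    values = {s: 0.0 for s in transitions}
    while True:
        delta = 0.0
        for s, actions in transitions.items():
            if not actions:  # terminal state keeps value 0
                continue
            best = max(q_value(o, values, gamma) for o in actions.values())
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            return values


def greedy_labels(transitions: Transitions, values: Dict[str, float],
                  gamma: float = 0.9) -> Dict[str, str]:
    """Label each non-terminal state with its highest-value action."""
    return {
        s: max(actions, key=lambda a: q_value(actions[a], values, gamma))
        for s, actions in transitions.items() if actions
    }


# Hypothetical two-state toy problem (illustrative only).
toy: Transitions = {
    "start": {"safe": [(1.0, "end", 1.0)],
              "risky": [(0.5, "end", 3.0), (0.5, "end", 0.0)]},
    "end": {},
}
```

Calling `greedy_labels(toy, value_iteration(toy))` yields one label per non-terminal state; a next-action prediction is then scored as correct when it matches that label.
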
Referee: [Results] Results and Experiments: performance figures for the fifteen models are reported without statistical significance tests, error bars, data-split details, or validation that the three metrics capture strategic ability rather than environment-specific artifacts, weakening the cross-model and cross-environment comparisons.

Authors: We acknowledge that the original results section would benefit from additional statistical detail. In the revision we now report error bars as one standard deviation across five independent evaluation runs per model-environment pair. We include paired t-tests with p-values for key model comparisons. Data splits are described explicitly: next-action prediction uses an environment-wise 70/30 train/test partition with no temporal overlap. To validate that the metrics reflect strategic ability rather than artifacts, we add baseline comparisons against random agents and the same optimal/expert policies used for labeling; these show that both prediction accuracy and normalized return scale monotonically with policy quality across environments. These additions are incorporated into the Results and Experiments sections.

revision: yes
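
The statistics named in this response can be sketched with the Python standard library alone. The example below computes a mean with a one-standard-deviation error bar across runs and the paired t statistic for two models evaluated on matched runs. The scores are placeholder numbers invented for illustration, not results from the paper, and the function names are hypothetical.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_and_std(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation across evaluation runs."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Paired t statistic and degrees of freedom for matched scores a[i], b[i]."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally sized samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Placeholder scores for two hypothetical models over five runs (not real results).
model_a = [0.40, 0.42, 0.38, 0.41, 0.39]
model_b = [0.30, 0.33, 0.31, 0.29, 0.32]
```

Turning the t statistic into a p-value requires the Student t distribution with the returned degrees of freedom, which the standard library does not provide; a statistics package would be needed for that step.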

Circularity Check

0 steps flagged

No circularity: empirical benchmark with direct measurements on external environments and metrics

The paper introduces VS-Bench as a new benchmark with ten vision-grounded multi-agent environments and reports empirical performance of fifteen VLMs using three explicitly defined metrics (element recognition accuracy, next-action prediction accuracy, and normalized episode return). The key results (e.g., best model at 46.6% prediction accuracy and 31.4% normalized return) are obtained through direct experimentation and human comparisons against these externally specified environments and metrics. No equations, derivations, fitted parameters, or self-referential definitions appear; the central claims rest on straightforward empirical evaluation rather than any reduction to inputs by construction or load-bearing self-citations. The evaluation is therefore self-contained against the chosen benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on the domain assumption that the selected environments and metrics validly capture strategic ability; no free parameters or invented entities are introduced.

axioms (1)

domain assumption The ten vision-grounded environments adequately represent cooperative, competitive, and mixed-motive strategic interactions.
Stated directly in the abstract as the composition of VS-Bench.

pith-pipeline@v0.9.0 · 5803 in / 1249 out tokens · 45172 ms · 2026-05-19T11:53:02.056305+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
"VS-Bench comprises eight vision-grounded environments... offline evaluation of strategic reasoning by next-action prediction accuracy and online evaluation of decision-making by normalized episode return."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
"We evaluate fourteen leading VLMs... best-performing model attaining 47.8% prediction accuracy and 24.3% normalized return."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

169 extracted references · 169 canonical work pages · 19 internal anchors

[1]

Llm-coordination: evaluating and analyzing multi-agent coordination abilities in large language models

Saaket Agashe, Yue Fan, Anthony Reyna, and Xin Eric Wang. Llm-coordination: evaluating and analyzing multi-agent coordination abilities in large language models. arXiv preprint arXiv:2310.03903, 2023

work page arXiv 2023
[2]

Claude 3.7 sonnet system card, 2025

Anthropic. Claude 3.7 sonnet system card, 2025

work page 2025
[3]

Vqa: Visual question answering

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision, pages 2425–2433, 2015

work page 2015
[4]

Atari. Pong. Arcade Video Game, 1972

work page 1972
[5]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwenvll: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[6]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

The hanabi challenge: A new frontier for ai research

Nolan Bard, Jakob N Foerster, Sarath Chandar, Neil Burch, Marc Lanctot, H Francis Song, Emilio Parisotto, Vincent Dumoulin, Subhodeep Moitra, Edward Hughes, et al. The hanabi challenge: A new frontier for ai research. Artificial Intelligence, 280:103216, 2020

work page 2020
[8]

The arcade learning environment: An evaluation platform for general agents

Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of artificial intelligence research, 47:253–279, 2013

work page 2013
[9]

Dota 2 with Large Scale Deep Reinforcement Learning

Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław D˛ ebiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, et al. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1912
[10]

Windows agent arena: Evaluating multi-modal os agents at scale.arXiv preprint arXiv:2409.08264,

Rogerio Bonatti, Dan Zhao, Francesco Bonacci, Dillon Dupont, Sara Abdali, Yinheng Li, Yadong Lu, Justin Wagle, Kazuhito Koishida, Arthur Bucker, et al. Windows agent arena: Evaluating multi-modal os agents at scale. arXiv preprint arXiv:2409.08264, 2024

work page arXiv 2024
[11]

Superhuman ai for multiplayer poker

Noam Brown and Tuomas Sandholm. Superhuman ai for multiplayer poker. Science, 365(6456):885–890, 2019

work page 2019
[12]

On the utility of learning about humans for human-ai coordination

Micah Carroll, Rohin Shah, Mark K Ho, Tom Griffiths, Sanjit Seshia, Pieter Abbeel, and Anca Dragan. On the utility of learning about humans for human-ai coordination. Advances in neural information processing systems, 32, 2019

work page 2019
[13]

MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, et al. Mle-bench: Evaluating machine learning agents on machine learning engineering. arXiv preprint arXiv:2410.07095, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[14]

Llmarena: Assessing capabilities of large language models in dynamic multi-agent environments

Junzhe Chen, Xuming Hu, Shuodi Liu, Shiyu Huang, Wei-Wei Tu, Zhaofeng He, and Lijie Wen. Llmarena: Assessing capabilities of large language models in dynamic multi-agent environments. arXiv preprint arXiv:2402.16499, 2024

work page arXiv 2024
[15]

Microsoft COCO Captions: Data Collection and Evaluation Server

Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[16]

Efficient selectivity and backup operators in monte-carlo tree search

Rémi Coulom. Efficient selectivity and backup operators in monte-carlo tree search. In International conference on computers and games, pages 72–83. Springer, 2006

work page 2006
[17]

Gemini 2.5: Our most intelligent ai model, 2025

Google DeepMined. Gemini 2.5: Our most intelligent ai model, 2025

work page 2025
[18]

Gtbench: Uncovering the strategic reasoning limitations of llms via game-theoretic evaluations

Jinhao Duan, Renming Zhang, James Diffenderfer, Bhavya Kailkhura, Lichao Sun, Elias Stengel-Eskin, Mohit Bansal, Tianlong Chen, and Kaidi Xu. Gtbench: Uncovering the strategic reasoning limitations of llms via game-theoretic evaluations. arXiv preprint arXiv:2402.12348, 2024. 11

work page arXiv 2024
[19]

The theory of decision making

Ward Edwards. The theory of decision making. Psychological bulletin, 51(4):380, 1954

work page 1954
[20]

Multi-agent systems: an introduction to distributed artificial intelligence, volume 1

Jacques Ferber and Gerhard Weiss. Multi-agent systems: an introduction to distributed artificial intelligence, volume 1. Addison-wesley Reading, 1999

work page 1999
[21]

Learning with Opponent-Learning Awareness

Jakob N Foerster, Richard Y Chen, Maruan Al-Shedivat, Shimon Whiteson, Pieter Abbeel, and Igor Mordatch. Learning with opponent-learning awareness. arXiv preprint arXiv:1709.04326, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[22]

Game theory

Drew Fudenberg and Jean Tirole. Game theory. MIT press, 1991

work page 1991
[23]

Overcooked, 2016

Ghost Town Games. Overcooked, 2016

work page 2016
[24]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[25]

Benchmarking vision, language, & action models on robotic learning tasks

Pranav Guruprasad, Harshvardhan Sikka, Jaewoo Song, Yangyue Wang, and Paul Pu Liang. Benchmarking vision, language, & action models on robotic learning tasks. arXiv preprint arXiv:2411.05821, 2024

work page arXiv 2024
[26]

WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models

Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. Webvoyager: Building an end-to-end web agent with large multimodal models. arXiv preprint arXiv:2401.13919, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[27]

other-play

Hengyuan Hu, Adam Lerer, Alex Peysakhovich, and Jakob Foerster. “other-play” for zero-shot coordination. In International Conference on Machine Learning, pages 4399–4410. PMLR, 2020

work page 2020
[28]

Language instructed reinforcement learning for human-ai coordination

Hengyuan Hu and Dorsa Sadigh. Language instructed reinforcement learning for human-ai coordination. In International Conference on Machine Learning, pages 13584–13598. PMLR, 2023

work page 2023
[29]

How far are we on the decision- making of llms? evaluating llms’ gaming ability in multi-agent environments.arXiv preprint arXiv:2403.11807, 2024

Jen-tse Huang, Eric John Li, Man Ho Lam, Tian Liang, Wenxuan Wang, Youliang Yuan, Wenxiang Jiao, Xing Wang, Zhaopeng Tu, and Michael R Lyu. How far are we on the decision- making of llms? evaluating llms’ gaming ability in multi-agent environments.arXiv preprint arXiv:2403.11807, 2024

work page arXiv 2024
[30]

VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks

Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, and Daniel Fried. Visualwebarena: Evaluating multimodal agents on realistic visual web tasks. arXiv preprint arXiv:2401.13649, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[31]

arXiv preprint arXiv:2302.02083 , volume=

Michal Kosinski. Theory of mind may have spontaneously emerged in large language models. arXiv preprint arXiv:2302.02083, 4:169, 2023

work page arXiv 2023
[32]

A simplified two-person poker

Harold W Kuhn. A simplified two-person poker. Contributions to the Theory of Games , 1(97-103):2, 1950

work page 1950
[33]

Openspiel: A framework for reinforcement learning in games.arXiv preprint arXiv:1908.09453, 2019

Marc Lanctot, Edward Lockhart, Jean-Baptiste Lespiau, Vinicius Zambaldi, Satyaki Upadhyay, Julien Pérolat, Sriram Srinivasan, Finbarr Timbers, Karl Tuyls, Shayegan Omidshafiei, et al. Openspiel: A framework for reinforcement learning in games. arXiv preprint arXiv:1908.09453, 2019

work page arXiv 1908
[34]

Scalable evaluation of multi-agent reinforcement learning with melting pot

Joel Z Leibo, Edgar A Dueñez-Guzman, Alexander Vezhnevets, John P Agapiou, Peter Sunehag, Raphael Koster, Jayd Matyas, Charlie Beattie, Igor Mordatch, and Thore Graepel. Scalable evaluation of multi-agent reinforcement learning with melting pot. In International conference on machine learning, pages 6187–6199. PMLR, 2021

work page 2021
[35]

Maintaining cooperation in complex social dilemmas using deep reinforcement learning

Adam Lerer and Alexander Peysakhovich. Maintaining cooperation in complex social dilemmas using deep reinforcement learning. arXiv preprint arXiv:1707.01068, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[36]

Mmcode: Benchmarking multimodal large language models for code generation with visually rich pro- gramming problems

Kaixin Li, Yuchen Tian, Qisheng Hu, Ziyang Luo, Zhiyong Huang, and Jing Ma. Mmcode: Benchmarking multimodal large language models for code generation with visually rich pro- gramming problems. arXiv preprint arXiv:2404.09486, 2024. 12

work page arXiv 2024
[37]

On the effects of data scale on ui control agents

Wei Li, William E Bishop, Alice Li, Christopher Rawles, Folawiyo Campbell-Ajala, Divya Tyamagundlu, and Oriana Riva. On the effects of data scale on ui control agents. Advances in Neural Information Processing Systems, 37:92130–92154, 2024

work page 2024
[38]

Markov games as a framework for multi-agent reinforcement learning

Michael L Littman. Markov games as a framework for multi-agent reinforcement learning. In Machine learning proceedings 1994, pages 157–163. Elsevier, 1994

work page 1994
[39]

Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

work page 2023
[40]

Visualagent bench: Towards large multimodal models as visual foundation agents

Xiao Liu, Tianjie Zhang, Yu Gu, Iat Long Iong, Yifan Xu, Xixuan Song, Shudan Zhang, Hanyu Lai, Xinyi Liu, Hanlin Zhao, et al. Visualagentbench: Towards large multimodal models as visual foundation agents. arXiv preprint arXiv:2408.06327, 2024

work page arXiv 2024
[41]

Programming breakthrough

Richard Lorentz and Therese Horey. Programming breakthrough. In International Conference on Computers and Games, pages 49–59. Springer, 2013

work page 2013
[42]

Multi-agent actor-critic for mixed cooperative-competitive environments

Ryan Lowe, Yi I Wu, Aviv Tamar, Jean Harb, OpenAI Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. Advances in neural information processing systems, 30, 2017

work page 2017
[43]

Model-free opponent shaping

Christopher Lu, Timon Willi, Christian A Schroeder De Witt, and Jakob Foerster. Model-free opponent shaping. In International Conference on Machine Learning, pages 14398–14411. PMLR, 2022

work page 2022
[44]

Games and decisions: Introduction and critical survey

R Duncan Luce and Howard Raiffa. Games and decisions: Introduction and critical survey. Wiley, 1957

work page 1957
[45]

Llama 3.2: Revolutionizing edge ai and vision with open, customizable models, 2024

meta. Llama 3.2: Revolutionizing edge ai and vision with open, customizable models, 2024

work page 2024
[46] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013

[47] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015

[48] Matej Moravčík, Martin Schmid, Neil Burch, Viliam Lisý, Dustin Morrill, Nolan Bard, Trevor Davis, Kevin Waugh, Michael Johanson, and Michael Bowling. Deepstack: Expert-level artificial intelligence in heads-up no-limit poker. Science, 356(6337):508–513, 2017

[49] Paul Muller, Shayegan Omidshafiei, Mark Rowland, Karl Tuyls, Julien Perolat, Siqi Liu, Daniel Hennes, Luke Marris, Marc Lanctot, Edward Hughes, et al. A generalized training approach for multiagent learning. arXiv preprint arXiv:1909.12823, 2019

[50] OpenAI. Introducing gpt-4.1 in the api, 2025

[51] OpenAI. Openai o3 and o4-mini system card, 2025

[52] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022

[53] Alexander Peysakhovich and Adam Lerer. Prosocial learning agents solve generalized stag hunts better than selfish ones. arXiv preprint arXiv:1709.02865, 2017

[54] Anand S Rao, Michael P Georgeff, et al. Bdi agents: from theory to practice. In Icmas, volume 95, pages 312–319, 1995

[55] Anatol Rapoport and Albert M Chammah. Prisoner’s dilemma: A study in conflict and cooperation, volume 165. University of Michigan press, 1965

[56] Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, et al. Androidworld: A dynamic benchmarking environment for autonomous agents. arXiv preprint arXiv:2405.14573, 2024

[57] Jean-Jacques Rousseau. A discourse on inequality. Penguin, 1985

[58] Alexander Rutherford, Benjamin Ellis, Matteo Gallici, Jonathan Cook, Andrei Lupu, Garðar Ingvarsson Juto, Timon Willi, Ravi Hammond, Akbir Khan, Christian Schroeder de Witt, et al. Jaxmarl: Multi-agent rl environments and algorithms in jax. Advances in Neural Information Processing Systems, 37:50925–50951, 2024

[59] Abdallah Saffidine, Nicolas Jouandeau, and Tristan Cazenave. Solving breakthrough with race patterns and job-level proof number search. In Advances in Computer Games: 13th International Conference, ACG 2011, Tilburg, The Netherlands, November 20-22, 2011, Revised Selected Papers 13, pages 196–207. Springer, 2012

[60] Mikayel Samvelyan, Tabish Rashid, Christian Schroeder De Witt, Gregory Farquhar, Nantas Nardelli, Tim GJ Rudner, Chia-Man Hung, Philip HS Torr, Jakob Foerster, and Shimon Whiteson. The starcraft multi-agent challenge. arXiv preprint arXiv:1902.04043, 2019

[61] ByteDance Seed. Doubao-1.5-pro, 2025

[62] Claude E Shannon. XXII. Programming a computer for playing chess. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 41(314):256–275, 1950

[63] Lloyd S Shapley. Stochastic games. Proceedings of the national academy of sciences, 39(10):1095–1100, 1953

[64] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016

[65] Finnegan Southey, Michael P Bowling, Bryce Larson, Carmelo Piccione, Neil Burch, Darse Billings, and Chris Rayner. Bayes’ bluff: Opponent modelling in poker. arXiv preprint arXiv:1207.1411, 2012

[66] DJ Strouse, Kevin McKee, Matt Botvinick, Edward Hughes, and Richard Everett. Collaborating with humans without human data. Advances in Neural Information Processing Systems, 34:14502–14515, 2021

[67] Zhenggang Tang, Chao Yu, Boyuan Chen, Huazhe Xu, Xiaolong Wang, Fei Fang, Simon Du, Yu Wang, and Yi Wu. Discovering diverse multi-agent strategic behavior via reward randomization. arXiv preprint arXiv:2103.04564, 2021

[68] Gemini Robotics Team, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montserrat Gonzalez Arenas, Travis Armstrong, Ashwin Balakrishna, Robert Baruch, Maria Bauza, Michiel Blokzijl, et al. Gemini robotics: Bringing ai into the physical world. arXiv preprint arXiv:2503.20020, 2025

[69] Qwen Team. Qvq-max: Think with evidence, 2025

[70] Gerald Tesauro. Td-gammon, a self-teaching backgammon program, achieves master-level play. Neural computation, 6(2):215–219, 1994

[71] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

[72] Dan Troyka. Breakthrough. About Board Games 8x8 Game Design Competition Winner, 2000

[73] Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019

[74] John von Neumann and Oskar Morgenstern. Theory of games and economic behavior: 60th anniversary commemorative edition. In Theory of games and economic behavior. Princeton university press, 2007

[75] Xinyu Wang, Bohan Zhuang, and Qi Wu. Are large vision language models good game players? arXiv preprint arXiv:2503.02358, 2025

[76] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

[77] Michael Wooldridge. An introduction to multiagent systems. John Wiley & Sons, 2009

[78] xAI. Grok-2 beta release, 2024

[79] Chengxing Xie, Canyu Chen, Feiran Jia, Ziyu Ye, Shiyang Lai, Kai Shu, Jindong Gu, Adel Bibi, Ziniu Hu, David Jurgens, et al. Can large language model agents simulate human trust behavior? In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

[80] Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh J Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, et al. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. Advances in Neural Information Processing Systems, 37:52040–52094, 2024

