Arena: a toolkit for Multi-Agent Reinforcement Learning

Jiechao Xiong; Lei Han; Meng Fang; Peng Sun; Qing Wang; Xinghai Sun; Zhengyou Zhang; Zhuobin Zheng

arxiv: 1907.09467 · v1 · pith:ZGEZF5KU · submitted 2019-07-20 · 💻 cs.LG · cs.AI · cs.MA

Qing Wang, Jiechao Xiong, Lei Han, Meng Fang, Xinghai Sun, Zhuobin Zheng, Peng Sun, Zhengyou Zhang

Pith reviewed 2026-05-24 18:33 UTC · model grok-4.3

keywords: multi-agent reinforcement learning · MARL toolkit · modular interface · OpenAI Gym wrappers · self-play · cooperative-competitive MARL · environment customization

The pith

Arena introduces modular interfaces that extend Gym wrappers to handle customizations in multi-agent reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Arena as a toolkit that tackles the frequent need in multi-agent reinforcement learning to customize observations, rewards, actions, and agent interactions for each scenario. Its central contribution is a modular Interface design that lets users concatenate different interfaces and embed them either inside wrapped environments or directly with agents. This approach supplies ready interfaces for platforms including StarCraft II, Pommerman, ViZDoom, and Soccer while supporting self-play and cooperative-competitive training. Readers would care because these customizations normally demand repeated engineering work when moving between platforms or training modes.

Core claim

The paper claims that Arena's Interface design handles observation, reward, and action customization in MARL through two mechanisms. First, interfaces can be concatenated and combined. Second, they can be embedded either in wrapped, OpenAI Gym-compatible environments or in agents that are compatible with the raw environment. Together these extend the Gym wrapper concept to multi-agent settings and enable off-the-shelf support for multiple platforms.

What carries the argument

Interface, a modular component that can be concatenated with others and embedded in environments or agents to customize multi-agent reinforcement learning routines.
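To make the stacking idea concrete, here is a minimal sketch in plain Python. It is not Arena's actual API: the names `Interface`, `obs_trans`, `act_trans`, `ScaleObs`, `ClipObs`, and `NamedAction` are hypothetical and chosen only for illustration. The sketch assumes that observations pass through the innermost interface first, while actions pass through the outermost interface first.

```python
from typing import Any, Optional


class Interface:
    """Hypothetical base class: identity transforms, optionally stacked on an inner interface."""

    def __init__(self, inner: Optional["Interface"] = None):
        self.inner = inner  # the interface closer to the raw environment, if any

    def obs_trans(self, obs: Any) -> Any:
        # Observations flow outward: the inner interface transforms first.
        if self.inner is not None:
            obs = self.inner.obs_trans(obs)
        return self._obs(obs)

    def act_trans(self, act: Any) -> Any:
        # Actions flow inward: this interface transforms first.
        act = self._act(act)
        if self.inner is not None:
            act = self.inner.act_trans(act)
        return act

    def _obs(self, obs: Any) -> Any:
        return obs

    def _act(self, act: Any) -> Any:
        return act


class ScaleObs(Interface):
    """Divide every observation entry by a constant."""

    def __init__(self, divisor: float, inner: Optional[Interface] = None):
        super().__init__(inner)
        self.divisor = divisor

    def _obs(self, obs):
        return [x / self.divisor for x in obs]


class ClipObs(Interface):
    """Clip every observation entry to [-1, 1]."""

    def _obs(self, obs):
        return [max(-1.0, min(1.0, x)) for x in obs]


class NamedAction(Interface):
    """Map a readable action name to a (made-up) raw discrete action id."""

    MOVES = {"noop": 0, "up": 1, "down": 2}

    def _act(self, act):
        return self.MOVES[act]


# Stack three interfaces: scale, then clip, then expose named actions.
stacked = NamedAction(ClipObs(ScaleObs(100.0)))
obs_out = stacked.obs_trans([50, 250, -300])  # scaled, then clipped
act_out = stacked.act_trans("up")             # name -> raw action id
```

Each piece stays small and reusable, and the stacking order determines the order in which the transforms are applied.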

If this is right

Interfaces support concatenation to build complex custom observation and reward schemes.
The same interfaces can be embedded in Gym-wrapped environments or used directly with raw agents.
Off-the-shelf interfaces are supplied for StarCraft II, Pommerman, ViZDoom, and Soccer.
The design enables both self-play reinforcement learning and cooperative-competitive hybrid training.
Users can extend Arena to additional MARL platforms by writing new interfaces.
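The dual embedding in the list above can be sketched as follows. This is an illustrative toy, not code from Arena: `TokenInterface`, `RawEnv`, `Policy`, `EnvWithInterface`, and `AgentWithInterface` are all hypothetical names, and the `step` return value follows the classic Gym convention of `(observation, reward, done, info)`.

```python
class TokenInterface:
    """Hypothetical interface: raw string obs -> token list, action index -> raw command."""

    COMMANDS = ["noop", "left", "right"]

    def obs_trans(self, obs):
        return obs.split()

    def act_trans(self, act):
        return self.COMMANDS[act]


class RawEnv:
    """Toy raw environment that speaks strings in both directions."""

    def __init__(self):
        self.last_command = None

    def reset(self):
        return "ball left"

    def step(self, command):
        self.last_command = command
        reward = 1.0 if command == "left" else 0.0
        return "ball left", reward, True, {}


class Policy:
    """Toy agent that expects token lists and emits discrete action indices."""

    def step(self, obs):
        return 1 if "left" in obs else 2


class EnvWithInterface:
    """Embed the interface on the environment side (training-style setup)."""

    def __init__(self, env, interface):
        self.env, self.interface = env, interface

    def reset(self):
        return self.interface.obs_trans(self.env.reset())

    def step(self, act):
        obs, rew, done, info = self.env.step(self.interface.act_trans(act))
        return self.interface.obs_trans(obs), rew, done, info


class AgentWithInterface:
    """Embed the same interface on the agent side, so the agent faces the raw env."""

    def __init__(self, agent, interface):
        self.agent, self.interface = agent, interface

    def step(self, raw_obs):
        act = self.agent.step(self.interface.obs_trans(raw_obs))
        return self.interface.act_trans(act)


inter = TokenInterface()

# Option 1: the interface wraps the environment; the bare policy plays against it.
env = EnvWithInterface(RawEnv(), inter)
_, reward_a, _, _ = env.step(Policy().step(env.reset()))

# Option 2: the interface wraps the agent; the environment stays raw.
raw_env = RawEnv()
agent = AgentWithInterface(Policy(), inter)
_, reward_b, _, _ = raw_env.step(agent.step(raw_env.reset()))
```

In both options the raw environment receives the same command, which is the property that lets an agent trained behind an environment-side interface later be run directly against a raw environment.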

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The modular structure could shorten the time needed to test new agent interaction rules by reusing stacked interfaces across experiments.
Standard interfaces might make it easier to compare algorithms on the same customization layer across different base environments.
The embedding choice between environment and agent sides could influence how easily third-party agents are integrated into training loops.
Similar interface patterns might reduce engineering overhead when adapting single-agent environments to multi-agent versions.

Load-bearing premise

The Interface design is sufficiently general and flexible to cover the customization needs of diverse MARL platforms and scenarios without requiring substantial extra code beyond the provided interfaces.

What would settle it

A developer trying to set up a complex new MARL scenario, such as dynamic team switching in an unsupported game, still needs to write large amounts of custom code even after combining all available interfaces.

Figures

Figures reproduced from arXiv: 1907.09467 by Jiechao Xiong, Lei Han, Meng Fang, Peng Sun, Qing Wang, Xinghai Sun, Zhengyou Zhang, Zhuobin Zheng.

**Figure 1.** Left: A standard loop in reinforcement learning: agents receive observations (and rewards) from the environment, and the environment receives the agents' actions and then evolves to the next state. Right: A gym wrapper can change the observation, reward, and action space of the wrapped environment, and is thus convenient for training learnable agents with structured inputs and outputs.

**Figure 2.** Interface Stacking: In this example, the interface I2 is stacked over…

**Figure 3.** Interface Combination: The interfaces I1 and I2 are combined and…

**Figure 4.** In Arena, we support wrapping an interface on an environment (left)…

**Figure 5.** Upper: For training homogeneous agents, we can wrap an interface to the original multi-agent environment. The observations transformed by the interface are received by each agent respectively, and agents’ actions are transformed by the interface before received by the original environment. Lower: For testing heterogeneous agents, the interface used by each agent in the training phase is wrapped on itself…

**Figure 6.** Combining agents as a team: The combined team receives a tuple of…

**Figure 7.** Gym Compatibility: In this example, we firstly wrap a gym wrapper…

**Figure 8.** Screenshots of currently supported environments in Arena.

**Figure 9.** Win-rate against rule-based Agent in Pong-2p. Horizontal: self-play…

**Figure 10.** Average test winning rate vs. training steps (i.e., number of passed…

**Figure 11.** Win rates against built-in AI. A.4 Pommerman: Pommerman [19] is a popular environment based on the game of Bomberman for multi-agent learning. A typical game of Pommerman has 4 agents, each of which can move and place bombs on the playground. The action space for each agent is a discrete space of 6 actions: {Idle, Move Up, Move Down, Move Left, Move Right, Place a Bomb}. The limit of episode length is 800 steps. The…

**Figure 12.** Win rates against rule-based SimpleAgent.

**Figure 13.** Episodic Frags against built-in AI in ViZDoom Death Match. Hori…

**Figure 14.** Win rates against a random agent.

Original abstract

We introduce Arena, a toolkit for multi-agent reinforcement learning (MARL) research. In MARL, it usually requires customizing observations, rewards and actions for each agent, changing cooperative-competitive agent-interaction, and playing with/against a third-party agent, etc. We provide a novel modular design, called Interface, for manipulating such routines in essentially two ways: 1) Different interfaces can be concatenated and combined, which extends the OpenAI Gym Wrappers concept to MARL scenarios. 2) During MARL training or testing, interfaces can be embedded in either wrapped OpenAI Gym compatible Environments or raw environment compatible Agents. We offer off-the-shelf interfaces for several popular MARL platforms, including StarCraft II, Pommerman, ViZDoom, Soccer, etc. The interfaces effectively support self-play RL and cooperative-competitive hybrid MARL. Also, Arena can be conveniently extended to your own favorite MARL platform.
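The abstract distinguishes concatenating interfaces from combining them. One plausible reading of combination, sketched below under that assumption, is a side-by-side composition in which the i-th interface handles the i-th agent's observation and action. All names here (`Combine`, `Normalize`, `Mirror`) are hypothetical and are not taken from Arena; the mirroring trick for a second player is likewise only an illustration of why per-agent interfaces can be handy in self-play.

```python
class Normalize:
    """Hypothetical per-agent interface: divide observation entries by a maximum value."""

    def __init__(self, max_value):
        self.max_value = max_value

    def obs_trans(self, obs):
        return [x / self.max_value for x in obs]

    def act_trans(self, act):
        return act  # actions pass through unchanged


class Mirror:
    """Hypothetical per-agent interface: flip the field so player 2 sees it from its own side."""

    SWAP = {0: 0, 1: 2, 2: 1}  # noop stays, left and right are exchanged

    def obs_trans(self, obs):
        return list(reversed(obs))

    def act_trans(self, act):
        return self.SWAP[act]


class Combine:
    """Apply interfaces[i] to the i-th agent's observation and action."""

    def __init__(self, interfaces):
        self.interfaces = list(interfaces)

    def obs_trans(self, obs_tuple):
        if len(obs_tuple) != len(self.interfaces):
            raise ValueError("one observation per combined interface is required")
        return tuple(i.obs_trans(o) for i, o in zip(self.interfaces, obs_tuple))

    def act_trans(self, act_tuple):
        if len(act_tuple) != len(self.interfaces):
            raise ValueError("one action per combined interface is required")
        return tuple(i.act_trans(a) for i, a in zip(self.interfaces, act_tuple))


# Two agents, each with its own interface, handled as one tuple-valued interface.
combined = Combine([Normalize(10.0), Mirror()])
obs = combined.obs_trans(([5, 10], [1, 2, 3]))
acts = combined.act_trans((1, 1))
```

Because `Combine` exposes the same two methods as its parts, it can itself be wrapped around an environment or stacked further, which is the composability the abstract describes.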

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Arena is a practical MARL toolkit whose main contribution is a modular Interface that extends Gym wrappers via concatenation and dual embedding, but the paper stays high-level with little demonstration.

The paper introduces Arena, a toolkit for multi-agent reinforcement learning. The core new piece is its Interface design, which extends OpenAI Gym wrappers in two ways: different interfaces can be concatenated, and an interface can be embedded either in the environment or in the agents themselves. This setup is meant to handle the usual MARL chores: customizing per-agent observations and rewards, switching between cooperative and competitive modes, and incorporating self-play or third-party agents. The authors provide pre-built interfaces for StarCraft II, Pommerman, ViZDoom, and Soccer, and say the toolkit supports self-play and hybrid MARL out of the box. They also claim the design is easy to extend to other platforms.

What the paper does well is identify a real pain point in MARL experiment setup and propose a modular way to manage it without reinventing wrappers for each new scenario. The concatenation idea and the choice of where to attach the interface are practical software decisions that could reduce boilerplate code.

The main limitation is the lack of depth in the presentation. The abstract and description stay high-level, with no examples of how an interface is actually written or used, no measurements of setup time saved, and no comparison to other toolkits. That makes it hard to judge whether the claimed flexibility holds up when applied to something outside the listed platforms. The assumption that off-the-shelf interfaces will cover most needs without much extra work is stated but not tested in the paper.

This work is aimed at people actively running MARL experiments who need to customize environments quickly. Readers interested in new algorithms or theoretical results will find little here. It is worth sending to peer review because documenting and reviewing such toolkits helps the community share engineering solutions, even if the paper itself is more of an announcement than a deep analysis. I'd recommend sending it for review.

Referee Report

1 major / 1 minor

Summary. The paper introduces Arena, a toolkit for multi-agent reinforcement learning (MARL) research. It presents a novel modular design called Interface that allows different interfaces to be concatenated and combined to customize observations, rewards, and actions for agents in MARL scenarios, extending the concept of OpenAI Gym Wrappers. These interfaces can be embedded either in wrapped Gym-compatible environments or in raw environment-compatible agents. The toolkit provides off-the-shelf interfaces for platforms such as StarCraft II, Pommerman, ViZDoom, and Soccer, supporting self-play RL and cooperative-competitive hybrid MARL, and is designed to be extensible to other platforms.

Significance. If the Interface design functions as described, it could provide a valuable tool for MARL researchers by offering a flexible, composable way to handle agent-specific customizations and interactions, building upon the popular Gym framework. This has the potential to reduce development effort for complex MARL setups involving self-play and mixed cooperative-competitive dynamics.

major comments (1)

[Abstract] The central claim that the Interface design 'effectively support[s] self-play RL and cooperative-competitive hybrid MARL' and extends Gym Wrappers via concatenation/embedding is presented without any code snippets, usage examples, or verification steps in the manuscript. This leaves the generality and practicality of the design (the weakest assumption noted in the review) unsubstantiated, which is load-bearing for the paper's contribution as a toolkit.

minor comments (1)

The abstract uses vague phrasing such as 'etc.' when listing customization routines; replacing this with a short enumerated list of additional examples would improve clarity.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment and the specific suggestion for improvement. We address the single major comment below.

Referee: [Abstract] The central claim that the Interface design 'effectively support[s] self-play RL and cooperative-competitive hybrid MARL' and extends Gym Wrappers via concatenation/embedding is presented without any code snippets, usage examples, or verification steps in the manuscript. This leaves the generality and practicality of the design (the weakest assumption noted in the review) unsubstantiated, which is load-bearing for the paper's contribution as a toolkit.

Authors: We agree that the abstract (and, upon re-examination, the main text) would benefit from explicit usage examples to demonstrate concatenation/embedding and the resulting support for self-play and hybrid settings. The manuscript describes the two embedding modes and lists the provided interfaces for StarCraft II, Pommerman, etc., but does not include concrete code or verification steps. We will add a short usage example (including a code snippet) in the revised manuscript, most naturally in Section 3 or a new short “Usage” subsection, to make the practicality of the design explicit. revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper presents a software toolkit and modular Interface design for MARL environments. It contains no equations, derivations, predictions, fitted parameters, or uniqueness theorems. The central claims concern concatenation of interfaces and dual embedding (environment or agent), which are described as design features without any reduction to self-referential inputs or self-citations that bear the load of a result. This is a pure architecture description whose correctness is evaluated by implementation and usage rather than internal logical closure.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The paper contributes a new software abstraction rather than relying on prior mathematical axioms or fitted parameters. The Interface is an invented entity in the context of this work with no independent evidence provided beyond the description.

invented entities (1)

Interface (no independent evidence)
purpose: A modular design for manipulating observations, rewards, and actions in MARL scenarios that can be concatenated and embedded in environments or agents.
Presented as a novel concept in the paper to extend Gym wrappers to multi-agent settings.

pith-pipeline@v0.9.0 · 5706 in / 1204 out tokens · 29508 ms · 2026-05-24T18:33:56.765046+00:00 · methodology

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · 8 internal anchors

[1]

Human-level control through deep reinforcement learning

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcemen...

[2]

Mastering the game of Go with deep neural networks and tree search

David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the ...

[3]

OpenAI Five

OpenAI. OpenAI Five. https://blog.openai.com/openai-five/, 2018

[4]

Human-level performance in first-person multiplayer games with population-based deep reinforcement learning

Max Jaderberg, Wojciech M. Czarnecki, Iain Dunning, Luke Marris, Guy Lever, Antonio García Castañeda, Charles Beattie, Neil C. Rabinowitz, Ari S. Morcos, Avraham Ruderman, Nicolas Sonnerat, Tim Green, Louise Deason, Joel Z. Leibo, David Silver, Demis Hassabis, Koray Kavukcuoglu, and Thore Graepel. Human-level performance in first-person multiplayer g...

[5]

AlphaStar: Mastering the real-time strategy game StarCraft II

The AlphaStar team. AlphaStar: Mastering the real-time strategy game StarCraft II. https://deepmind.com/blog/alphastar-mastering-real-time-strategy-game-starcraft-ii/, 2019

[6]

OpenAI Gym

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym, 2016

[7]

DeepMind Lab

Charles Beattie, Joel Z Leibo, Denis Teplyashin, Tom Ward, Marcus Wainwright, Heinrich Küttler, Andrew Lefrancq, Simon Green, Víctor Valdés, Amir Sadik, et al. DeepMind Lab. arXiv preprint arXiv:1612.03801, 2016

[8]

Gotta Learn Fast: A New Benchmark for Generalization in RL

Alex Nichol, Vicki Pfau, Christopher Hesse, Oleg Klimov, and John Schulman. Gotta learn fast: A new benchmark for generalization in RL. arXiv preprint arXiv:1804.03720, 2018

[9]

DeepMind Control Suite

Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, et al. DeepMind Control Suite. arXiv preprint arXiv:1801.00690, 2018

[10]

A Markovian decision process

Richard Bellman. A Markovian decision process. Journal of Mathematics and Mechanics, pages 679–684, 1957

[11]

Optimal control of Markov processes with incomplete state information

Karl Johan Åström. Optimal control of Markov processes with incomplete state information. Journal of Mathematical Analysis and Applications, 10(1):174–205, 1965

[12]

Stochastic games

Lloyd S Shapley. Stochastic games. Proceedings of the National Academy of Sciences, 39(10):1095–1100, 1953

[13]

Counterfactual multi-agent policy gradients

Jakob N Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018

[14]

Grid-wise control for multi-agent reinforcement learning in video game AI

Lei Han, Peng Sun, Yali Du, Jiechao Xiong, Qing Wang, Xinghai Sun, Han Liu, and Tong Zhang. Grid-wise control for multi-agent reinforcement learning in video game AI. In Proceedings of the 36th International Conference on Machine Learning (ICML), pages 2576–2585, 2019

[15]

Emergence of Grounded Compositional Language in Multi-Agent Populations

Igor Mordatch and Pieter Abbeel. Emergence of grounded compositional language in multi-agent populations. arXiv preprint arXiv:1703.04908, 2017

[16]

Emergent Complexity via Multi-Agent Competition

Trapit Bansal, Jakub Pachocki, Szymon Sidor, Ilya Sutskever, and Igor Mordatch. Emergent complexity via multi-agent competition. arXiv preprint arXiv:1710.03748, 2017

[17]

The StarCraft Multi-Agent Challenge

Mikayel Samvelyan, Tabish Rashid, Christian Schroeder de Witt, Gregory Farquhar, Nantas Nardelli, Tim G. J. Rudner, Chia-Man Hung, Philiph H. S. Torr, Jakob Foerster, and Shimon Whiteson. The StarCraft Multi-Agent Challenge. CoRR, abs/1902.04043, 2019

[18]

StarCraft II: A New Challenge for Reinforcement Learning

Oriol Vinyals, Timo Ewalds, Sergey Bartunov, Petko Georgiev, Alexander Sasha Vezhnevets, Michelle Yeo, Alireza Makhzani, Heinrich Küttler, John Agapiou, Julian Schrittwieser, John Quan, Stephen Gaffney, Stig Petersen, Karen Simonyan, Tom Schaul, Hado van Hasselt, David Silver, Timothy P. Lillicrap, Kevin Calderone, Paul Keet, Anthony Brunasso, Davi...

[19]

Pommerman: A multi-agent playground

Cinjon Resnick, Wes Eldridge, David Ha, Denny Britz, Jakob Foerster, Julian Togelius, Kyunghyun Cho, and Joan Bruna. Pommerman: A multi-agent playground. arXiv preprint arXiv:1809.07124, 2018

[20]

ViZDoom competitions: Playing Doom from pixels

Marek Wydmuch, Michał Kempka, and Wojciech Jaśkowski. ViZDoom competitions: Playing Doom from pixels. IEEE Transactions on Games, 2018

[21]

Emergent coordination through competition

Siqi Liu, Guy Lever, Josh Merel, Saran Tunyasuvunakool, Nicolas Heess, and Thore Graepel. Emergent coordination through competition. arXiv preprint arXiv:1902.07151, 2019

[22]

Xinghai Sun Pong

Xinghai Sun. Xinghai Sun Pong. https://github.com/xinghai-sun/deep-rl/blob/master/docs/selfplay_pong.md

[23]

Steven Hewitt Pong

Steven Hewitt. Steven Hewitt Pong. https://github.com/Steven-Hewitt/Multi-Agent-Pong-Rally

[24]

TStarBots: Defeating the Cheating Level Builtin AI in StarCraft II in the Full Game

Peng Sun, Xinghai Sun, Lei Han, Jiechao Xiong, Qing Wang, Bo Li, Yang Zheng, Ji Liu, Yongsheng Liu, Han Liu, and Tong Zhang. TStarBots: Defeating the cheating level builtin AI in StarCraft II in the full game. arXiv preprint arXiv:1809.07193, 2018

[25]

ViZDoom: A Doom-based AI research platform for visual reinforcement learning

Michał Kempka, Marek Wydmuch, Grzegorz Runc, Jakub Toczek, and Wojciech Jaśkowski. ViZDoom: A Doom-based AI research platform for visual reinforcement learning. In IEEE Conference on Computational Intelligence and Games, pages 341–348, Santorini, Greece, Sep 2016. IEEE. The best paper award

[26]

Training agent for first-person shooter game with actor-critic curriculum learning

Yuxin Wu and Yuandong Tian. Training agent for first-person shooter game with actor-critic curriculum learning. In ICLR, 2016

[27]

Learning to act by predicting the future

Alexey Dosovitskiy and Vladlen Koltun. Learning to act by predicting the future. In ICLR, 2017

[28]

Combo-action: Training agent for FPS game with auxiliary tasks

Shiyu Huang, Hang Su, Jun Zhu, and Ting Chen. Combo-action: Training agent for FPS game with auxiliary tasks. In AAAI, 2019

[29]

MuJoCo: A physics engine for model-based control

Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pages 5026–5033. IEEE, 2012
