ConventionPlay: Capability-Limited Training for Robust Ad-Hoc Collaboration

Abhishek Sriraman; Eleni Vasilaki; Robert Loftin

arXiv: 2604.18123 · v1 · submitted 2026-04-20 · cs.MA

Pith reviewed 2026-05-10 03:41 UTC · model grok-4.3

keywords: ad-hoc collaboration · reinforcement learning · cognitive hierarchies · multi-agent coordination · capability limits · probing · steering · convention selection

The pith

ConventionPlay trains agents to probe partners and steer ad-hoc teams toward effective conventions by learning against diverse capability-limited followers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ConventionPlay as a reinforcement learning method for ad-hoc collaboration, where agents must identify shared conventions and actively steer toward the best joint strategy when multiple options exist. It extends cognitive hierarchies by training a focal agent against a population of adaptive follower agents that vary in their capability limits. This produces agents that probe their partner's repertoire during interaction and choose to lead or follow accordingly. A sympathetic reader would care because many real-world coordination settings involve partners who can follow different conventions, making passive adaptation insufficient for high-efficiency outcomes.

Core claim

ConventionPlay is a reinforcement learning-based approach that extends cognitive hierarchies to include a diverse population of adaptive followers with varied capability limits. By training against these partners, the agent learns to probe its partner's repertoire, leading the team when possible and following when necessary. Results in canonical coordination tasks show superior coordination efficiency, especially in settings where conventions have differentiated payoffs.

What carries the argument

ConventionPlay, which trains a leader agent against a diverse population of capability-limited adaptive follower agents to build probing and steering behavior for selecting optimal joint conventions.
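To make the probe-then-lead-or-follow behavior concrete, here is a minimal Python sketch. It is a hand-coded heuristic in a toy repeated coordination game, not the learned policy or the environments from the paper. The payoff values, the `Follower` class, its adaptation rule, and the `patience` parameter are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical payoffs: PAYOFFS[c] is the joint reward when both agents
# pick convention c; any mismatch pays 0. Values are illustrative only.
PAYOFFS = [1.0, 2.0, 3.0]


@dataclass
class Follower:
    """Adaptive follower limited to a repertoire of conventions (assumed model)."""
    repertoire: frozenset
    current: int

    def act(self) -> int:
        return self.current

    def observe(self, leader_action: int) -> None:
        # Adapt only when the leader's convention is within this follower's repertoire.
        if leader_action in self.repertoire:
            self.current = leader_action


def probing_leader(follower, horizon=10, patience=2):
    """Probe conventions from best to worst; settle on the first one that matches."""
    order = sorted(range(len(PAYOFFS)), key=lambda c: -PAYOFFS[c])
    idx, waited, total, settled = 0, 0, 0.0, None
    for _ in range(horizon):
        action = settled if settled is not None else order[idx]
        partner = follower.act()
        if action == partner:
            total += PAYOFFS[action]
            settled = action  # coordination achieved, stop probing
        elif settled is None:
            waited += 1
            if waited >= patience:
                # Partner did not adapt in time: probe the next-best convention.
                waited, idx = 0, idx + 1
                if idx == len(order):
                    settled = partner  # defensive fallback: follow the partner
        follower.observe(action)
    return total


def passive_leader(follower, horizon=10):
    """Baseline that copies the partner's previous action and never probes."""
    total, guess = 0.0, 0
    for _ in range(horizon):
        partner = follower.act()
        if guess == partner:
            total += PAYOFFS[guess]
        follower.observe(guess)
        guess = partner
    return total


print(probing_leader(Follower(frozenset({0, 1}), current=0)))  # 14.0
print(passive_leader(Follower(frozenset({0, 1}), current=0)))  # 10.0
print(probing_leader(Follower(frozenset({0}), current=0)))     # 6.0
```

Under these assumed dynamics the probing leader gives up a few early rounds to pull a capable follower onto a better convention, and falls back to the partner's convention when the follower cannot move. Whether a trained ConventionPlay agent behaves this way is an empirical question for the paper's experiments.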

If this is right

Trained agents achieve higher coordination efficiency in tasks with multiple conventions that carry different payoffs.
Agents shift from passive adaptation to active probing of their partner's strategies during interaction.
The approach supports robust performance with previously unseen partners in ad-hoc settings.
Leading or following decisions emerge naturally from the learned probing behavior.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same training principle could be tested in continuous control or partially observable environments to check if probing scales beyond discrete canonical tasks.
Varying capability limits during training may offer a general way to build robustness in other multi-agent reinforcement learning domains where partner competence is uncertain.
Explicit diversity in follower capabilities during training could reduce the need for online adaptation mechanisms in deployed systems.

Load-bearing premise

That training against a diverse population of adaptive followers with varied capability limits produces agents that reliably probe and steer in real ad-hoc encounters with unseen partners.

What would settle it

A direct comparison showing that ConventionPlay agents fail to outperform standard reinforcement learning baselines in coordination efficiency when tested against partners whose capability profiles lie outside the training distribution.
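One way such a comparison could be organized is sketched below in Python. Nothing here comes from the paper: `run_episode`, the partner lists, and the definition of coordination efficiency as mean return divided by the maximum achievable return are assumptions chosen to keep the harness generic.

```python
from statistics import mean


def generalization_gap(run_episode, agent, train_partners, heldout_partners,
                       max_return, episodes=5):
    """Compare an agent's efficiency on training-like vs. held-out partners.

    run_episode(agent, partner) -> float is a hypothetical user-supplied
    function that plays one episode and returns the team's total reward.
    """
    def efficiency(partners):
        returns = [run_episode(agent, p) for p in partners for _ in range(episodes)]
        return mean(returns) / max_return

    in_dist = efficiency(train_partners)
    held_out = efficiency(heldout_partners)
    return {"in_dist": in_dist, "held_out": held_out, "gap": in_dist - held_out}


def compare_agents(run_episode, agents, train_partners, heldout_partners, max_return):
    """Run the same split for several agents, e.g. ConventionPlay and a baseline."""
    return {
        name: generalization_gap(run_episode, agent, train_partners,
                                 heldout_partners, max_return)
        for name, agent in agents.items()
    }
```

The claim would be in trouble if the held-out efficiency of the ConventionPlay agent were no higher than that of the baselines, however the in-distribution numbers look.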

Figures

Figures reproduced from arXiv: 2604.18123 by Abhishek Sriraman, Eleni Vasilaki, Robert Loftin.

**Figure 1.** Comparison of the ConventionPlay pipeline with existing ad-hoc collaboration methods. While baseline approaches diversify the 𝐾0 population, we introduce diversity among 𝐾1 followers to force the 𝐾2 agent to infer its partner’s limited repertoire and coordinate on the most effective shared convention.

**Figure 2.** Payoff matrices for the Repeated Matrix Game. Left: …

**Figure 3.** Trajectories of agents playing two variants of the …

**Figure 4.** Makeup of the generated 𝐾1 subsets. The top left image has cross-play of all the 𝐾0 population; clusters here roughly visualize compatible policies which can be considered conventions. The top middle shows how the stratified sampling has placed different agents in different clusters. The top right shows the maximum capability distribution across subsets, ensuring various 𝐾1 subsets have different capabilit…

**Figure 5.** Visualizing the steering behavior of ConventionPlay (blue) against 𝐾0 (left) and 𝐾1 (right) partners. In the 𝐾0 case, the agent probes for a better convention but converges on the partner’s choice when no adaptation is detected. Conversely, with a 𝐾1 partner, the agent persists in moving toward the highest value goal, then the second highest value goal, successfully influencing the adaptive partner to swi…

Original abstract

Ad-hoc collaboration often relies on identifying and adhering to shared conventions. However, when partners can follow multiple conventions, agents must do more than simply adapt; they must actively steer the team toward the most effective joint strategy. We present ConventionPlay, a reinforcement learning-based approach that extends cognitive hierarchies to include a diverse population of adaptive followers. By training against partners with varied capability limits, our agent learns to probe its partner's repertoire, leading the team when possible and following when necessary. Our results in canonical coordination tasks show that ConventionPlay achieves superior coordination efficiency, particularly in settings where conventions have differentiated payoffs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

ConventionPlay trains agents to probe and steer by facing varied capability-limited partners, a reasonable RL extension of cognitive hierarchies, but the transfer to unseen partners rests on an untested assumption.

The main contribution is a training regime that puts the learning agent up against a population of adaptive followers whose capability limits are varied during RL episodes. This is meant to produce an agent that can probe its partner, then lead or follow depending on what works best for the joint payoff in coordination games. It builds on cognitive hierarchies but adds diversity and adaptation on the follower side, which is a straightforward way to encourage active steering instead of pure adaptation. That part is clear and targets a practical gap in ad-hoc multi-agent settings where partners may not share the same convention set or ability level. The emphasis on differentiated payoffs is also sensible, since that is where steering should matter most. If the experiments include solid baselines and show gains there, the method could be useful for people building collaborative agents.

The soft spot is the generalization step. Training against a finite set of simulated partners with specific capability limits does not automatically guarantee good behavior against truly novel partners whose limits or adaptation rules fall outside that set. The abstract gives no numbers on how the limits are sampled or whether held-out partners were tested, so the superior efficiency claim depends on an assumption that may not hold. If the full paper has ablation or out-of-distribution results, that would tighten it; otherwise the central result looks fragile.

This is for researchers in multi-agent RL who work on convention formation and ad-hoc teams. A reader who needs a new training trick for coordination tasks could get something out of the method and any reported numbers. I would send it to peer review because the problem is real, the idea is grounded, and referees can push on the evaluation details.

Referee Report

3 major / 0 minor

Summary. The manuscript introduces ConventionPlay, a reinforcement learning method that extends cognitive hierarchies by training agents against a diverse population of adaptive followers with varied capability limits. The approach aims to enable agents to probe partners' repertoires, lead when possible, and follow when necessary in ad-hoc collaboration. The abstract claims that this yields superior coordination efficiency in canonical coordination tasks, especially when conventions have differentiated payoffs.

Significance. If the empirical claims are substantiated with proper controls and generalization tests, the work could advance multi-agent systems research by providing a concrete training paradigm for handling convention selection under capability heterogeneity, a common challenge in ad-hoc teamwork.

major comments (3)

Abstract: The central claim that ConventionPlay 'achieves superior coordination efficiency' is asserted without any reported experimental details, baselines, metrics, statistical tests, ablation studies, or quantitative results, making the performance assertion impossible to evaluate.
Method (training procedure): No quantitative characterization is given of how the capability limit distribution is sampled (discrete vs. continuous, range, coverage of payoff asymmetries) or the adaptation rules of the simulated followers, which is load-bearing for the generalization claim to unseen partners.
Results (generalization): The manuscript provides no held-out partner evaluation measuring performance degradation when test partners' capability limits or adaptation rules differ from the training population, leaving the distributional assumption for robust ad-hoc encounters untested.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their insightful comments on our manuscript. We provide point-by-point responses to the major comments below and outline the revisions we will make to address the raised concerns.

Referee: Abstract: The central claim that ConventionPlay 'achieves superior coordination efficiency' is asserted without any reported experimental details, baselines, metrics, statistical tests, ablation studies, or quantitative results, making the performance assertion impossible to evaluate.

Authors: We agree with the referee that the abstract should provide more context for the performance claims to allow proper evaluation. We will revise the abstract to include key experimental details, such as the specific coordination tasks, baselines (e.g., cognitive hierarchy and RL agents), metrics, and quantitative results with mention of statistical tests and ablations.
Revision: yes
Referee: Method (training procedure): No quantitative characterization is given of how the capability limit distribution is sampled (discrete vs. continuous, range, coverage of payoff asymmetries) or the adaptation rules of the simulated followers, which is load-bearing for the generalization claim to unseen partners.

Authors: We agree that additional quantitative details on the training procedure would strengthen the paper. We will revise the method section to provide a precise characterization of the capability limit distribution sampling, including whether it is discrete or continuous, the range, and coverage of payoff asymmetries, as well as a detailed description of the adaptation rules for the simulated followers.
Revision: yes
Referee: Results (generalization): The manuscript provides no held-out partner evaluation measuring performance degradation when test partners' capability limits or adaptation rules differ from the training population, leaving the distributional assumption for robust ad-hoc encounters untested.

Authors: We acknowledge that explicit held-out evaluations are important for validating the generalization to unseen partners. We will add experiments in the results section that test ConventionPlay against held-out partners with differing capability limits and adaptation rules, including measurements of performance degradation to better support the claims of robust ad-hoc collaboration.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical RL training against simulated partners

The paper describes an RL training procedure that generates agents by exposing them to a population of capability-limited adaptive followers, then evaluates coordination efficiency on canonical tasks. No derivation chain is claimed that reduces a reported result to a fitted parameter, self-referential definition, or load-bearing self-citation; the efficiency numbers are direct experimental outcomes. The generalization assumption to unseen partners is an empirical claim open to falsification rather than a tautology.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The approach rests on the domain assumption that capability limits can be parameterized to create a representative training distribution of partners; ConventionPlay itself is introduced as the new training procedure.

free parameters (1)

capability limit distribution
Varied capability limits for training partners are a core design element whose specific parameterization is not detailed in the abstract.

axioms (1)

domain assumption: Cognitive hierarchies provide a useful model of limited partner reasoning
The method explicitly extends this existing framework.

invented entities (1)

ConventionPlay training procedure (no independent evidence)
purpose: To produce agents that probe partner capabilities and steer toward effective conventions
New RL training paradigm introduced in the abstract.
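Because the capability limit distribution is the one free parameter in this ledger, it may help to see what a parameterization could look like. The Python sketch below is an assumption, loosely motivated by the Figure 4 caption's mention of stratified sampling over clusters of compatible policies. The function name `sample_k1_subsets`, the `cluster_of` mapping, and the rule of cycling repertoire sizes are all hypothetical and are not taken from the paper.

```python
import random


def sample_k1_subsets(cluster_of, n_subsets, rng):
    """Draw follower repertoires that cover varying numbers of conventions.

    cluster_of maps a base-policy id to a cluster label, where each cluster
    is treated as one convention (an assumed preprocessing step).
    """
    clusters = {}
    for policy_id, label in cluster_of.items():
        clusters.setdefault(label, []).append(policy_id)
    labels = sorted(clusters)

    subsets = []
    for i in range(n_subsets):
        # Cycle the repertoire size so subsets differ in capability.
        size = 1 + i % len(labels)
        chosen = rng.sample(labels, size)
        # Stratified draw: at most one policy per chosen cluster.
        subsets.append(sorted(rng.choice(clusters[label]) for label in chosen))
    return subsets


cluster_of = {"a0": 0, "a1": 0, "b0": 1, "c0": 2}
subsets = sample_k1_subsets(cluster_of, n_subsets=6, rng=random.Random(0))
print([len(s) for s in subsets])  # [1, 2, 3, 1, 2, 3]
```

Any such scheme fixes a training distribution over repertoires, which is exactly the quantity the referee report asks the authors to characterize.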

