On Multi-Agent Learning in Team Sports Games

Ahmad Beirami; Caedmon Somers; Igor Borovikov; Jason Rupert; Yunqi Zhao

arxiv: 1906.10124 · v1 · pith:6V75XGONnew · submitted 2019-06-25 · 💻 cs.MA · cs.AI· cs.HC· cs.LG

On Multi-Agent Learning in Team Sports Games

Yunqi Zhao , Igor Borovikov , Jason Rupert , Caedmon Somers , Ahmad Beirami This is my paper

Pith reviewed 2026-05-25 15:49 UTC · model grok-4.3

classification 💻 cs.MA cs.AIcs.HCcs.LG

keywords multi-agent learningreinforcement learningteam sports gameshierarchical approachhuman-like agentsvideo game AIplaytesting

0 comments

The pith

A hierarchical approach to multi-agent reinforcement learning shows promise for producing human-like agents in team sports games.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that end-to-end model-free reinforcement learning succeeds in single-agent games but lacks sample efficiency and rarely yields human-like behavior, making it unsuitable for playtesting and AI development in complex video games. It proposes a hierarchical decomposition of the multi-agent problem as an alternative route to agents that combine high skill with human-like style specifically in team sports settings. A sympathetic reader would care because such agents could shorten development cycles by providing reliable opponents and test partners without exhaustive compute. Preliminary results are presented as evidence that the decomposition holds promise for this class of problems.

Core claim

The authors present a hierarchical approach to training agents and report that their preliminary results indicate this method holds promise for solving the multi-agent learning problem of achieving both human-like style and high skill level in team sports games, where end-to-end model-free RL is unlikely to succeed.

What carries the argument

The hierarchical approach, which decomposes the multi-agent learning task into layered sub-problems to train agents for team sports games.

If this is right

Agents trained this way can serve as human-like opponents and test partners during video game development.
The approach reduces the sample and compute demands compared with end-to-end model-free RL for multi-agent team settings.
The same decomposition can be applied to other team-based game environments that require coordinated behavior.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the hierarchy succeeds, it could be tested by measuring how closely the learned policies match human action distributions rather than just win rates.
The method might extend to non-game multi-agent coordination problems that share the same need for interpretable, style-preserving behavior.
A concrete next step would be an ablation that isolates which layers of the hierarchy most affect human-likeness versus raw performance.

Load-bearing premise

A hierarchical decomposition of the task will succeed at producing human-like style and high skill where end-to-end model-free reinforcement learning is stated to be unlikely to do so.

What would settle it

An experiment that trains the hierarchical agents and directly compares their play style and skill metrics against human players in the target team sports game, showing no measurable improvement over end-to-end baselines.

Figures

Figures reproduced from arXiv: 1906.10124 by Ahmad Beirami, Caedmon Somers, Igor Borovikov, Jason Rupert, Yunqi Zhao.

**Figure 1.** Figure 1: A screen shot of the simple team sports simulator (STS2). The red agents are home agents attempting to score at the upper end and the white agents are away agents attempting to score the lower end. The highlighted player has the possession of the ball. provide the agent with any additional form of information that could ease training and might otherwise be hard to infer from the screen pixels. Our ultimate… view at source ↗

read the original abstract

In recent years, reinforcement learning has been successful in solving video games from Atari to Star Craft II. However, the end-to-end model-free reinforcement learning (RL) is not sample efficient and requires a significant amount of computational resources to achieve superhuman level performance. Model-free RL is also unlikely to produce human-like agents for playtesting and gameplaying AI in the development cycle of complex video games. In this paper, we present a hierarchical approach to training agents with the goal of achieving human-like style and high skill level in team sports games. While this is still work in progress, our preliminary results show that the presented approach holds promise for solving the posed multi-agent learning problem.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes a hierarchical approach to multi-agent reinforcement learning for team sports games, with the goal of achieving human-like style and high skill levels. It argues that end-to-end model-free RL is sample-inefficient and unlikely to yield human-like agents, and states that preliminary results (while the work remains in progress) indicate the hierarchical method holds promise for the multi-agent problem.

Significance. A working hierarchical decomposition that delivers both human-like behavior and high performance in multi-agent sports settings would be relevant to game AI development, where sample efficiency and stylistic fidelity matter. No architecture, training procedure, environment, or results are supplied, so the potential cannot be evaluated.

major comments (2)

[Abstract] Abstract: the statement that 'our preliminary results show that the presented approach holds promise' is unsupported; the manuscript contains no methods, environments, baselines, metrics, or data of any kind.
[Abstract] Abstract: the assertion that end-to-end model-free RL is 'unlikely to produce human-like agents' is offered without justification, prior-work citations, or any comparative argument.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback. We acknowledge the concerns regarding unsupported claims in the abstract of this preliminary manuscript and will revise accordingly.

read point-by-point responses

Referee: [Abstract] Abstract: the statement that 'our preliminary results show that the presented approach holds promise' is unsupported; the manuscript contains no methods, environments, baselines, metrics, or data of any kind.

Authors: We agree that the manuscript contains no empirical results, methods, or data, as this remains a conceptual proposal at an early stage. The reference to 'preliminary results' is not supported by any presented evidence. We will revise the abstract to remove this claim and clarify that the work proposes a hierarchical approach without current experimental validation. revision: yes
Referee: [Abstract] Abstract: the assertion that end-to-end model-free RL is 'unlikely to produce human-like agents' is offered without justification, prior-work citations, or any comparative argument.

Authors: We accept that the assertion is presented without supporting citations or argument in the current text. We will revise to include relevant prior-work citations on RL in games and discussions of human-like behavior to provide justification or context for the claim. revision: yes

Circularity Check

0 steps flagged

No derivation chain or equations present; paper is explicitly work-in-progress with no methods or results shown.

full rationale

The manuscript contains no equations, fitted parameters, predictions, self-citations of theorems, or any derivation steps that could reduce to inputs by construction. It is labeled as work-in-progress and supplies only high-level motivation in the abstract, with the central claim resting on unreported 'preliminary results.' Per the hard rules, when no load-bearing mathematical steps exist, the circularity score is 0 and steps is left empty. The absence of any chain makes circularity analysis inapplicable.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no free parameters, axioms, or invented entities are described in the provided text.

pith-pipeline@v0.9.0 · 5653 in / 965 out tokens · 22612 ms · 2026-05-25T15:49:36.749248+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · 12 internal anchors

[1]

AlphaStar,

[Online, May 2018] https:// blog.openai.com/ai-and-compute. AlphaStar,

work page 2018
[2]

Learning Dexterous In-Hand Manipulation

[Online, January 2019] https:// tinyurl.com/yc2knerv. Andrychowicz, M., Baker, B., Chociej, M., Jozefowicz, R., McGrew, B., Pachocki, J., Petron, A., Plappert, M., Powell, G., Ray, A., et al. Learning dexterous in-hand manipulation. arXiv preprint arXiv:1808.00177,

work page internal anchor Pith review Pith/arXiv arXiv 2019
[3]

com/ibm/history/ibm100/us/en/icons/ deepblue

[Online] http://www-03.ibm. com/ibm/history/ibm100/us/en/icons/ deepblue. Devlin, S., Yliniemi, L., Kudenko, D., and Tumer, K. Potential-based difference rewards for multiagent rein- forcement learning. In Proceedings of the 2014 interna- tional conference on Autonomous agents and multi-agent systems, pp. 165–172. International Foundation for Au- tonomous...

work page 2014
[4]

Hernandez-Leal, P., Kaisers, M., Baarslag, T., and de Cote, E. M. A survey of learning in multiagent environ- ments: Dealing with non-stationarity. arXiv preprint arXiv:1707.09183,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

Hernandez-Leal, P., Kartal, B., and Taylor, M. E. Is multia- gent deep reinforcement learning the answer or the ques- tion? a brief survey. arXiv preprint arXiv:1810.05587,

work page arXiv
[6]

Rainbow: Combining Improvements in Deep Reinforcement Learning

Hessel, M., Modayil, J., Van Hasselt, H., Schaul, T., Ostro- vski, G., Dabney, W., Horgan, D., Piot, B., Azar, M., and Silver, D. Rainbow: Combining improvements in deep reinforcement learning. arXiv preprint arXiv:1710.02298,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Kartal, B., Hernandez-Leal, P., Gao, C., and Taylor, M. E. Safer deep RL with shallow MCTS: A case study in Pommerman. arXiv preprint arXiv:1904.05759,

work page internal anchor Pith review Pith/arXiv arXiv 1904
[8]

Littman, M. L. Markov games as a framework for multi- agent reinforcement learning. In Machine learning pro- ceedings 1994, pp. 157–163. Elsevier,

work page 1994
[9]

Learning latent plans from play

Lynch, C., Khansari, M., Xiao, T., Kumar, V ., Tompson, J., Levine, S., and Sermanet, P. Learning latent plans from play. arXiv preprint arXiv:1903.01973,

work page arXiv 1903
[10]

On Reinforcement Learning for Full-length Game of StarCraft

[Online, June 2018] https:// openai.com/five. Pang, Z.-J., Liu, R.-Z., Meng, Z.-Y ., Zhang, Y ., Yu, Y ., and Lu, T. On reinforcement learning for full-length game of starcraft. arXiv preprint arXiv:1809.09095,

work page internal anchor Pith review Pith/arXiv arXiv 2018
[11]

Learning Complex Dexterous Manipulation with Deep Reinforcement Learning and Demonstrations

Rajeswaran, A., Kumar, V ., Gupta, A., Vezzani, G., Schul- man, J., Todorov, E., and Levine, S. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. arXiv preprint arXiv:1709.10087,

work page internal anchor Pith review Pith/arXiv arXiv
[12]

Prioritized Experience Replay

Schaul, T., Quan, J., Antonoglou, I., and Silver, D. Priori- tized experience replay.arXiv preprint arXiv:1511.05952,

work page internal anchor Pith review Pith/arXiv arXiv
[13]

Proximal Policy Optimization Algorithms

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347,

work page internal anchor Pith review Pith/arXiv arXiv
[14]

Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Grae- pel, T., et al. Mastering Chess and Shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017a. Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T....

work page internal anchor Pith review Pith/arXiv arXiv
[15]

Leveraging Demonstrations for Deep Reinforcement Learning on Robotics Problems with Sparse Rewards

Veˇcer´ık, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Roth¨orl, T., Lampe, T., and Riedmiller, M. Leveraging demonstrations for deep reinforcement On Multi-Agent Learning in Team Sports Games learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817,

work page internal anchor Pith review Pith/arXiv arXiv
[16]

StarCraft II: A New Challenge for Reinforcement Learning

Vinyals, O., Ewalds, T., Bartunov, S., Georgiev, P., Vezhn- evets, A. S., Yeo, M., Makhzani, A., K¨uttler, H., Agapiou, J., Schrittwieser, J., et al. StarCraft II: A new challenge for reinforcement learning. arXiv preprint arXiv:1708.04782,

work page internal anchor Pith review Pith/arXiv arXiv
[17]

CM3: Cooperative multi-goal multi-stage multi-agent reinforcement learning

Yang, J., Nakhaei, A., Isele, D., Zha, H., and Fujimura, K. CM3: Cooperative multi-goal multi-stage multi-agent reinforcement learning. arXiv preprint arXiv:1809.05188,

work page arXiv
[18]

Generating Multi-Agent Trajectories using Programmatic Weak Supervision

Zhan, E., Zheng, S., Yue, Y ., Sha, L., and Lucey, P. Gener- ating multi-agent trajectories using programmatic weak supervision. arXiv preprint arXiv:1803.07612,

work page internal anchor Pith review Pith/arXiv arXiv
[19]

Zhao, Y ., Borovikov, I., Beirami, A., Rupert, J., Somers, C., Harder, J., Silva, F. d. M., Kolen, J., Pinto, J., Pourabol- ghasem, R., et al. Winning Isn’t Everything: Training Human-Like Agents for Playtesting and Game AI. arXiv preprint arXiv:1903.10545, 2019

work page arXiv 1903

[1] [1]

AlphaStar,

[Online, May 2018] https:// blog.openai.com/ai-and-compute. AlphaStar,

work page 2018

[2] [2]

Learning Dexterous In-Hand Manipulation

[Online, January 2019] https:// tinyurl.com/yc2knerv. Andrychowicz, M., Baker, B., Chociej, M., Jozefowicz, R., McGrew, B., Pachocki, J., Petron, A., Plappert, M., Powell, G., Ray, A., et al. Learning dexterous in-hand manipulation. arXiv preprint arXiv:1808.00177,

work page internal anchor Pith review Pith/arXiv arXiv 2019

[3] [3]

com/ibm/history/ibm100/us/en/icons/ deepblue

[Online] http://www-03.ibm. com/ibm/history/ibm100/us/en/icons/ deepblue. Devlin, S., Yliniemi, L., Kudenko, D., and Tumer, K. Potential-based difference rewards for multiagent rein- forcement learning. In Proceedings of the 2014 interna- tional conference on Autonomous agents and multi-agent systems, pp. 165–172. International Foundation for Au- tonomous...

work page 2014

[4] [4]

Hernandez-Leal, P., Kaisers, M., Baarslag, T., and de Cote, E. M. A survey of learning in multiagent environ- ments: Dealing with non-stationarity. arXiv preprint arXiv:1707.09183,

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

Hernandez-Leal, P., Kartal, B., and Taylor, M. E. Is multia- gent deep reinforcement learning the answer or the ques- tion? a brief survey. arXiv preprint arXiv:1810.05587,

work page arXiv

[6] [6]

Rainbow: Combining Improvements in Deep Reinforcement Learning

Hessel, M., Modayil, J., Van Hasselt, H., Schaul, T., Ostro- vski, G., Dabney, W., Horgan, D., Piot, B., Azar, M., and Silver, D. Rainbow: Combining improvements in deep reinforcement learning. arXiv preprint arXiv:1710.02298,

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

Kartal, B., Hernandez-Leal, P., Gao, C., and Taylor, M. E. Safer deep RL with shallow MCTS: A case study in Pommerman. arXiv preprint arXiv:1904.05759,

work page internal anchor Pith review Pith/arXiv arXiv 1904

[8] [8]

Littman, M. L. Markov games as a framework for multi- agent reinforcement learning. In Machine learning pro- ceedings 1994, pp. 157–163. Elsevier,

work page 1994

[9] [9]

Learning latent plans from play

Lynch, C., Khansari, M., Xiao, T., Kumar, V ., Tompson, J., Levine, S., and Sermanet, P. Learning latent plans from play. arXiv preprint arXiv:1903.01973,

work page arXiv 1903

[10] [10]

On Reinforcement Learning for Full-length Game of StarCraft

[Online, June 2018] https:// openai.com/five. Pang, Z.-J., Liu, R.-Z., Meng, Z.-Y ., Zhang, Y ., Yu, Y ., and Lu, T. On reinforcement learning for full-length game of starcraft. arXiv preprint arXiv:1809.09095,

work page internal anchor Pith review Pith/arXiv arXiv 2018

[11] [11]

Learning Complex Dexterous Manipulation with Deep Reinforcement Learning and Demonstrations

Rajeswaran, A., Kumar, V ., Gupta, A., Vezzani, G., Schul- man, J., Todorov, E., and Levine, S. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. arXiv preprint arXiv:1709.10087,

work page internal anchor Pith review Pith/arXiv arXiv

[12] [12]

Prioritized Experience Replay

Schaul, T., Quan, J., Antonoglou, I., and Silver, D. Priori- tized experience replay.arXiv preprint arXiv:1511.05952,

work page internal anchor Pith review Pith/arXiv arXiv

[13] [13]

Proximal Policy Optimization Algorithms

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347,

work page internal anchor Pith review Pith/arXiv arXiv

[14] [14]

Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Grae- pel, T., et al. Mastering Chess and Shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017a. Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T....

work page internal anchor Pith review Pith/arXiv arXiv

[15] [15]

Leveraging Demonstrations for Deep Reinforcement Learning on Robotics Problems with Sparse Rewards

Veˇcer´ık, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Roth¨orl, T., Lampe, T., and Riedmiller, M. Leveraging demonstrations for deep reinforcement On Multi-Agent Learning in Team Sports Games learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817,

work page internal anchor Pith review Pith/arXiv arXiv

[16] [16]

StarCraft II: A New Challenge for Reinforcement Learning

Vinyals, O., Ewalds, T., Bartunov, S., Georgiev, P., Vezhn- evets, A. S., Yeo, M., Makhzani, A., K¨uttler, H., Agapiou, J., Schrittwieser, J., et al. StarCraft II: A new challenge for reinforcement learning. arXiv preprint arXiv:1708.04782,

work page internal anchor Pith review Pith/arXiv arXiv

[17] [17]

CM3: Cooperative multi-goal multi-stage multi-agent reinforcement learning

Yang, J., Nakhaei, A., Isele, D., Zha, H., and Fujimura, K. CM3: Cooperative multi-goal multi-stage multi-agent reinforcement learning. arXiv preprint arXiv:1809.05188,

work page arXiv

[18] [18]

Generating Multi-Agent Trajectories using Programmatic Weak Supervision

Zhan, E., Zheng, S., Yue, Y ., Sha, L., and Lucey, P. Gener- ating multi-agent trajectories using programmatic weak supervision. arXiv preprint arXiv:1803.07612,

work page internal anchor Pith review Pith/arXiv arXiv

[19] [19]

Zhao, Y ., Borovikov, I., Beirami, A., Rupert, J., Somers, C., Harder, J., Silva, F. d. M., Kolen, J., Pinto, J., Pourabol- ghasem, R., et al. Winning Isn’t Everything: Training Human-Like Agents for Playtesting and Game AI. arXiv preprint arXiv:1903.10545, 2019

work page arXiv 1903