arxiv: 1912.06680 · v1 · submitted 2019-12-13 · 💻 cs.LG · stat.ML


Dota 2 with Large Scale Deep Reinforcement Learning

OpenAI: Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław Dębiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, Rafal Józefowicz, Scott Gray, Catherine Olsson, Jakub Pachocki, Michael Petrov, Henrique P. d.O. Pinto, Jonathan Raiman, Tim Salimans, Jeremy Schlatter, Jonas Schneider, Szymon Sidor, Ilya Sutskever, Jie Tang, Filip Wolski, Susan Zhang


Pith reviewed 2026-05-12 22:13 UTC · model grok-4.3

classification 💻 cs.LG stat.ML

keywords reinforcement learning, self-play, Dota 2, superhuman performance, deep reinforcement learning, distributed training, imperfect information, esports


The pith

Large-scale self-play reinforcement learning produced an AI that defeated the Dota 2 world champions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reports that scaling reinforcement learning with self-play can yield superhuman performance in Dota 2. The game demands planning over long time horizons, acting under imperfect information about the opponent, and handling complex, continuous state-action spaces. Training ran for ten months on a distributed system that learned from batches of roughly two million game frames every two seconds. If the result holds, it indicates that existing reinforcement learning methods, scaled up in data and compute, can master environments with these features. A sympathetic reader would view this as a step toward applying similar techniques to other complex decision problems.

Core claim

The paper reports that OpenAI Five, an AI system trained with deep reinforcement learning through self-play, became the first to defeat the Dota 2 world champions (Team OG), on April 13, 2019. This outcome followed from scaling existing techniques with a distributed training system and tools for continual training, which let the system learn from batches of approximately two million frames every two seconds over ten months. The authors take the result as evidence that self-play reinforcement learning can reach superhuman performance on a task defined by long time horizons, imperfect information, and complex continuous state-action spaces.

What carries the argument

The distributed training system that supports continual self-play reinforcement learning at the scale of two million frames processed every two seconds.
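To put the quoted scale in perspective, the figures can be turned into rough throughput numbers. The sketch below is back-of-envelope arithmetic only. It assumes 30-day months and uninterrupted training at the quoted rate, neither of which the abstract states, and it counts frames consumed by the optimizer, which need not be unique game frames.

```python
# Back-of-envelope throughput from the figures quoted in the abstract.
FRAMES_PER_BATCH = 2_000_000   # "approximately 2 million frames"
SECONDS_PER_BATCH = 2          # "every 2 seconds"
TRAINING_MONTHS = 10           # "for 10 months"
DAYS_PER_MONTH = 30            # assumption: not stated in the source
SECONDS_PER_DAY = 24 * 60 * 60


def frames_per_second() -> float:
    """Average optimizer throughput implied by the batch size and cadence."""
    return FRAMES_PER_BATCH / SECONDS_PER_BATCH


def total_frames_upper_bound() -> float:
    """Frames consumed if training ran at the quoted rate without pause."""
    seconds = TRAINING_MONTHS * DAYS_PER_MONTH * SECONDS_PER_DAY
    return frames_per_second() * seconds


if __name__ == "__main__":
    print(f"{frames_per_second():,.0f} frames/s")
    print(f"{total_frames_upper_bound():.2e} frames (upper bound)")
```

Under these assumptions the rate is about one million frames per second, and the ten-month total is on the order of 10^13 frames.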

If this is right

Self-play reinforcement learning can improve without human demonstrations or external data.
Processing large volumes of frames at high frequency supports mastery of long-horizon planning.
Complex games with imperfect information can be played at superhuman levels through scaled training.
Continuous action spaces in strategy environments yield to the same reinforcement learning approach.
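The first point in the list above, improvement from self-play alone, can be illustrated on a toy game. The sketch below is not the paper's method: it swaps in fictitious play on rock-paper-scissors, where two copies of the same learner repeatedly best-respond to each other's past actions. The payoff matrix, function name, and round count are all illustrative assumptions. No human data enters the loop, yet the empirical strategies drift toward the balanced mixed strategy.

```python
import numpy as np

# Row player's payoff for (rock, paper, scissors); the game is zero-sum,
# so the column player receives the negative of each entry.
PAYOFF = np.array([[0.0, -1.0, 1.0],
                   [1.0, 0.0, -1.0],
                   [-1.0, 1.0, 0.0]])


def fictitious_self_play(rounds: int = 5000):
    """Two learners best-respond to each other's empirical action mix.

    Returns the empirical action frequencies of both players.
    """
    counts_a = np.ones(3)  # action counts for player A (row player)
    counts_b = np.ones(3)  # action counts for player B (column player)
    for _ in range(rounds):
        mix_a = counts_a / counts_a.sum()
        mix_b = counts_b / counts_b.sum()
        # A maximises its expected payoff against B's historical mix.
        action_a = int(np.argmax(PAYOFF @ mix_b))
        # B maximises the negated payoff against A's historical mix.
        action_b = int(np.argmax(-(mix_a @ PAYOFF)))
        counts_a[action_a] += 1
        counts_b[action_b] += 1
    return counts_a / counts_a.sum(), counts_b / counts_b.sum()


if __name__ == "__main__":
    freq_a, freq_b = fictitious_self_play()
    print("player A frequencies:", np.round(freq_a, 3))
    print("player B frequencies:", np.round(freq_b, 3))
```

Both frequency vectors should end up close to one third per action. The toy only shows that an opponent drawn from the learner's own history can supply a training signal; it says nothing about long horizons or continuous action spaces.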

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The training scale used here could transfer to other multiplayer strategy simulations sharing similar time and information challenges.
Real-world tasks involving resource allocation or autonomous planning under uncertainty might benefit from equivalent self-play scaling.
Further increases in training duration or data throughput could uncover strategies beyond current human levels.

Load-bearing premise

That the AI's victories stemmed from genuine strategic superiority instead of match-specific rules, setup differences, or unaccounted advantages.

What would settle it

A controlled rematch series under the same rules where the human team wins the majority of games would show the superhuman performance claim does not hold.
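How many games such a series needs can be made concrete with a one-sided binomial tail. The sketch below is a simplification under stated assumptions: games are treated as independent with a fixed win probability, the null hypothesis is evenly matched teams, and the series lengths in the example are hypothetical rather than taken from the paper. It does not address whether match conditions were equivalent.

```python
from math import comb


def p_value_at_least(wins: int, games: int, p_win: float = 0.5) -> float:
    """Probability of seeing at least `wins` wins in `games` games.

    Computes P(X >= wins) for X ~ Binomial(games, p_win), i.e. how likely
    the record would be if the true per-game win probability were p_win.
    """
    if not 0 <= wins <= games:
        raise ValueError("wins must lie between 0 and games")
    return sum(
        comb(games, k) * p_win**k * (1.0 - p_win) ** (games - k)
        for k in range(wins, games + 1)
    )


if __name__ == "__main__":
    # Hypothetical series lengths, chosen only to show the effect of length.
    for wins, games in [(2, 2), (4, 5), (9, 10)]:
        print(f"{wins}/{games} wins: p = {p_value_at_least(wins, games):.4f}")
```

A clean sweep of a two-game series has probability 0.25 between evenly matched teams, so a short series is weak evidence in either direction; a longer controlled series would separate the hypotheses far more sharply.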

read the original abstract

On April 13th, 2019, OpenAI Five became the first AI system to defeat the world champions at an esports game. The game of Dota 2 presents novel challenges for AI systems such as long time horizons, imperfect information, and complex, continuous state-action spaces, all challenges which will become increasingly central to more capable AI systems. OpenAI Five leveraged existing reinforcement learning techniques, scaled to learn from batches of approximately 2 million frames every 2 seconds. We developed a distributed training system and tools for continual training which allowed us to train OpenAI Five for 10 months. By defeating the Dota 2 world champion (Team OG), OpenAI Five demonstrates that self-play reinforcement learning can achieve superhuman performance on a difficult task.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript describes OpenAI Five, a deep RL agent trained via self-play that defeated the Dota 2 world champion team (Team OG) on April 13, 2019. It details the game's challenges (long horizons, imperfect information, continuous spaces), the distributed training system processing batches of ~2 million frames every 2 seconds, and the 10-month continual training process that produced the result.

Significance. If the central empirical outcome holds under equivalent conditions, the work shows that scaled self-play RL can reach superhuman performance on a complex, real-time, multi-agent task with partial observability. This provides a concrete, falsifiable demonstration of existing RL techniques at extreme scale and has implications for long-horizon decision making in other domains.

major comments (1)

[Results section on the April 13, 2019 match against Team OG] The claim that the April 13, 2019 victory establishes superhuman performance via learned strategy requires explicit verification that match conditions were equivalent to standard human play. The manuscript does not detail (in the results or methods sections describing the exhibition match) whether the agent's observation space was restricted to human-visible information, whether action latency and timing matched human reaction limits, or whether inference-time compute was constrained to human-equivalent levels. Without these controls, alternative explanations for the outcome cannot be ruled out.

minor comments (2)

[Abstract] The abstract is outcome-focused: it states the training scale (batches of approximately 2 million frames every 2 seconds, over 10 months) but says little about key implementation choices; adding one sentence on these would improve context at little cost in length.
[Figures and captions] Several figures (e.g., training curves and architecture diagrams) would benefit from explicit axis labels, units, and direct textual references in the surrounding paragraphs for clarity.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive and detailed review of our manuscript. We address the single major comment below and have revised the manuscript to improve transparency on the exhibition match conditions.

read point-by-point responses

Referee: The claim that the April 13, 2019 victory establishes superhuman performance via learned strategy requires explicit verification that match conditions were equivalent to standard human play. The manuscript does not detail (in the results or methods sections describing the exhibition match) whether the agent's observation space was restricted to human-visible information, whether action latency and timing matched human reaction limits, or whether inference-time compute was constrained to human-equivalent levels. Without these controls, alternative explanations for the outcome cannot be ruled out.

Authors: We agree that the manuscript would benefit from greater explicitness on the exhibition match conditions, as the current text provides only a high-level description of the outcome without a dedicated breakdown in the Results or Methods sections. We have revised the manuscript by adding a new paragraph in the Results section (cross-referenced from the abstract and introduction) that describes the match setup. The agent's observation space used the full game state available via the Dota 2 API rather than being restricted to human-visible information. Action timing followed the model's inference speed at the game's native rate without additional artificial delays to match human reaction limits. Inference-time compute was supplied by the distributed training infrastructure and was not throttled to human-equivalent levels. We maintain that these AI-specific conditions are appropriate for the demonstration and do not invalidate the evidence of learned long-horizon strategy; the agent still had to discover and execute complex, coordinated behaviors to prevail. We have also added a short discussion of alternative explanations and why the outcome supports our central claim. This addresses the referee's concern directly.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical match outcome stands independent of training description

full rationale

The paper reports the training of OpenAI Five via scaled self-play RL over 10 months and its observed defeat of Team OG on April 13, 2019, as direct evidence that self-play RL can reach superhuman performance. No equations, fitted parameters, or derivations are presented that reduce the performance claim to inputs by construction. The central result is an external, falsifiable event (the match outcome) rather than a self-referential definition, renamed pattern, or self-citation chain. The description of the distributed training system and tools is procedural and does not invoke uniqueness theorems or ansatzes that loop back to the target claim. This is a standard empirical systems paper with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the empirical success of scaling self-play RL; no explicit free parameters, axioms, or invented entities are stated in the abstract, though the approach implicitly assumes that self-play generates sufficient signal for superhuman performance.

pith-pipeline@v0.9.0 · 5529 in / 975 out tokens · 73326 ms · 2026-05-12T22:13:04.192641+00:00 · methodology


Forward citations

Cited by 33 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Generative Agents: Interactive Simulacra of Human Behavior
cs.HC 2023-04 accept novelty 8.0

Generative agents with memory streams, reflection, and planning using LLMs exhibit believable individual and emergent social behaviors in a simulated town.
Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling
cs.LG 2026-05 unverdicted novelty 7.0

DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.
ASH: Agents that Self-Hone via Embodied Learning
cs.AI 2026-05 unverdicted novelty 7.0

ASH reaches 11.2/12 milestones in Pokemon Emerald and 9.9/12 in Zelda by self-improving via an IDM trained on its own trajectories to label internet video, while baselines plateau at roughly 6/12.
Controllability in preference-conditioned multi-objective reinforcement learning
cs.LG 2026-05 unverdicted novelty 7.0

Standard MORL metrics do not measure whether preference inputs reliably control agent behavior, so a new controllability metric is introduced to restore the link between user intent and agent output.
Learning to Communicate Locally for Large-Scale Multi-Agent Pathfinding
cs.AI 2026-05 unverdicted novelty 7.0

LC-MAPF uses multi-round local communication between neighboring agents in a pre-trained model to outperform prior learning-based MAPF solvers on diverse unseen scenarios while preserving scalability.
Does Synthetic Data Help? Empirical Evidence from Deep Learning Time Series Forecasters
cs.LG 2026-05 accept novelty 7.0

Synthetic data augmentation helps channel-mixing time series models but degrades channel-independent ones, with reliable gains only from seasonal-trend generators and gradual schedules in low-resource settings.
SOPE: Stabilizing Off-Policy Evaluation for Online RL with Prior Data
cs.LG 2026-05 unverdicted novelty 7.0

SOPE uses an actor-aligned OPE signal on a held-out validation split to dynamically stop offline stabilization phases in online RL, improving performance up to 45.6% and cutting TFLOPs up to 22x on 25 Minari tasks.
Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks
cs.AI 2026-04 unverdicted novelty 7.0

COSPLAY co-evolves an LLM decision agent with a skill bank agent to improve long-horizon game performance, reporting over 25.1% average reward gains versus frontier LLM baselines on single-player benchmarks.
InfoChess: A Game of Adversarial Inference and a Laboratory for Quantifiable Information Control
cs.MA 2026-04 unverdicted novelty 7.0

InfoChess proposes a symmetric adversarial game focused purely on information control and probabilistic king-location inference, with RL agents outperforming heuristic baselines and gameplay dissected via belief entro...
Territory Paint Wars: Diagnosing and Mitigating Failure Modes in Competitive Multi-Agent PPO
cs.LG 2026-04 conditional novelty 7.0

PPO in a new competitive game fails due to five implementation bugs and then competitive overfitting where self-play stays near 50% but generalization drops to 21.6%; mixing 20% random opponents restores generalizatio...
Voyager: An Open-Ended Embodied Agent with Large Language Models
cs.AI 2023-05 unverdicted novelty 7.0

Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more uniq...
Learning to Communicate Locally for Large-Scale Multi-Agent Pathfinding
cs.AI 2026-05 unverdicted novelty 6.0

LC-MAPF is a decentralized MAPF solver that uses a learnable multi-round communication module among nearby agents to outperform prior IL and RL methods while preserving scalability.
QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL
cs.LG 2026-05 unverdicted novelty 6.0

QHyer replaces return-to-go with a state-conditioned Q-estimator and adds a gated hybrid attention-mamba backbone to achieve state-of-the-art performance in offline goal-conditioned RL on both Markovian and non-Markov...
QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL
cs.LG 2026-05 unverdicted novelty 6.0

QHyer achieves state-of-the-art results in offline goal-conditioned RL by replacing return-to-go with a state-conditioned Q-estimator and introducing a gated hybrid attention-mamba backbone for content-adaptive histor...
Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 6.0

Odysseus adapts PPO with a turn-level critic and leverages pretrained VLM action priors to train agents achieving at least 3x average game progress over frontier models in long-horizon Super Mario Land.
When Errors Can Be Beneficial: A Categorization of Imperfect Rewards for Policy Gradient
cs.LG 2026-04 unverdicted novelty 6.0

Certain errors in proxy rewards for policy gradient methods can be benign or beneficial by preventing policies from stalling on outputs with mediocre ground truth rewards, enabling improved RLHF metrics and reward des...
Biased Dreams: Limitations to Epistemic Uncertainty Quantification in Latent Space Models
cs.LG 2026-04 unverdicted novelty 6.0

Latent transitions in models like Dreamer are biased toward dense regions, creating attractors that hide true dynamics discrepancies and cause epistemic uncertainty to be unreliable while overestimating rewards.
CivBench: Progress-Based Evaluation for LLMs' Strategic Decision-Making in Civilization V
cs.AI 2026-04 unverdicted novelty 6.0

CivBench trains models on turn-level states in Civilization V to predict victory probabilities, providing a progress-based evaluation of LLM strategic capabilities across 307 games with 7 models.
Vocabulary Dropout for Curriculum Diversity in LLM Co-Evolution
cs.CL 2026-04 unverdicted novelty 6.0

Vocabulary dropout prevents diversity collapse in LLM co-evolution by masking proposer logits, yielding average +4.4 point solver gains on mathematical reasoning benchmarks at 8B scale.
Heterogeneous Self-Play for Realistic Highway Traffic Simulation
cs.AI 2026-03 accept novelty 6.0

PHASE uses heterogeneous self-play and context-conditioned policies to achieve realistic, zero-shot highway traffic simulation that outperforms traditional rule-based and self-play models on real-world datasets.
Muon is Scalable for LLM Training
cs.LG 2025-02 unverdicted novelty 6.0

Muon optimizer with weight decay and update scaling achieves ~2x efficiency over AdamW for large LLMs, shown via the Moonlight 3B/16B MoE model trained on 5.7T tokens.
TD-MPC2: Scalable, Robust World Models for Continuous Control
cs.LG 2023-10 conditional novelty 6.0

TD-MPC2 scales an implicit world-model RL method to a 317M-parameter agent that masters 80 tasks across four domains with a single hyperparameter configuration.
Language Models (Mostly) Know What They Know
cs.CL 2022-07 unverdicted novelty 6.0

Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
A General Language Assistant as a Laboratory for Alignment
cs.CL 2021-12 conditional novelty 6.0

Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
Isaac Gym: High Performance GPU-Based Physics Simulation For Robot Learning
cs.RO 2021-08 conditional novelty 6.0

Isaac Gym achieves 2-3 orders of magnitude faster robot policy training by keeping physics simulation and PyTorch-based RL entirely on GPU with direct buffer sharing.
Data-Augmented Game Starts for Accelerating Self-Play Exploration in Imperfect Information Games
cs.LG 2026-05 unverdicted novelty 5.0

DAGS initializes policy-gradient self-play from human-derived intermediate states to reduce exploitability in challenging imperfect-information games, with a multi-task flag fix for resulting bias and new benchmark en...
Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse
cs.CV 2026-05 unverdicted novelty 5.0

The paper organizes research on generalist game AI into Dataset, Model, Harness, and Benchmark pillars and charts a five-level progression from single-game mastery to agents that create and live inside game multiverses.
On Training Large Language Models for Long-Horizon Tasks: An Empirical Study of Horizon Length
cs.AI 2026-05 unverdicted novelty 5.0

Longer action horizons bottleneck LLM agent training through instability, but training with reduced horizons stabilizes learning and enables better generalization to longer horizons.
A High-Throughput Compute-Efficient POMDP Hide-And-Seek-Engine (HASE) for Multi-Agent Operations
cs.MA 2026-04 unverdicted novelty 5.0

A C++ Dec-POMDP simulator using data-oriented design and zero-copy PyTorch integration achieves up to 33 million steps per second on a 16-core CPU, enabling multi-agent policy training in minutes with PPO, DQN, and SAC.
RAMP: Hybrid DRL for Online Learning of Numeric Action Models
cs.AI 2026-04 unverdicted novelty 5.0

RAMP learns numeric action models online via a DRL-planning feedback loop and outperforms PPO on IPC numeric domains in solvability and plan quality.
Gymnasium: A Standard Interface for Reinforcement Learning Environments
cs.LG 2024-07 accept novelty 5.0

Gymnasium establishes a standardized API for RL environments to improve interoperability, reproducibility, and ease of development in reinforcement learning.
Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse
cs.CV 2026-05 unverdicted novelty 3.0

This work traces four eras of generalist game players across dataset, model, harness, and benchmark pillars and charts a five-level roadmap ending in agents that create and evolve within game multiverses.
Benefits of Low-Cost Bio-Inspiration in the Age of Overparametrization
cs.RO 2026-04 unverdicted novelty 3.0

Shallow MLPs and dense CPGs outperform deeper MLPs and Actor-Critic RL in bounded robot control tasks with limited proprioception, with a Parameter Impact metric indicating extra RL parameters yield no performance gai...

Reference graph

Works this paper leans on

77 extracted references · 77 canonical work pages · cited by 30 Pith papers · 8 internal anchors

[1]

TD-Gammon, a self-teaching backgammon program, achieves master-level play

Tesauro, G. TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural computation6, 215–219 (1994)

work page 1994
[2]

Campbell, M., Hoane Jr., A. J. & Hsu, F.-h. Deep Blue.Artif. Intell.134, 57–83. issn: 0004- 3702 (Jan. 2002)

work page 2002
[3]

Playing Atari with Deep Reinforcement Learning

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D. & Riedmiller, M. Playing atari with deep reinforcement learning.arXiv preprint arXiv:1312.5602(2013)

work page internal anchor Pith review Pith/arXiv arXiv 2013
[4]

Mastering the game of Go with deep neural networks and tree search.nature 529, 484 (2016)

Silver,D.,Huang,A.,Maddison,C.J.,Guez,A.,Sifre,L.,VanDenDriessche,G.,Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M.,et al. Mastering the game of Go with deep neural networks and tree search.nature 529, 484 (2016)

work page 2016
[5]

Learning Dexterity https : / / openai

OpenAI. Learning Dexterity https : / / openai . com / blog / learning - dexterity/. [Online; accessed 28-May-2019]. 2018

work page 2019
[6]

& Socher, R.A Deep Reinforced Model for Abstractive Summarization

Paulus, R., Xiong, C. & Socher, R.A Deep Reinforced Model for Abstractive Summarization

work page
[7]

arXiv: 1705.04304 [cs.CL]

work page arXiv
[8]

M., Mathieu, M., Dudzik, A., Chung, J., Choi, D

Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu, M., Dudzik, A., Chung, J., Choi, D. H., Powell, R., Ewalds, T., Georgiev, P.,et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning.Nature, 1–5 (2019)

work page 2019
[9]

H., Codel, C., Hofmann, K., Houghton, B., Kuno, N., Milani, S., Mohanty, S

Guss, W. H., Codel, C., Hofmann, K., Houghton, B., Kuno, N., Milani, S., Mohanty, S. P., Liebana, D. P., Salakhutdinov, R., Topin, N., Veloso, M. & Wang, P. The MineRL Competition on Sample Eﬃcient Reinforcement Learning using Human Priors.CoRR abs/1904.10079. arXiv: 1904.10079. <http://arxiv.org/abs/1904.10079> (2019)

work page arXiv 1904
[10]

Dota 2 — Wikipedia, The Free Encyclopediahttps://en.wikipedia

Wikipediacontributors. Dota 2 — Wikipedia, The Free Encyclopediahttps://en.wikipedia. org/w/index.php?title=Dota_2&oldid=913733447. [Online; accessed 9-September- 2019]. 2019

work page 2019
[11]

The International 2018 — Wikipedia, The Free Encyclopediahttps: //en.wikipedia.org/w/index.php?title=The_International_2018&oldid= 912865272

Wikipedia contributors. The International 2018 — Wikipedia, The Free Encyclopediahttps: //en.wikipedia.org/w/index.php?title=The_International_2018&oldid= 912865272. [Online; accessed 9-September-2019]. 2019

work page 2018
[12]

Allis, L. V. Searching for solutions in games and artiﬁcial intelligencein (1994)

work page 1994
[13]

Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T.,et al.A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play.Science 362, 1140–1144 (2018)

work page 2018
[14]

A., Schmidhuber, J

Gers, F. A., Schmidhuber, J. & Cummins, F. Learning to forget: Continual prediction with LSTM (1999)

work page 1999
[15]

Proximal Policy Optimization Algorithms

Schulman, J., Wolski, F., Dhariwal, P., Radford, A. & Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347(2017)

work page internal anchor Pith review Pith/arXiv arXiv 2017
[16]

Konda, V. R. & Tsitsiklis, J. N. Actor-critic algorithms in Advances in neural information processing systems(2000), 1008–1014

work page 2000
[17]

Asynchronous methods for deep reinforcement learningin International conference on ma- chine learning(2016), 1928–1937

Mnih,V.,Badia,A.P.,Mirza,M.,Graves,A.,Lillicrap,T.,Harley,T.,Silver,D.&Kavukcuoglu, K. Asynchronous methods for deep reinforcement learningin International conference on ma- chine learning(2016), 1928–1937. 19

work page 2016
[18]

Schulman, J., Moritz, P., Levine, S., Jordan, M. I. & Abbeel, P. High-Dimensional Continuous Control Using Generalized Advantage Estimation.CoRR abs/1506.02438 (2016)

work page internal anchor Pith review Pith/arXiv arXiv 2016
[19]

Distributed Prioritized Experience Replay

Horgan, D., Quan, J., Budden, D., Barth-Maron, G., Hessel, M., van Hasselt, H. & Silver, D. Distributed Prioritized Experience Replay.CoRR abs/1803.00933. arXiv: 1803.00933. <http://arxiv.org/abs/1803.00933> (2018)

work page Pith review arXiv 2018
[20]

NVIDIA Collective Communications Library (NCCL) https : / / developer

NVIDIA. NVIDIA Collective Communications Library (NCCL) https : / / developer . nvidia.com/nccl. [Online; accessed 9-September-2019]. 2019

work page 2019
[21]

Adam: A Method for Stochastic Optimization

Kingma,D.P.&Ba,J.Adam:Amethodforstochasticoptimization. arXiv preprint arXiv:1412.6980 (2014)

work page internal anchor Pith review Pith/arXiv arXiv 2014
[22]

Williams, R. J. & Peng, J. An Eﬃcient Gradient-Based Algorithm for On-Line Training of Recurrent Network Trajectories.Neural Computation2, 490–501 (1990)

work page 1990
[23]

& Kingma, D

Gray, S., Radford, A. & Kingma, D. P.GPU Kernels for Block-Sparse Weights2017

work page
[24]

Chen, T., Goodfellow, I. J. & Shlens, J.Net2Net: Accelerating Learning via Knowledge Transfer in 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings(2016). <http://arxiv.org/abs/ 1511.05641>

work page arXiv 2016
[25]

& Zhang, L.Solving Rubik’s Cube with a Robot Hand

OpenAI, Akkaya, I., Andrychowicz, M., Chociej, M., Litwin, M., McGrew, B., Petron, A., Paino, A., Plappert, M., Powell, G., Ribas, R., Schneider, J., Tezak, N., Tworek, J., Welinder, P., Weng, L., Yuan, Q., Zaremba, W. & Zhang, L.Solving Rubik’s Cube with a Robot Hand

work page
[26]

arXiv: 1910.07113 [cs.LG]

work page internal anchor Pith review arXiv 1910
[27]

Dalvi, N., Domingos, P., Mausam, Sanghai, S. & Verma, D.Adversarial Classiﬁcationin Pro- ceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, Seattle, WA, USA, 2004), 99–108.isbn: 1-58113-888-1. doi:10.1145/ 1014052.1014066. <http://doi.acm.org/10.1145/1014052.1014066>

work page doi:10.1145/1014052.1014066 2004
[28]

& Singh, K

Jain, A., Bansal, R., Kumar, A. & Singh, K. A comparative study of visual and auditory reaction times on the basis of gender and physical activity levels of medical ﬁrst year students. International journal of applied and basic medical research5, 125–127 (2015)

work page 2015
[29]

& Graepel, T.TrueSkill: a Bayesian skill rating systemin Advances in neural information processing systems(2007), 569–576

Herbrich, R., Minka, T. & Graepel, T.TrueSkill: a Bayesian skill rating systemin Advances in neural information processing systems(2007), 569–576

work page 2007
[30]

An Empirical Model of Large-Batch Training

McCandlish, S., Kaplan, J., Amodei, D. & Team, O. D. An empirical model of large-batch training. arXiv preprint arXiv:1812.06162(2018)

work page Pith review arXiv 2018
[31]

& Schulman, J

Cobbe, K., Klimov, O., Hesse, C., Kim, T. & Schulman, J. Quantifying Generalization in Reinforcement Learning.CoRR abs/1812.02341. arXiv: 1812.02341. <http://arxiv. org/abs/1812.02341> (2018)

work page arXiv 2018
[32]

M., Dunning, I., Marris, L., Lever, G., Castaneda, A

Jaderberg, M., Czarnecki, W. M., Dunning, I., Marris, L., Lever, G., Castaneda, A. G., Beattie, C., Rabinowitz, N. C., Morcos, A. S., Ruderman, A.,et al.Human-level performance in ﬁrst- person multiplayer games with population-based deep reinforcement learning.arXiv preprint arXiv:1807.01281 (2018)

work page arXiv 2018
[33]

& Bowling, M

Moravčík, M., Schmid, M., Burch, N., Lisý, V., Morrill, D., Bard, N., Davis, T., Waugh, K., Johanson, M. & Bowling, M. Deepstack: Expert-level artiﬁcial intelligence in heads-up no-limit poker. Science 356, 508–513 (2017). 20

work page 2017
[34]

& Szafron, D

Schaeﬀer, J., Culberson, J., Treloar, N., Knight, B., Lu, P. & Szafron, D. A world championship caliber checkers program.Artiﬁcial Intelligence53, 273–289. issn: 0004-3702 (1992)

work page 1992
[35]

Emergent Complexity via Multi-Agent Competition

Bansal, T., Pachocki, J., Sidor, S., Sutskever, I. & Mordatch, I. Emergent complexity via multi-agent competition.arXiv preprint arXiv:1710.03748(2017)

work page Pith review arXiv 2017
[36]

Sukhbaatar,S.,Lin,Z.,Kostrikov,I.,Synnaeve,G.,Szlam,A.&Fergus,R.Intrinsicmotivation and automatic curricula via asymmetric self-play.arXiv preprint arXiv:1703.05407(2017)

work page arXiv 2017
[37]

Brown, G. W. inActivity Analysis of Production and Allocation(ed Koopmans, T. C.) (Wiley, New York, 1951)

work page 1951
[38]

Deep reinforcement learning from self-play in imperfect-information games.arXiv preprint arXiv:1603.01121,

Heinrich, J. & Silver, D. Deep Reinforcement Learning from Self-Play in Imperfect-Information Games. CoRR abs/1603.01121. arXiv: 1603.01121 (2016)

work page arXiv 2016
[39]

Mastering the game of go without human knowledge

Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A.,et al. Mastering the game of go without human knowledge. Nature 550, 354 (2017)

work page 2017
[40]

& Barber, D.Thinking fast and slow with deep learning and tree search in Advances in Neural Information Processing Systems(2017), 5360–5370

Anthony, T., Tian, Z. & Barber, D.Thinking fast and slow with deep learning and tree search in Advances in Neural Information Processing Systems(2017), 5360–5370

work page 2017
[41]

& Sandholm, T

Brown, N. & Sandholm, T. Superhuman AI for multiplayer poker.Science, eaay2400 (2019)

work page 2019
[42]

Watkins, C. J. & Dayan, P. Q-learning.Machine learning8, 279–292 (1992)

work page 1992
[43]

D., Narasimhan, K., Saeedi, A

Kulkarni, T. D., Narasimhan, K., Saeedi, A. & Tenenbaum, J. Hierarchical deep reinforce- ment learning: Integrating temporal abstraction and intrinsic motivationin Advances in neural information processing systems(2016), 3675–3683

work page 2016
[44]

Exploration by Random Network Distillation

Burda, Y., Edwards, H., Storkey, A. & Klimov, O. Exploration by random network distillation. arXiv preprint arXiv:1810.12894(2018)

work page Pith review arXiv 2018
[45]

Ecoﬀet, A., Huizinga, J., Lehman, J., Stanley, K. O. & Clune, J. Montezuma’s revenge solved by go-explore, a new algorithm for hard-exploration problems (sets records on pitfall too). Uber Engineering Blog, Nov(2018)

work page 2018
[46]

Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y. & He, K. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour (June 2017)

work page 2017
[47]

& Ginsburg, B

You, Y., Gitman, I. & Ginsburg, B. Scaling SGD Batch Size to 32K for ImageNet Training (Aug. 2017)

[48]

You, Y., Zhang, Z., Hsieh, C.-J., Demmel, J. & Keutzer, K. ImageNet Training in Minutes in Proceedings of the 47th International Conference on Parallel Processing (ACM, Eugene, OR, USA, 2018), 1:1–1:10. isbn: 978-1-4503-6510-9. doi:10.1145/3225058.3225069. <http://doi.acm.org/10.1145/3225058.3225069>

[49]

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D. & Kavukcuoglu, K. Asynchronous Methods for Deep Reinforcement Learning in Proceedings of The 33rd International Conference on Machine Learning (eds Balcan, M. F. & Weinberger, K. Q.) 48 (PMLR, New York, New York, USA, June 2016), 1928–1937. <http://proceedings.mlr.press/v48/mniha16.html>

[50]

Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dunning, I., et al. Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561 (2018).

[51]

Fahlman, S. E. & Lebiere, C. in (ed Touretzky, D. S.) 524–532 (Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1990). isbn: 1-55860-100-7. <http://dl.acm.org/citation.cfm?id=109230.107380>

[52]

Wang, Y., Ramanan, D. & Hebert, M. Growing a Brain: Fine-Tuning by Increasing Model Capacity in 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017 (2017), 3029–3038. doi:10.1109/CVPR.2017.323. <https://doi.org/10.1109/CVPR.2017.323>

[53]

Czarnecki, W. M., Jayakumar, S. M., Jaderberg, M., Hasenclever, L., Teh, Y. W., Osindero, S., Heess, N. & Pascanu, R. Mix&Match - Agent Curricula for Reinforcement Learning. CoRR abs/1806.01780. arXiv: 1806.01780. <http://arxiv.org/abs/1806.01780> (2018)

[54]

Li, Z. & Hoiem, D. Learning without Forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence 40, 2935–2947. issn: 0162-8828 (Dec. 2018)

[55]

Rusu, A. A., Rabinowitz, N. C., Desjardins, G., Soyer, H., Kirkpatrick, J., Kavukcuoglu, K., Pascanu, R. & Hadsell, R. Progressive Neural Networks. arXiv preprint arXiv:1606.04671 (2016)

[56]

Hinton, G., Vinyals, O. & Dean, J. Distilling the Knowledge in a Neural Network in NIPS Deep Learning and Representation Learning Workshop (2015). <http://arxiv.org/abs/1503.02531>

[57]

Ross, S., Gordon, G. & Bagnell, D. A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (eds Gordon, G., Dunson, D. & Dudík, M.) 15 (PMLR, Fort Lauderdale, FL, USA, Apr. 2011), 627–635. <http://proceedings.mlr.press/v15/ros...

[58]

Levine, S. & Koltun, V. Guided Policy Search in Proceedings of the 30th International Conference on International Conference on Machine Learning - Volume 28 (JMLR.org, Atlanta, GA, USA, 2013), III-1–III-9. <http://dl.acm.org/citation.cfm?id=3042817.3042937>

[59]

OpenAI. AI and Compute https://openai.com/blog/ai-and-compute/. [Online; accessed 9-Sept-2019]. 2018

[60]

Ng, A. Y., Harada, D. & Russell, S. Policy invariance under reward transformations: Theory and application to reward shaping in Proceedings of the Sixteenth International Conference on Machine Learning (Morgan Kaufmann, 1999), 278–287

[61]

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J. & Zaremba, W. OpenAI Gym. CoRR abs/1606.01540. arXiv: 1606.01540. <http://arxiv.org/abs/1606.01540> (2016)

[62]

Balduzzi, D., Garnelo, M., Bachrach, Y., Czarnecki, W. M., Pérolat, J., Jaderberg, M. & Graepel, T. Open-ended Learning in Symmetric Zero-sum Games. CoRR abs/1901.08106. arXiv: 1901.08106. <http://arxiv.org/abs/1901.08106> (2019)

[63]

Williams, R. J. & Peng, J. Function optimization using connectionist reinforcement learning algorithms. Connection Science 3, 241–268 (1991).

Appendix Table of Contents

A Compute Usage
B Surgery
C Hyperparameters
D Evaluating agents’ understanding
D.1 Understanding OpenAI Five Finals
D...

[64]

Predictions by different heroes differ as they specifically predict whether they will participate in bringing a given building down. Predictions should not be read as calibrated probabilities, because they are trained with a discount factor. See Figure 11a and Figure 11b for descriptions of the events corresponding to two of these buildings. We also looked at ...

[65]

Ability Builds: Each hero has four spell abilities. Over the course of the game, a player can choose which of these to “level up,” making that particular skill more powerful. For these, in evaluation games we follow a fixed schedule (improve ability X at level 1, then Y at level 2, then Z at level 3, etc.). In training, we randomize around this fixed script s...

[66]

Item Purchasing: As a hero gains gold, they can purchase items. We divide items into consumables (items which are consumed for a one-time benefit such as healing) and everything else. For consumables, we use simple logic which ensures that the agent always has a certain set of consumables; when the agent uses one up, we then purchase a new one. After a...

[67]

Item Swap: Each player can choose 6 of the items they hold to keep in their “inventory” where they are actively usable, leaving up to 3 inactive items in their “backpack.” Instead of letting the model control this, we use a heuristic which approximately keeps the most valuable items in the inventory
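The item-swap heuristic above can be sketched as a simple ranking. This is a minimal illustration, not the authors' implementation: the text only says the heuristic "approximately" keeps the most valuable items active, so the function name `split_items` and the per-item score in `value` (for example gold cost) are assumptions made for the example.

```python
from typing import Dict, List, Tuple

INVENTORY_SLOTS = 6  # actively usable slots, per the text
BACKPACK_SLOTS = 3   # inactive slots, per the text


def split_items(held: List[str], value: Dict[str, float]) -> Tuple[List[str], List[str]]:
    """Place the highest-scoring items in the inventory, the rest in the backpack.

    `value` is a hypothetical score per item name; the source does not
    say how item value is measured.
    """
    if len(held) > INVENTORY_SLOTS + BACKPACK_SLOTS:
        raise ValueError("a hero can hold at most 9 items")
    # sorted() is stable, so equally valued items keep their original order.
    ranked = sorted(held, key=lambda name: value[name], reverse=True)
    return ranked[:INVENTORY_SLOTS], ranked[INVENTORY_SLOTS:]
```

Calling `split_items` after every purchase would keep the cheapest items in the backpack; a real heuristic would likely also account for cooldowns and situational usefulness, which this sketch ignores.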

[68]

Courier Control: Each side has a single “Courier” unit which cannot fight but can carry items from the shop to the player who purchased them. We use state-machine-based logic to control this character.

G Reward Weights

Our agent’s ultimate goal is to win the game. In order to simplify the credit assignment problem (the task of figuring out which of the ...

[69]

The primary action is chosen via a linear projection over the available actions. (Diagram labels: FC, softmax, sample/argmax, Chosen Action ID.)
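The primary-action head described above (linear projection, softmax, then sample or argmax) can be sketched in plain Python. This is an illustrative sketch under assumptions: the names `choose_action`, `hidden`, `weights`, `bias`, and `available` are hypothetical, and restricting the softmax to the available actions is one plausible reading of "over the available actions".

```python
import math
import random
from typing import List, Optional, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def choose_action(
    hidden: Sequence[float],
    weights: Sequence[Sequence[float]],
    bias: Sequence[float],
    available: Sequence[bool],
    greedy: bool = False,
    rng: Optional[random.Random] = None,
) -> int:
    """Return the ID of the chosen primary action.

    hidden:    LSTM output vector (assumed input to the head).
    weights:   one row per action; bias: one entry per action.
    available: True for actions that are legal right now.
    """
    ids = [i for i, ok in enumerate(available) if ok]
    if not ids:
        raise ValueError("no action is available")
    # Linear projection: one logit per available action.
    logits = [sum(w * h for w, h in zip(weights[i], hidden)) + bias[i] for i in ids]
    probs = softmax(logits)
    if greedy:
        # argmax branch
        return ids[max(range(len(ids)), key=probs.__getitem__)]
    if rng is None:
        rng = random.Random()
    # sampling branch
    return rng.choices(ids, weights=probs, k=1)[0]
```

Sampling would typically be used where exploration matters and argmax where deterministic play is wanted; the source only lists both options in its diagram.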

[70]

Other action parameters are linear projections from the LSTM. (Diagram labels: FC, softmax, sample/argmax, Offset X, Offset Y; Embedding, multiply, sigmoid, dot product, softmax, sample/argmax, Target Unit.)

[71]

The target unit is chosen via an attention mechanism over the available units. The unit keys are masked by a learned per-action mask based on the sampled action.

Figure 18: The hidden state of the LSTM and unit embeddings are used to parameterize the actions.

each responsible for the observations and actions of one of the heroes in the team. At a hi...
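The target-unit attention with a per-action mask can be sketched as follows. The wiring is an assumption inferred from the diagram labels (embedding, sigmoid, multiply, dot product, softmax); the names `target_unit_probs`, `query`, `unit_keys`, and `action_mask_logits` are hypothetical and not taken from the paper.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def target_unit_probs(
    query: Sequence[float],
    unit_keys: Sequence[Sequence[float]],
    action_mask_logits: Sequence[float],
) -> List[float]:
    """Attention distribution over units for one sampled action.

    query:              vector derived from the LSTM state (assumed).
    unit_keys:          one key vector per available unit.
    action_mask_logits: learned vector for the sampled action; its
                        sigmoid gates each key dimension (assumed).
    """
    gate = [sigmoid(m) for m in action_mask_logits]
    # Gate each key dimension, then take the dot product with the query.
    scores = [
        sum(q * g * k for q, g, k in zip(query, gate, key))
        for key in unit_keys
    ]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

With this gating, two different actions can attend to different units even when the query and keys are identical, because each action suppresses a different subset of key dimensions.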

[72]

In practice the Observation Processing portion of the model is also cloned 5 times for the five different heroes. The weights are identical and the observations are nearly identical, but there are a handful of derived features which are different for each replica (such as “distance to me” for each unit; see Table 4 for the list of observations that vary). T...

[73]

The “Flattened Observation” and “Hero Embedding” are processed before being sent into the LSTM (see Figure 17b) by a fully-connected layer and a “cross-hero pool” operation, to ensure that the non-identical observations can be used by other members of the team if needed
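One way to picture a "cross-hero pool" is a max-pool over a slice of each hero's feature vector, with the pooled slice shared by all five replicas. The sketch below is an assumption about the mechanism, not a description confirmed by this excerpt: the function name `cross_hero_pool`, the use of max-pooling, and the `pooled_fraction` default are all hypothetical.

```python
from typing import List, Sequence


def cross_hero_pool(
    features: Sequence[Sequence[float]],
    pooled_fraction: float = 0.25,  # assumed share of the vector that is pooled
) -> List[List[float]]:
    """Share part of each hero's features with the whole team.

    features: one vector per hero, all of equal length.
    Returns one vector per hero in which the first `pooled_fraction`
    of entries is replaced by the element-wise max across heroes.
    """
    width = len(features[0])
    if any(len(f) != width for f in features):
        raise ValueError("all hero vectors must have the same length")
    k = int(width * pooled_fraction)
    # Element-wise max across heroes for the shared slice.
    pooled = [max(f[j] for f in features) for j in range(k)]
    # Each hero keeps its own remaining features.
    return [pooled + list(f[k:]) for f in features]
```

Under this reading, a feature seen only by one replica (such as a per-hero derived observation) can reach teammates through the pooled slice, while the unpooled remainder stays hero-specific.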

[74]

The “Unit Embeddings” from the observation processing are carried along beside the LSTM, and used by the action heads to choose a unit to target (see Figure 18). In addition to the action logits, the value function is computed as another linear projection of the LSTM state. Thus our value function and action policy share a network and share gradients. ...

[75]

If a long and very specific series of actions must be taken by the agent in order to randomly stumble on a reward, and any deviation from that sequence results in negative advantage, then the longer this series, the less likely the agent is to explore this skill thoroughly and learn to use it when necessary.

[76]

If an environment is highly repetitive, then the agent is more likely to find and stay in a local minimum.

[77]

In order to be robust to various strategies humans employ, our agents must have encountered a wide variety of situations in training. This parallels the success of domain randomization in transferring policies from simulation to real-world robotics [5]. We randomize many parts of the environment:

• Initial State: In our rollout games, heroes start with ran...
