Dota 2 with Large Scale Deep Reinforcement Learning
Pith reviewed 2026-05-12 22:13 UTC · model grok-4.3
The pith
Large-scale self-play reinforcement learning produced an AI that defeated the Dota 2 world champions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper reports that OpenAI Five, trained with deep reinforcement learning through self-play, became the first AI system to defeat the Dota 2 world champions. This outcome followed from scaling existing techniques with a distributed training system and tools for continual training, which let the system learn from batches of approximately two million frames every two seconds over ten months. The result shows that self-play reinforcement learning can reach superhuman performance on a task defined by long time horizons, imperfect information, and complex, continuous state-action spaces.
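To make that scale concrete, a back-of-envelope calculation from the two figures quoted above (the uninterrupted-training and 30-day-month assumptions are ours, so treat the total as an upper bound):

```python
# Rough training throughput implied by the reported batch size and cadence.
frames_per_batch = 2_000_000         # ~2 million frames per batch (reported)
seconds_per_batch = 2                # one batch every ~2 seconds (reported)
training_months = 10                 # reported training duration
seconds_per_month = 30 * 24 * 3600   # ~30-day months (assumption)

frames_per_second = frames_per_batch / seconds_per_batch
total_frames = frames_per_second * training_months * seconds_per_month

print(f"{frames_per_second:.0e} frames/s")  # -> 1e+06 frames/s
print(f"{total_frames:.1e} frames total")   # -> ~2.6e+13, assuming no downtime
```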
What carries the argument
The distributed training system that sustains continual self-play reinforcement learning at a rate of roughly two million frames every two seconds.
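The paper's system code is not reproduced here, but the pattern it describes is the familiar actor/learner split. A minimal single-machine sketch under our own assumptions (the names `rollout_worker` and `Learner`, the queue transport, and the trivial `update` standing in for the paper's PPO-based optimizer are all illustrative):

```python
import queue
import random
import threading

BATCH_FRAMES = 8  # toy stand-in for the ~2M-frame batches discussed above

def rollout_worker(params, out_q, n_batches):
    """Self-play actor: plays games with the current policy and emits frame batches."""
    for _ in range(n_batches):
        # Each "frame" here is a dummy (observation, policy-version) pair.
        batch = [(random.random(), params["version"]) for _ in range(BATCH_FRAMES)]
        out_q.put(batch)

class Learner:
    """Consumes batches and advances the policy (a real learner applies PPO gradients)."""
    def __init__(self, params):
        self.params = params

    def update(self, batch):
        self.params["version"] += 1

params = {"version": 0}  # shared policy parameters (here, just a version counter)
batches = queue.Queue()
actors = [threading.Thread(target=rollout_worker, args=(params, batches, 3))
          for _ in range(4)]
for t in actors:
    t.start()

learner = Learner(params)
for _ in range(12):      # 4 actors x 3 batches each
    learner.update(batches.get())
for t in actors:
    t.join()
print("policy version:", params["version"])  # -> 12
```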
If this is right
- Self-play reinforcement learning can improve without human demonstrations or external data.
- Processing large volumes of frames at high frequency supports mastery of long-horizon planning.
- Complex games with imperfect information can be mastered at superhuman levels through scaled training.
- Continuous action spaces in strategy environments yield to the same reinforcement learning approach.
Where Pith is reading between the lines
- The training scale used here could transfer to other multiplayer strategy simulations sharing similar time and information challenges.
- Real-world tasks involving resource allocation or autonomous planning under uncertainty might benefit from equivalent self-play scaling.
- Further increases in training duration or data throughput could uncover strategies beyond current human levels.
Load-bearing premise
That the AI's victories stemmed from genuine strategic superiority rather than from match-specific rules, setup differences, or unaccounted-for advantages.
What would settle it
A controlled rematch series under the same rules: if the human team won the majority of games, the superhuman-performance claim would not hold.
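How long such a series should run is worth a moment of arithmetic (ours, not the paper's): even a genuinely superhuman agent can drop a short series, so a human majority only becomes strong evidence as the series grows.

```python
from math import comb

def p_majority(p_win, n_games):
    """P(win a strict majority of n_games) for i.i.d. games with per-game win rate p_win."""
    need = n_games // 2 + 1
    return sum(comb(n_games, k) * p_win**k * (1 - p_win)**(n_games - k)
               for k in range(need, n_games + 1))

# Hypothetical 70% per-game edge for the agent: how often does it take the series?
for n in (3, 7, 21):
    print(n, round(p_majority(0.70, n), 3))
# -> roughly 0.78, 0.87, 0.97: a best-of-3 upset proves little,
#    while a human majority over 21 games would be hard to dismiss.
```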
read the original abstract
On April 13th, 2019, OpenAI Five became the first AI system to defeat the world champions at an esports game. The game of Dota 2 presents novel challenges for AI systems such as long time horizons, imperfect information, and complex, continuous state-action spaces, all challenges which will become increasingly central to more capable AI systems. OpenAI Five leveraged existing reinforcement learning techniques, scaled to learn from batches of approximately 2 million frames every 2 seconds. We developed a distributed training system and tools for continual training which allowed us to train OpenAI Five for 10 months. By defeating the Dota 2 world champion (Team OG), OpenAI Five demonstrates that self-play reinforcement learning can achieve superhuman performance on a difficult task.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript describes OpenAI Five, a deep RL agent trained via self-play that defeated the Dota 2 world champion team (Team OG) on April 13, 2019. It details the game's challenges (long horizons, imperfect information, continuous spaces), the distributed training system processing batches of ~2 million frames every 2 seconds, and the 10-month continual training process that produced the result.
Significance. If the central empirical outcome holds under equivalent conditions, the work shows that scaled self-play RL can reach superhuman performance on a complex, real-time, multi-agent task with partial observability. This provides a concrete, falsifiable demonstration of existing RL techniques at extreme scale and has implications for long-horizon decision making in other domains.
major comments (1)
- [Results section on the April 13, 2019 match against Team OG] The claim that the April 13, 2019 victory establishes superhuman performance via learned strategy requires explicit verification that match conditions were equivalent to standard human play. The manuscript does not detail (in the results or methods sections describing the exhibition match) whether the agent's observation space was restricted to human-visible information, whether action latency and timing matched human reaction limits, or whether inference-time compute was constrained to human-equivalent levels. Without these controls, alternative explanations for the outcome cannot be ruled out.
minor comments (2)
- [Abstract] The abstract is outcome-focused but omits any mention of the scale of training or key implementation choices; adding one sentence on these would improve context without lengthening the paper.
- [Figures and captions] Several figures (e.g., training curves and architecture diagrams) would benefit from explicit axis labels, units, and direct textual references in the surrounding paragraphs for clarity.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed review of our manuscript. We address the single major comment below and have revised the manuscript to improve transparency on the exhibition match conditions.
read point-by-point responses
-
Referee: The claim that the April 13, 2019 victory establishes superhuman performance via learned strategy requires explicit verification that match conditions were equivalent to standard human play. The manuscript does not detail (in the results or methods sections describing the exhibition match) whether the agent's observation space was restricted to human-visible information, whether action latency and timing matched human reaction limits, or whether inference-time compute was constrained to human-equivalent levels. Without these controls, alternative explanations for the outcome cannot be ruled out.
Authors: We agree that the manuscript would benefit from greater explicitness on the exhibition match conditions, as the current text provides only a high-level description of the outcome without a dedicated breakdown in the Results or Methods sections. We have revised the manuscript by adding a new paragraph in the Results section (cross-referenced from the abstract and introduction) that describes the match setup. The agent's observation space used the full game state available via the Dota 2 API rather than being restricted to human-visible information. Action timing followed the model's inference speed at the game's native rate without additional artificial delays to match human reaction limits. Inference-time compute was supplied by the distributed training infrastructure and was not throttled to human-equivalent levels. We maintain that these AI-specific conditions are appropriate for the demonstration and do not invalidate the evidence of learned long-horizon strategy; the agent still had to discover and execute complex, coordinated behaviors to prevail. We have also added a short discussion of alternative explanations and why the outcome supports our central claim. This addresses the referee's concern directly.
revision: yes
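The three conditions at issue in this exchange can be pinned down as a small configuration checklist. A hedged sketch (field names and the 200 ms placeholder are ours; only the three exhibition settings are taken from the rebuttal above):

```python
from dataclasses import dataclass

@dataclass
class MatchConditions:
    """Illustrative controls for comparing exhibition vs. human-equivalent play."""
    human_visible_obs_only: bool    # restrict observations to what a player could see
    min_reaction_delay_ms: float    # enforce a floor on action latency
    inference_compute_capped: bool  # throttle test-time compute to human-scale hardware

# As conceded in the rebuttal, the exhibition match ran roughly as:
exhibition = MatchConditions(
    human_visible_obs_only=False,   # full game state via the Dota 2 API
    min_reaction_delay_ms=0.0,      # no delay beyond the model's inference speed
    inference_compute_capped=False, # served by the training infrastructure
)

# The controlled rematch proposed above would flip all three:
controlled = MatchConditions(
    human_visible_obs_only=True,
    min_reaction_delay_ms=200.0,    # placeholder human reaction floor
    inference_compute_capped=True,
)
```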
Circularity Check
No circularity: empirical match outcome stands independent of training description
full rationale
The paper reports the training of OpenAI Five via scaled self-play RL over 10 months and its observed defeat of Team OG on April 13, 2019, as direct evidence that self-play RL can reach superhuman performance. No equations, fitted parameters, or derivations are presented that reduce the performance claim to inputs by construction. The central result is an external, falsifiable event (the match outcome) rather than a self-referential definition, renamed pattern, or self-citation chain. The description of the distributed training system and tools is procedural and does not invoke uniqueness theorems or ansatzes that loop back to the target claim. This is a standard empirical systems paper with no load-bearing circular steps.
Forward citations
Cited by 30 Pith papers
-
Generative Agents: Interactive Simulacra of Human Behavior
Generative agents with memory streams, reflection, and planning using LLMs exhibit believable individual and emergent social behaviors in a simulated town.
-
Controllability in preference-conditioned multi-objective reinforcement learning
Standard MORL metrics do not measure whether preference inputs reliably control agent behavior, so a new controllability metric is introduced to restore the link between user intent and agent output.
-
Learning to Communicate Locally for Large-Scale Multi-Agent Pathfinding
LC-MAPF uses multi-round local communication between neighboring agents in a pre-trained model to outperform prior learning-based MAPF solvers on diverse unseen scenarios while preserving scalability.
-
Does Synthetic Data Help? Empirical Evidence from Deep Learning Time Series Forecasters
Synthetic data augmentation helps channel-mixing time series models but degrades channel-independent ones, with reliable gains only from seasonal-trend generators and gradual schedules in low-resource settings.
-
SOPE: Stabilizing Off-Policy Evaluation for Online RL with Prior Data
SOPE uses an actor-aligned OPE signal on a held-out validation split to dynamically stop offline stabilization phases in online RL, improving performance up to 45.6% and cutting TFLOPs up to 22x on 25 Minari tasks.
-
Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks
COSPLAY co-evolves an LLM decision agent with a skill bank agent to improve long-horizon game performance, reporting over 25.1% average reward gains versus frontier LLM baselines on single-player benchmarks.
-
InfoChess: A Game of Adversarial Inference and a Laboratory for Quantifiable Information Control
InfoChess proposes a symmetric adversarial game focused purely on information control and probabilistic king-location inference, with RL agents outperforming heuristic baselines and gameplay dissected via belief entro...
-
Territory Paint Wars: Diagnosing and Mitigating Failure Modes in Competitive Multi-Agent PPO
PPO in a new competitive game fails due to five implementation bugs and then competitive overfitting where self-play stays near 50% but generalization drops to 21.6%; mixing 20% random opponents restores generalizatio...
-
Voyager: An Open-Ended Embodied Agent with Large Language Models
Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more uniq...
-
Learning to Communicate Locally for Large-Scale Multi-Agent Pathfinding
LC-MAPF is a decentralized MAPF solver that uses a learnable multi-round communication module among nearby agents to outperform prior IL and RL methods while preserving scalability.
-
QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL
QHyer replaces return-to-go with a state-conditioned Q-estimator and adds a gated hybrid attention-mamba backbone to achieve state-of-the-art performance in offline goal-conditioned RL on both Markovian and non-Markov...
-
QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL
QHyer achieves state-of-the-art results in offline goal-conditioned RL by replacing return-to-go with a state-conditioned Q-estimator and introducing a gated hybrid attention-mamba backbone for content-adaptive histor...
-
Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning
Odysseus adapts PPO with a turn-level critic and leverages pretrained VLM action priors to train agents achieving at least 3x average game progress over frontier models in long-horizon Super Mario Land.
-
When Errors Can Be Beneficial: A Categorization of Imperfect Rewards for Policy Gradient
Certain errors in proxy rewards for policy gradient methods can be benign or beneficial by preventing policies from stalling on outputs with mediocre ground truth rewards, enabling improved RLHF metrics and reward des...
-
Biased Dreams: Limitations to Epistemic Uncertainty Quantification in Latent Space Models
Latent transitions in models like Dreamer are biased toward dense regions, creating attractors that hide true dynamics discrepancies and cause epistemic uncertainty to be unreliable while overestimating rewards.
-
CivBench: Progress-Based Evaluation for LLMs' Strategic Decision-Making in Civilization V
CivBench trains models on turn-level states in Civilization V to predict victory probabilities, providing a progress-based evaluation of LLM strategic capabilities across 307 games with 7 models.
-
Vocabulary Dropout for Curriculum Diversity in LLM Co-Evolution
Vocabulary dropout prevents diversity collapse in LLM co-evolution by masking proposer logits, yielding average +4.4 point solver gains on mathematical reasoning benchmarks at 8B scale.
-
Heterogeneous Self-Play for Realistic Highway Traffic Simulation
PHASE uses heterogeneous self-play and context-conditioned policies to achieve realistic, zero-shot highway traffic simulation that outperforms traditional rule-based and self-play models on real-world datasets.
-
Muon is Scalable for LLM Training
Muon optimizer with weight decay and update scaling achieves ~2x efficiency over AdamW for large LLMs, shown via the Moonlight 3B/16B MoE model trained on 5.7T tokens.
-
TD-MPC2: Scalable, Robust World Models for Continuous Control
TD-MPC2 scales an implicit world-model RL method to a 317M-parameter agent that masters 80 tasks across four domains with a single hyperparameter configuration.
-
Language Models (Mostly) Know What They Know
Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
-
A General Language Assistant as a Laboratory for Alignment
Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
-
Isaac Gym: High Performance GPU-Based Physics Simulation For Robot Learning
Isaac Gym achieves 2-3 orders of magnitude faster robot policy training by keeping physics simulation and PyTorch-based RL entirely on GPU with direct buffer sharing.
-
Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse
The paper organizes research on generalist game AI into Dataset, Model, Harness, and Benchmark pillars and charts a five-level progression from single-game mastery to agents that create and live inside game multiverses.
-
On Training Large Language Models for Long-Horizon Tasks: An Empirical Study of Horizon Length
Longer action horizons bottleneck LLM agent training through instability, but training with reduced horizons stabilizes learning and enables better generalization to longer horizons.
-
A High-Throughput Compute-Efficient POMDP Hide-And-Seek-Engine (HASE) for Multi-Agent Operations
A C++ Dec-POMDP simulator using data-oriented design and zero-copy PyTorch integration achieves up to 33 million steps per second on a 16-core CPU, enabling multi-agent policy training in minutes with PPO, DQN, and SAC.
-
RAMP: Hybrid DRL for Online Learning of Numeric Action Models
RAMP learns numeric action models online via a DRL-planning feedback loop and outperforms PPO on IPC numeric domains in solvability and plan quality.
-
Gymnasium: A Standard Interface for Reinforcement Learning Environments
Gymnasium establishes a standardized API for RL environments to improve interoperability, reproducibility, and ease of development in reinforcement learning.
-
Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse
This work traces four eras of generalist game players across dataset, model, harness, and benchmark pillars and charts a five-level roadmap ending in agents that create and evolve within game multiverses.
-
Benefits of Low-Cost Bio-Inspiration in the Age of Overparametrization
Shallow MLPs and dense CPGs outperform deeper MLPs and Actor-Critic RL in bounded robot control tasks with limited proprioception, with a Parameter Impact metric indicating extra RL parameters yield no performance gai...
Reference graph
Works this paper leans on
-
[1]
TD-Gammon, a self-teaching backgammon program, achieves master-level play
Tesauro, G. TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation 6, 215–219 (1994)
work page 1994
-
[2]
Campbell, M., Hoane Jr., A. J. & Hsu, F.-h. Deep Blue. Artif. Intell. 134, 57–83. ISSN: 0004-3702 (Jan. 2002)
work page 2002
-
[3]
Playing Atari with Deep Reinforcement Learning
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D. & Riedmiller, M. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013)
work page Pith review arXiv 2013
-
[4]
Mastering the game of Go with deep neural networks and tree search
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of Go with deep neural networks and tree search. Nature 529, 484 (2016)
work page 2016
-
[5]
Learning Dexterity
OpenAI. Learning Dexterity. https://openai.com/blog/learning-dexterity/. [Online; accessed 28-May-2019]. 2018
work page 2019
-
[6]
A Deep Reinforced Model for Abstractive Summarization
Paulus, R., Xiong, C. & Socher, R. A Deep Reinforced Model for Abstractive Summarization (2017)
- [7]
-
[8]
Grandmaster level in StarCraft II using multi-agent reinforcement learning
Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu, M., Dudzik, A., Chung, J., Choi, D. H., Powell, R., Ewalds, T., Georgiev, P., et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 1–5 (2019)
work page 2019
-
[9]
The MineRL Competition on Sample Efficient Reinforcement Learning using Human Priors
Guss, W. H., Codel, C., Hofmann, K., Houghton, B., Kuno, N., Milani, S., Mohanty, S. P., Liebana, D. P., Salakhutdinov, R., Topin, N., Veloso, M. & Wang, P. The MineRL Competition on Sample Efficient Reinforcement Learning using Human Priors. CoRR abs/1904.10079. arXiv: 1904.10079. <http://arxiv.org/abs/1904.10079> (2019)
-
[10]
Dota 2 — Wikipedia, The Free Encyclopedia
Wikipedia contributors. Dota 2 — Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/w/index.php?title=Dota_2&oldid=913733447. [Online; accessed 9-September-2019]. 2019
work page 2019
-
[11]
Wikipedia contributors. The International 2018 — Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/w/index.php?title=The_International_2018&oldid=912865272. [Online; accessed 9-September-2019]. 2019
work page 2018
-
[12]
Allis, L. V. Searching for solutions in games and artificial intelligence (1994)
work page 1994
-
[13]
Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., et al. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science 362, 1140–1144 (2018)
work page 2018
-
[14]
Gers, F. A., Schmidhuber, J. & Cummins, F. Learning to forget: Continual prediction with LSTM (1999)
work page 1999
-
[15]
Proximal Policy Optimization Algorithms
Schulman, J., Wolski, F., Dhariwal, P., Radford, A. & Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)
work page Pith review arXiv 2017
-
[16]
Konda, V. R. & Tsitsiklis, J. N. Actor-critic algorithms. In Advances in Neural Information Processing Systems (2000), 1008–1014
work page 2000
-
[17]
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D. & Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning (2016), 1928–1937
work page 2016
-
[18]
Schulman, J., Moritz, P., Levine, S., Jordan, M. I. & Abbeel, P. High-Dimensional Continuous Control Using Generalized Advantage Estimation. CoRR abs/1506.02438 (2016)
work page Pith review arXiv 2016
-
[19]
Distributed Prioritized Experience Replay
Horgan, D., Quan, J., Budden, D., Barth-Maron, G., Hessel, M., van Hasselt, H. & Silver, D. Distributed Prioritized Experience Replay. CoRR abs/1803.00933. arXiv: 1803.00933. <http://arxiv.org/abs/1803.00933> (2018)
work page Pith review arXiv 2018
-
[20]
NVIDIA Collective Communications Library (NCCL)
NVIDIA. NVIDIA Collective Communications Library (NCCL). https://developer.nvidia.com/nccl. [Online; accessed 9-September-2019]. 2019
work page 2019
-
[21]
Adam: A Method for Stochastic Optimization
Kingma, D. P. & Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
work page Pith review arXiv 2014
-
[22]
Williams, R. J. & Peng, J. An Efficient Gradient-Based Algorithm for On-Line Training of Recurrent Network Trajectories. Neural Computation 2, 490–501 (1990)
work page 1990
- [23]
- [24]
-
[25]
Solving Rubik’s Cube with a Robot Hand
OpenAI, Akkaya, I., Andrychowicz, M., Chociej, M., Litwin, M., McGrew, B., Petron, A., Paino, A., Plappert, M., Powell, G., Ribas, R., Schneider, J., Tezak, N., Tworek, J., Welinder, P., Weng, L., Yuan, Q., Zaremba, W. & Zhang, L. Solving Rubik’s Cube with a Robot Hand
- [26]
-
[27]
Dalvi, N., Domingos, P., Mausam, Sanghai, S. & Verma, D. Adversarial Classification. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, Seattle, WA, USA, 2004), 99–108. ISBN: 1-58113-888-1. doi:10.1145/1014052.1014066. <http://doi.acm.org/10.1145/1014052.1014066>
-
[28]
Jain, A., Bansal, R., Kumar, A. & Singh, K. A comparative study of visual and auditory reaction times on the basis of gender and physical activity levels of medical first year students. International Journal of Applied and Basic Medical Research 5, 125–127 (2015)
work page 2015
-
[29]
Herbrich, R., Minka, T. & Graepel, T. TrueSkill: a Bayesian skill rating system. In Advances in Neural Information Processing Systems (2007), 569–576
work page 2007
-
[30]
An Empirical Model of Large-Batch Training
McCandlish, S., Kaplan, J., Amodei, D. & Team, O. D. An empirical model of large-batch training. arXiv preprint arXiv:1812.06162 (2018)
work page Pith review arXiv 2018
-
[31]
Cobbe, K., Klimov, O., Hesse, C., Kim, T. & Schulman, J. Quantifying Generalization in Reinforcement Learning. CoRR abs/1812.02341. arXiv: 1812.02341. <http://arxiv.org/abs/1812.02341> (2018)
-
[32]
Human-level performance in first-person multiplayer games with population-based deep reinforcement learning
Jaderberg, M., Czarnecki, W. M., Dunning, I., Marris, L., Lever, G., Castaneda, A. G., Beattie, C., Rabinowitz, N. C., Morcos, A. S., Ruderman, A., et al. Human-level performance in first-person multiplayer games with population-based deep reinforcement learning. arXiv preprint arXiv:1807.01281 (2018)
-
[33]
Moravčík, M., Schmid, M., Burch, N., Lisý, V., Morrill, D., Bard, N., Davis, T., Waugh, K., Johanson, M. & Bowling, M. DeepStack: Expert-level artificial intelligence in heads-up no-limit poker. Science 356, 508–513 (2017)
work page 2017
-
[34]
Schaeffer, J., Culberson, J., Treloar, N., Knight, B., Lu, P. & Szafron, D. A world championship caliber checkers program. Artificial Intelligence 53, 273–289. ISSN: 0004-3702 (1992)
work page 1992
-
[35]
Emergent Complexity via Multi-Agent Competition
Bansal, T., Pachocki, J., Sidor, S., Sutskever, I. & Mordatch, I. Emergent complexity via multi-agent competition. arXiv preprint arXiv:1710.03748 (2017)
work page Pith review arXiv 2017
- [36]
-
[37]
Brown, G. W. in Activity Analysis of Production and Allocation (ed. Koopmans, T. C.) (Wiley, New York, 1951)
work page 1951
-
[38]
Heinrich, J. & Silver, D. Deep Reinforcement Learning from Self-Play in Imperfect-Information Games. CoRR abs/1603.01121. arXiv: 1603.01121 (2016)
-
[39]
Mastering the game of Go without human knowledge
Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. Mastering the game of Go without human knowledge. Nature 550, 354 (2017)
work page 2017
-
[40]
Anthony, T., Tian, Z. & Barber, D. Thinking fast and slow with deep learning and tree search. In Advances in Neural Information Processing Systems (2017), 5360–5370
work page 2017
-
[41]
Brown, N. & Sandholm, T. Superhuman AI for multiplayer poker. Science, eaay2400 (2019)
work page 2019
-
[42]
Watkins, C. J. & Dayan, P. Q-learning. Machine Learning 8, 279–292 (1992)
work page 1992
-
[43]
Kulkarni, T. D., Narasimhan, K., Saeedi, A. & Tenenbaum, J. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in Neural Information Processing Systems (2016), 3675–3683
work page 2016
-
[44]
Exploration by random network distillation
Burda, Y., Edwards, H., Storkey, A. & Klimov, O. Exploration by random network distillation. arXiv preprint arXiv:1810.12894 (2018)
-
[45]
Ecoffet, A., Huizinga, J., Lehman, J., Stanley, K. O. & Clune, J. Montezuma’s Revenge solved by Go-Explore, a new algorithm for hard-exploration problems (sets records on Pitfall too). Uber Engineering Blog (Nov. 2018)
work page 2018
-
[46]
Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y. & He, K. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour (June 2017)
work page 2017
-
[47]
You, Y., Gitman, I. & Ginsburg, B. Scaling SGD Batch Size to 32K for ImageNet Training (Aug. 2017)
work page 2017
-
[48]
You, Y., Zhang, Z., Hsieh, C.-J., Demmel, J. & Keutzer, K. ImageNet Training in Minutes. In Proceedings of the 47th International Conference on Parallel Processing (ACM, Eugene, OR, USA, 2018), 1:1–1:10. ISBN: 978-1-4503-6510-9. doi:10.1145/3225058.3225069. <http://doi.acm.org/10.1145/3225058.3225069>
-
[49]
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D. & Kavukcuoglu, K. Asynchronous Methods for Deep Reinforcement Learning. In Proceedings of The 33rd International Conference on Machine Learning (eds Balcan, M. F. & Weinberger, K. Q.) 48 (PMLR, New York, New York, USA, June 2016), 1928–1937. <http://proceedings.mlr.press/v48/mniha16.html>
work page 2016
- [50]
- [51]
-
[52]
Wang, Y., Ramanan, D. & Hebert, M. Growing a Brain: Fine-Tuning by Increasing Model Capacity. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017 (2017), 3029–3038. doi:10.1109/CVPR.2017.323. <https://doi.org/10.1109/CVPR.2017.323>
-
[53]
Czarnecki, W. M., Jayakumar, S. M., Jaderberg, M., Hasenclever, L., Teh, Y. W., Osindero, S., Heess, N. & Pascanu, R. Mix&Match - Agent Curricula for Reinforcement Learning. CoRR abs/1806.01780. arXiv: 1806.01780. <http://arxiv.org/abs/1806.01780> (2018)
-
[54]
Li, Z. & Hoiem, D. Learning without Forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence 40, 2935–2947. ISSN: 0162-8828 (Dec. 2018)
work page 2018
-
[55]
Rusu, A. A., Rabinowitz, N. C., Desjardins, G., Soyer, H., Kirkpatrick, J., Kavukcuoglu, K., Pascanu, R. & Hadsell, R. Progressive Neural Networks. arXiv preprint arXiv:1606.04671 (2016)
work page Pith review arXiv 2016
-
[56]
Distilling the Knowledge in a Neural Network
Hinton, G., Vinyals, O. & Dean, J. Distilling the Knowledge in a Neural Network. In NIPS Deep Learning and Representation Learning Workshop (2015). <http://arxiv.org/abs/1503.02531>
work page Pith review arXiv 2015
-
[57]
Ross, S., Gordon, G. & Bagnell, D. A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (eds Gordon, G., Dunson, D. & Dudík, M.) 15 (PMLR, Fort Lauderdale, FL, USA, Apr. 2011), 627–635. <http://proceedings.mlr.press/v15/ros...
work page 2011
-
[58]
Levine, S. & Koltun, V. Guided Policy Search. In Proceedings of the 30th International Conference on International Conference on Machine Learning - Volume 28 (JMLR.org, Atlanta, GA, USA, 2013), III-1–III-9. <http://dl.acm.org/citation.cfm?id=3042817.3042937>
work page 2013
-
[59]
AI and Compute
OpenAI. AI and Compute. https://openai.com/blog/ai-and-compute/. [Online; accessed 9-Sept-2019]. 2018
work page 2019
-
[60]
Ng, A. Y., Harada, D. & Russell, S. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the Sixteenth International Conference on Machine Learning (Morgan Kaufmann, 1999), 278–287
work page 1999
-
[61]
Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J. & Zaremba, W. OpenAI Gym. CoRR abs/1606.01540. arXiv: 1606.01540. <http://arxiv.org/abs/1606.01540> (2016)
work page Pith review arXiv 2016
-
[62]
Balduzzi, D., Garnelo, M., Bachrach, Y., Czarnecki, W. M., Pérolat, J., Jaderberg, M. & Graepel, T. Open-ended Learning in Symmetric Zero-sum Games. CoRR abs/1901.08106. arXiv: 1901.08106. <http://arxiv.org/abs/1901.08106> (2019)
-
[63]
Williams, R. J. & Peng, J. Function optimization using connectionist reinforcement learning algorithms. Connection Science 3, 241–268 (1991)
work page 1991
discussion (0)