pith. machine review for the scientific record.

arxiv: 1912.06680 · v1 · submitted 2019-12-13 · 💻 cs.LG · stat.ML

Recognition: 2 theorem links

Dota 2 with Large Scale Deep Reinforcement Learning


Pith reviewed 2026-05-12 22:13 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords reinforcement learning · self-play · Dota 2 · superhuman performance · deep reinforcement learning · distributed training · imperfect information · esports

The pith

Large-scale self-play reinforcement learning produced an AI that defeated the Dota 2 world champions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that scaling reinforcement learning with self-play can yield superhuman performance in Dota 2. The game demands handling extended planning periods, hidden opponent details, and a broad range of continuous actions and states. Training occurred over ten months on a distributed system that ingested roughly two million game frames every two seconds. Success here would indicate that standard reinforcement learning methods, when expanded in data volume and compute, can master environments with these demanding features. A sympathetic reader would view this as a step toward applying similar techniques to other intricate decision problems.

Core claim

The paper reports that the AI system trained with deep reinforcement learning through self-play became the first to defeat the Dota 2 world champions. This outcome followed from scaling existing techniques via a distributed training system and continual training tools, allowing the system to learn from batches of approximately two million frames every two seconds across ten months. The result shows that self-play reinforcement learning reaches superhuman levels on a task defined by long time horizons, imperfect information, and complex continuous state-action spaces.

What carries the argument

The distributed training system that supports continual self-play reinforcement learning at the scale of two million frames processed every two seconds.
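The throughput figure can be turned into rough arithmetic. The sketch below is a back-of-envelope calculation, not a number reported by the paper: it assumes the rate of about 2 million frames every 2 seconds was held constant for the full 10 months and approximates a month as 30 days. Both are simplifying assumptions, so the total should be read as an order-of-magnitude estimate. The function names are hypothetical.

```python
# Back-of-envelope throughput estimate (illustrative only).
# Assumptions not taken from the paper: a constant batch rate for the
# whole run, and 30-day months.

SECONDS_PER_DAY = 24 * 60 * 60


def frames_per_second(batch_frames: float, batch_interval_s: float) -> float:
    """Average frames consumed per second for a given batch size and cadence."""
    return batch_frames / batch_interval_s


def total_frames(rate_fps: float, months: float, days_per_month: int = 30) -> float:
    """Total frames consumed if the rate were sustained for the whole run."""
    return rate_fps * months * days_per_month * SECONDS_PER_DAY


rate = frames_per_second(batch_frames=2_000_000, batch_interval_s=2)
total = total_frames(rate, months=10)

print(f"{rate:,.0f} frames per second")
print(f"{total:.2e} frames over 10 months")
```

Under these assumptions the system consumes about one million frames per second, which puts the ten-month total on the order of 10^13 frames.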

If this is right

  • Self-play reinforcement learning can improve without human demonstrations or external data.
  • Processing large volumes of frames at high frequency supports mastery of long-horizon planning.
  • Complex games with imperfect information can be played at superhuman levels through scaled training.
  • Continuous action spaces in strategy environments yield to the same reinforcement learning approach.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The training scale used here could transfer to other multiplayer strategy simulations sharing similar time and information challenges.
  • Real-world tasks involving resource allocation or autonomous planning under uncertainty might benefit from equivalent self-play scaling.
  • Further increases in training duration or data throughput could uncover strategies beyond current human levels.

Load-bearing premise

That the AI's victories stemmed from genuine strategic superiority instead of match-specific rules, setup differences, or unaccounted advantages.

What would settle it

A controlled rematch series under the same rules where the human team wins the majority of games would show the superhuman performance claim does not hold.
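One way to make "wins the majority of games" concrete is to ask how likely a given series result would be if the two sides were evenly matched. The sketch below computes an exact one-sided binomial tail probability using only the standard library. The series lengths and win counts are hypothetical inputs chosen for illustration, not results from the paper, and treating games as independent with a fixed win probability is a simplifying assumption.

```python
from math import comb


def binomial_tail(wins: int, games: int, p: float = 0.5) -> float:
    """Return P(X >= wins) for X ~ Binomial(games, p)."""
    if not 0 <= wins <= games:
        raise ValueError("wins must be between 0 and games")
    return sum(
        comb(games, k) * p**k * (1 - p) ** (games - k)
        for k in range(wins, games + 1)
    )


# Hypothetical rematch outcomes: (human wins, games played).
for human_wins, games in [(2, 3), (6, 10), (15, 20)]:
    tail = binomial_tail(human_wins, games)
    print(f"{human_wins}/{games} human wins: "
          f"P(at least this many | even match) = {tail:.3f}")
```

A short series such as 2 wins out of 3 is entirely consistent with evenly matched sides (the tail probability is 0.5), so a rematch would need to be considerably longer before a majority of human wins could be read as strong evidence either way.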

Original abstract

On April 13th, 2019, OpenAI Five became the first AI system to defeat the world champions at an esports game. The game of Dota 2 presents novel challenges for AI systems such as long time horizons, imperfect information, and complex, continuous state-action spaces, all challenges which will become increasingly central to more capable AI systems. OpenAI Five leveraged existing reinforcement learning techniques, scaled to learn from batches of approximately 2 million frames every 2 seconds. We developed a distributed training system and tools for continual training which allowed us to train OpenAI Five for 10 months. By defeating the Dota 2 world champion (Team OG), OpenAI Five demonstrates that self-play reinforcement learning can achieve superhuman performance on a difficult task.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript describes OpenAI Five, a deep RL agent trained via self-play that defeated the Dota 2 world champion team (Team OG) on April 13, 2019. It details the game's challenges (long horizons, imperfect information, continuous spaces), the distributed training system processing batches of ~2 million frames every 2 seconds, and the 10-month continual training process that produced the result.

Significance. If the central empirical outcome holds under equivalent conditions, the work shows that scaled self-play RL can reach superhuman performance on a complex, real-time, multi-agent task with partial observability. This provides a concrete, falsifiable demonstration of existing RL techniques at extreme scale and has implications for long-horizon decision making in other domains.

major comments (1)
  1. [Results section on the April 13, 2019 match against Team OG] The claim that the April 13, 2019 victory establishes superhuman performance via learned strategy requires explicit verification that match conditions were equivalent to standard human play. The manuscript does not detail (in the results or methods sections describing the exhibition match) whether the agent's observation space was restricted to human-visible information, whether action latency and timing matched human reaction limits, or whether inference-time compute was constrained to human-equivalent levels. Without these controls, alternative explanations for the outcome cannot be ruled out.
minor comments (2)
  1. [Abstract] The abstract is outcome-focused but omits any mention of the scale of training or key implementation choices; adding one sentence on these would improve context without lengthening the paper.
  2. [Figures and captions] Several figures (e.g., training curves and architecture diagrams) would benefit from explicit axis labels, units, and direct textual references in the surrounding paragraphs for clarity.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive and detailed review of our manuscript. We address the single major comment below and have revised the manuscript to improve transparency on the exhibition match conditions.

Point-by-point responses
  1. Referee: The claim that the April 13, 2019 victory establishes superhuman performance via learned strategy requires explicit verification that match conditions were equivalent to standard human play. The manuscript does not detail (in the results or methods sections describing the exhibition match) whether the agent's observation space was restricted to human-visible information, whether action latency and timing matched human reaction limits, or whether inference-time compute was constrained to human-equivalent levels. Without these controls, alternative explanations for the outcome cannot be ruled out.

    Authors: We agree that the manuscript would benefit from greater explicitness on the exhibition match conditions, as the current text provides only a high-level description of the outcome without a dedicated breakdown in the Results or Methods sections. We have revised the manuscript by adding a new paragraph in the Results section (cross-referenced from the abstract and introduction) that describes the match setup. The agent's observation space used the full game state available via the Dota 2 API rather than being restricted to human-visible information. Action timing followed the model's inference speed at the game's native rate without additional artificial delays to match human reaction limits. Inference-time compute was supplied by the distributed training infrastructure and was not throttled to human-equivalent levels. We maintain that these AI-specific conditions are appropriate for the demonstration and do not invalidate the evidence of learned long-horizon strategy; the agent still had to discover and execute complex, coordinated behaviors to prevail. We have also added a short discussion of alternative explanations and why the outcome supports our central claim. This addresses the referee's concern directly.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical match outcome stands independent of training description

Full rationale

The paper reports the training of OpenAI Five via scaled self-play RL over 10 months and its observed defeat of Team OG on April 13, 2019, as direct evidence that self-play RL can reach superhuman performance. No equations, fitted parameters, or derivations are presented that reduce the performance claim to inputs by construction. The central result is an external, falsifiable event (the match outcome) rather than a self-referential definition, renamed pattern, or self-citation chain. The description of the distributed training system and tools is procedural and does not invoke uniqueness theorems or ansatzes that loop back to the target claim. This is a standard empirical systems paper with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the empirical success of scaling self-play RL; no explicit free parameters, axioms, or invented entities are stated in the abstract, though the approach implicitly assumes that self-play generates sufficient signal for superhuman performance.

pith-pipeline@v0.9.0 · 5529 in / 975 out tokens · 73326 ms · 2026-05-12T22:13:04.192641+00:00 · methodology


Forward citations

Cited by 33 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Generative Agents: Interactive Simulacra of Human Behavior

    cs.HC 2023-04 accept novelty 8.0

    Generative agents with memory streams, reflection, and planning using LLMs exhibit believable individual and emergent social behaviors in a simulated town.

  2. Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling

    cs.LG 2026-05 unverdicted novelty 7.0

    DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.

  3. ASH: Agents that Self-Hone via Embodied Learning

    cs.AI 2026-05 unverdicted novelty 7.0

    ASH reaches 11.2/12 milestones in Pokemon Emerald and 9.9/12 in Zelda by self-improving via an IDM trained on its own trajectories to label internet video, while baselines plateau at roughly 6/12.

  4. Controllability in preference-conditioned multi-objective reinforcement learning

    cs.LG 2026-05 unverdicted novelty 7.0

    Standard MORL metrics do not measure whether preference inputs reliably control agent behavior, so a new controllability metric is introduced to restore the link between user intent and agent output.

  5. Learning to Communicate Locally for Large-Scale Multi-Agent Pathfinding

    cs.AI 2026-05 unverdicted novelty 7.0

    LC-MAPF uses multi-round local communication between neighboring agents in a pre-trained model to outperform prior learning-based MAPF solvers on diverse unseen scenarios while preserving scalability.

  6. Does Synthetic Data Help? Empirical Evidence from Deep Learning Time Series Forecasters

    cs.LG 2026-05 accept novelty 7.0

    Synthetic data augmentation helps channel-mixing time series models but degrades channel-independent ones, with reliable gains only from seasonal-trend generators and gradual schedules in low-resource settings.

  7. SOPE: Stabilizing Off-Policy Evaluation for Online RL with Prior Data

    cs.LG 2026-05 unverdicted novelty 7.0

    SOPE uses an actor-aligned OPE signal on a held-out validation split to dynamically stop offline stabilization phases in online RL, improving performance up to 45.6% and cutting TFLOPs up to 22x on 25 Minari tasks.

  8. Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks

    cs.AI 2026-04 unverdicted novelty 7.0

    COSPLAY co-evolves an LLM decision agent with a skill bank agent to improve long-horizon game performance, reporting over 25.1% average reward gains versus frontier LLM baselines on single-player benchmarks.

  9. InfoChess: A Game of Adversarial Inference and a Laboratory for Quantifiable Information Control

    cs.MA 2026-04 unverdicted novelty 7.0

    InfoChess proposes a symmetric adversarial game focused purely on information control and probabilistic king-location inference, with RL agents outperforming heuristic baselines and gameplay dissected via belief entro...

  10. Territory Paint Wars: Diagnosing and Mitigating Failure Modes in Competitive Multi-Agent PPO

    cs.LG 2026-04 conditional novelty 7.0

    PPO in a new competitive game fails due to five implementation bugs and then competitive overfitting where self-play stays near 50% but generalization drops to 21.6%; mixing 20% random opponents restores generalizatio...

  11. Voyager: An Open-Ended Embodied Agent with Large Language Models

    cs.AI 2023-05 unverdicted novelty 7.0

    Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more uniq...

  12. Learning to Communicate Locally for Large-Scale Multi-Agent Pathfinding

    cs.AI 2026-05 unverdicted novelty 6.0

    LC-MAPF is a decentralized MAPF solver that uses a learnable multi-round communication module among nearby agents to outperform prior IL and RL methods while preserving scalability.

  13. QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL

    cs.LG 2026-05 unverdicted novelty 6.0

    QHyer replaces return-to-go with a state-conditioned Q-estimator and adds a gated hybrid attention-mamba backbone to achieve state-of-the-art performance in offline goal-conditioned RL on both Markovian and non-Markov...

  14. QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL

    cs.LG 2026-05 unverdicted novelty 6.0

    QHyer achieves state-of-the-art results in offline goal-conditioned RL by replacing return-to-go with a state-conditioned Q-estimator and introducing a gated hybrid attention-mamba backbone for content-adaptive histor...

  15. Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    Odysseus adapts PPO with a turn-level critic and leverages pretrained VLM action priors to train agents achieving at least 3x average game progress over frontier models in long-horizon Super Mario Land.

  16. When Errors Can Be Beneficial: A Categorization of Imperfect Rewards for Policy Gradient

    cs.LG 2026-04 unverdicted novelty 6.0

    Certain errors in proxy rewards for policy gradient methods can be benign or beneficial by preventing policies from stalling on outputs with mediocre ground truth rewards, enabling improved RLHF metrics and reward des...

  17. Biased Dreams: Limitations to Epistemic Uncertainty Quantification in Latent Space Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Latent transitions in models like Dreamer are biased toward dense regions, creating attractors that hide true dynamics discrepancies and cause epistemic uncertainty to be unreliable while overestimating rewards.

  18. CivBench: Progress-Based Evaluation for LLMs' Strategic Decision-Making in Civilization V

    cs.AI 2026-04 unverdicted novelty 6.0

    CivBench trains models on turn-level states in Civilization V to predict victory probabilities, providing a progress-based evaluation of LLM strategic capabilities across 307 games with 7 models.

  19. Vocabulary Dropout for Curriculum Diversity in LLM Co-Evolution

    cs.CL 2026-04 unverdicted novelty 6.0

    Vocabulary dropout prevents diversity collapse in LLM co-evolution by masking proposer logits, yielding average +4.4 point solver gains on mathematical reasoning benchmarks at 8B scale.

  20. Heterogeneous Self-Play for Realistic Highway Traffic Simulation

    cs.AI 2026-03 accept novelty 6.0

    PHASE uses heterogeneous self-play and context-conditioned policies to achieve realistic, zero-shot highway traffic simulation that outperforms traditional rule-based and self-play models on real-world datasets.

  21. Muon is Scalable for LLM Training

    cs.LG 2025-02 unverdicted novelty 6.0

    Muon optimizer with weight decay and update scaling achieves ~2x efficiency over AdamW for large LLMs, shown via the Moonlight 3B/16B MoE model trained on 5.7T tokens.

  22. TD-MPC2: Scalable, Robust World Models for Continuous Control

    cs.LG 2023-10 conditional novelty 6.0

    TD-MPC2 scales an implicit world-model RL method to a 317M-parameter agent that masters 80 tasks across four domains with a single hyperparameter configuration.

  23. Language Models (Mostly) Know What They Know

    cs.CL 2022-07 unverdicted novelty 6.0

    Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

  24. A General Language Assistant as a Laboratory for Alignment

    cs.CL 2021-12 conditional novelty 6.0

    Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.

  25. Isaac Gym: High Performance GPU-Based Physics Simulation For Robot Learning

    cs.RO 2021-08 conditional novelty 6.0

    Isaac Gym achieves 2-3 orders of magnitude faster robot policy training by keeping physics simulation and PyTorch-based RL entirely on GPU with direct buffer sharing.

  26. Data-Augmented Game Starts for Accelerating Self-Play Exploration in Imperfect Information Games

    cs.LG 2026-05 unverdicted novelty 5.0

    DAGS initializes policy-gradient self-play from human-derived intermediate states to reduce exploitability in challenging imperfect-information games, with a multi-task flag fix for resulting bias and new benchmark en...

  27. Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse

    cs.CV 2026-05 unverdicted novelty 5.0

    The paper organizes research on generalist game AI into Dataset, Model, Harness, and Benchmark pillars and charts a five-level progression from single-game mastery to agents that create and live inside game multiverses.

  28. On Training Large Language Models for Long-Horizon Tasks: An Empirical Study of Horizon Length

    cs.AI 2026-05 unverdicted novelty 5.0

    Longer action horizons bottleneck LLM agent training through instability, but training with reduced horizons stabilizes learning and enables better generalization to longer horizons.

  29. A High-Throughput Compute-Efficient POMDP Hide-And-Seek-Engine (HASE) for Multi-Agent Operations

    cs.MA 2026-04 unverdicted novelty 5.0

    A C++ Dec-POMDP simulator using data-oriented design and zero-copy PyTorch integration achieves up to 33 million steps per second on a 16-core CPU, enabling multi-agent policy training in minutes with PPO, DQN, and SAC.

  30. RAMP: Hybrid DRL for Online Learning of Numeric Action Models

    cs.AI 2026-04 unverdicted novelty 5.0

    RAMP learns numeric action models online via a DRL-planning feedback loop and outperforms PPO on IPC numeric domains in solvability and plan quality.

  31. Gymnasium: A Standard Interface for Reinforcement Learning Environments

    cs.LG 2024-07 accept novelty 5.0

    Gymnasium establishes a standardized API for RL environments to improve interoperability, reproducibility, and ease of development in reinforcement learning.

  32. Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse

    cs.CV 2026-05 unverdicted novelty 3.0

    This work traces four eras of generalist game players across dataset, model, harness, and benchmark pillars and charts a five-level roadmap ending in agents that create and evolve within game multiverses.

  33. Benefits of Low-Cost Bio-Inspiration in the Age of Overparametrization

    cs.RO 2026-04 unverdicted novelty 3.0

    Shallow MLPs and dense CPGs outperform deeper MLPs and Actor-Critic RL in bounded robot control tasks with limited proprioception, with a Parameter Impact metric indicating extra RL parameters yield no performance gai...

Reference graph

Works this paper leans on

77 extracted references · 77 canonical work pages · cited by 30 Pith papers · 8 internal anchors

  1. [1]

    TD-Gammon, a self-teaching backgammon program, achieves master-level play

    Tesauro, G. TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural computation6, 215–219 (1994)

  2. [2]

    Campbell, M., Hoane Jr., A. J. & Hsu, F.-h. Deep Blue.Artif. Intell.134, 57–83. issn: 0004- 3702 (Jan. 2002)

  3. [3]

    Playing Atari with Deep Reinforcement Learning

    Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D. & Riedmiller, M. Playing atari with deep reinforcement learning.arXiv preprint arXiv:1312.5602(2013)

  4. [4]

    Mastering the game of Go with deep neural networks and tree search.nature 529, 484 (2016)

    Silver,D.,Huang,A.,Maddison,C.J.,Guez,A.,Sifre,L.,VanDenDriessche,G.,Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M.,et al. Mastering the game of Go with deep neural networks and tree search.nature 529, 484 (2016)

  5. [5]

    Learning Dexterity https : / / openai

    OpenAI. Learning Dexterity https : / / openai . com / blog / learning - dexterity/. [Online; accessed 28-May-2019]. 2018

  6. [6]

    & Socher, R.A Deep Reinforced Model for Abstractive Summarization

    Paulus, R., Xiong, C. & Socher, R.A Deep Reinforced Model for Abstractive Summarization

  7. [7]

    arXiv: 1705.04304 [cs.CL]

  8. [8]

    M., Mathieu, M., Dudzik, A., Chung, J., Choi, D

    Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu, M., Dudzik, A., Chung, J., Choi, D. H., Powell, R., Ewalds, T., Georgiev, P.,et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning.Nature, 1–5 (2019)

  9. [9]

    H., Codel, C., Hofmann, K., Houghton, B., Kuno, N., Milani, S., Mohanty, S

    Guss, W. H., Codel, C., Hofmann, K., Houghton, B., Kuno, N., Milani, S., Mohanty, S. P., Liebana, D. P., Salakhutdinov, R., Topin, N., Veloso, M. & Wang, P. The MineRL Competition on Sample Efficient Reinforcement Learning using Human Priors.CoRR abs/1904.10079. arXiv: 1904.10079. <http://arxiv.org/abs/1904.10079> (2019)

  10. [10]

    Dota 2 — Wikipedia, The Free Encyclopediahttps://en.wikipedia

    Wikipediacontributors. Dota 2 — Wikipedia, The Free Encyclopediahttps://en.wikipedia. org/w/index.php?title=Dota_2&oldid=913733447. [Online; accessed 9-September- 2019]. 2019

  11. [11]

    The International 2018 — Wikipedia, The Free Encyclopediahttps: //en.wikipedia.org/w/index.php?title=The_International_2018&oldid= 912865272

    Wikipedia contributors. The International 2018 — Wikipedia, The Free Encyclopediahttps: //en.wikipedia.org/w/index.php?title=The_International_2018&oldid= 912865272. [Online; accessed 9-September-2019]. 2019

  12. [12]

    Allis, L. V. Searching for solutions in games and artificial intelligencein (1994)

  13. [13]

    Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T.,et al.A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play.Science 362, 1140–1144 (2018)

  14. [14]

    A., Schmidhuber, J

    Gers, F. A., Schmidhuber, J. & Cummins, F. Learning to forget: Continual prediction with LSTM (1999)

  15. [15]

    Proximal Policy Optimization Algorithms

    Schulman, J., Wolski, F., Dhariwal, P., Radford, A. & Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347(2017)

  16. [16]

    Konda, V. R. & Tsitsiklis, J. N. Actor-critic algorithms in Advances in neural information processing systems(2000), 1008–1014

  17. [17]

    Asynchronous methods for deep reinforcement learningin International conference on ma- chine learning(2016), 1928–1937

    Mnih,V.,Badia,A.P.,Mirza,M.,Graves,A.,Lillicrap,T.,Harley,T.,Silver,D.&Kavukcuoglu, K. Asynchronous methods for deep reinforcement learningin International conference on ma- chine learning(2016), 1928–1937. 19

  18. [18]

    Schulman, J., Moritz, P., Levine, S., Jordan, M. I. & Abbeel, P. High-Dimensional Continuous Control Using Generalized Advantage Estimation.CoRR abs/1506.02438 (2016)

  19. [19]

    Distributed Prioritized Experience Replay

    Horgan, D., Quan, J., Budden, D., Barth-Maron, G., Hessel, M., van Hasselt, H. & Silver, D. Distributed Prioritized Experience Replay.CoRR abs/1803.00933. arXiv: 1803.00933. <http://arxiv.org/abs/1803.00933> (2018)

  20. [20]

    NVIDIA Collective Communications Library (NCCL) https : / / developer

    NVIDIA. NVIDIA Collective Communications Library (NCCL) https : / / developer . nvidia.com/nccl. [Online; accessed 9-September-2019]. 2019

  21. [21]

    Adam: A Method for Stochastic Optimization

    Kingma,D.P.&Ba,J.Adam:Amethodforstochasticoptimization. arXiv preprint arXiv:1412.6980 (2014)

  22. [22]

    Williams, R. J. & Peng, J. An Efficient Gradient-Based Algorithm for On-Line Training of Recurrent Network Trajectories.Neural Computation2, 490–501 (1990)

  23. [23]

    & Kingma, D

    Gray, S., Radford, A. & Kingma, D. P.GPU Kernels for Block-Sparse Weights2017

  24. [24]

    Chen, T., Goodfellow, I. J. & Shlens, J.Net2Net: Accelerating Learning via Knowledge Transfer in 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings(2016). <http://arxiv.org/abs/ 1511.05641>

  25. [25]

    & Zhang, L.Solving Rubik’s Cube with a Robot Hand

    OpenAI, Akkaya, I., Andrychowicz, M., Chociej, M., Litwin, M., McGrew, B., Petron, A., Paino, A., Plappert, M., Powell, G., Ribas, R., Schneider, J., Tezak, N., Tworek, J., Welinder, P., Weng, L., Yuan, Q., Zaremba, W. & Zhang, L.Solving Rubik’s Cube with a Robot Hand

  26. [26]

    arXiv: 1910.07113 [cs.LG]

  27. [27]

    Dalvi, N., Domingos, P., Mausam, Sanghai, S. & Verma, D.Adversarial Classificationin Pro- ceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, Seattle, WA, USA, 2004), 99–108.isbn: 1-58113-888-1. doi:10.1145/ 1014052.1014066. <http://doi.acm.org/10.1145/1014052.1014066>

  28. [28]

    & Singh, K

    Jain, A., Bansal, R., Kumar, A. & Singh, K. A comparative study of visual and auditory reaction times on the basis of gender and physical activity levels of medical first year students. International journal of applied and basic medical research5, 125–127 (2015)

  29. [29]

    & Graepel, T.TrueSkill: a Bayesian skill rating systemin Advances in neural information processing systems(2007), 569–576

    Herbrich, R., Minka, T. & Graepel, T.TrueSkill: a Bayesian skill rating systemin Advances in neural information processing systems(2007), 569–576

  30. [30]

    An Empirical Model of Large-Batch Training

    McCandlish, S., Kaplan, J., Amodei, D. & Team, O. D. An empirical model of large-batch training. arXiv preprint arXiv:1812.06162(2018)

  31. [31]

    & Schulman, J

    Cobbe, K., Klimov, O., Hesse, C., Kim, T. & Schulman, J. Quantifying Generalization in Reinforcement Learning.CoRR abs/1812.02341. arXiv: 1812.02341. <http://arxiv. org/abs/1812.02341> (2018)

  32. [32]

    M., Dunning, I., Marris, L., Lever, G., Castaneda, A

    Jaderberg, M., Czarnecki, W. M., Dunning, I., Marris, L., Lever, G., Castaneda, A. G., Beattie, C., Rabinowitz, N. C., Morcos, A. S., Ruderman, A.,et al.Human-level performance in first- person multiplayer games with population-based deep reinforcement learning.arXiv preprint arXiv:1807.01281 (2018)

  33. [33]

    & Bowling, M

    Moravčík, M., Schmid, M., Burch, N., Lisý, V., Morrill, D., Bard, N., Davis, T., Waugh, K., Johanson, M. & Bowling, M. Deepstack: Expert-level artificial intelligence in heads-up no-limit poker. Science 356, 508–513 (2017). 20

  34. [34]

    & Szafron, D

    Schaeffer, J., Culberson, J., Treloar, N., Knight, B., Lu, P. & Szafron, D. A world championship caliber checkers program.Artificial Intelligence53, 273–289. issn: 0004-3702 (1992)

  35. [35]

    Emergent Complexity via Multi-Agent Competition

    Bansal, T., Pachocki, J., Sidor, S., Sutskever, I. & Mordatch, I. Emergent complexity via multi-agent competition.arXiv preprint arXiv:1710.03748(2017)

  36. [36]

    Sukhbaatar,S.,Lin,Z.,Kostrikov,I.,Synnaeve,G.,Szlam,A.&Fergus,R.Intrinsicmotivation and automatic curricula via asymmetric self-play.arXiv preprint arXiv:1703.05407(2017)

  37. [37]

    Brown, G. W. inActivity Analysis of Production and Allocation(ed Koopmans, T. C.) (Wiley, New York, 1951)

  38. [38]

    Deep reinforcement learning from self-play in imperfect-information games.arXiv preprint arXiv:1603.01121,

    Heinrich, J. & Silver, D. Deep Reinforcement Learning from Self-Play in Imperfect-Information Games. CoRR abs/1603.01121. arXiv: 1603.01121 (2016)

  39. [39]

    Mastering the game of go without human knowledge

    Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A.,et al. Mastering the game of go without human knowledge. Nature 550, 354 (2017)

  40. [40]

    & Barber, D.Thinking fast and slow with deep learning and tree search in Advances in Neural Information Processing Systems(2017), 5360–5370

    Anthony, T., Tian, Z. & Barber, D.Thinking fast and slow with deep learning and tree search in Advances in Neural Information Processing Systems(2017), 5360–5370

  41. [41]

    & Sandholm, T

    Brown, N. & Sandholm, T. Superhuman AI for multiplayer poker.Science, eaay2400 (2019)

  42. [42]

    Watkins, C. J. & Dayan, P. Q-learning.Machine learning8, 279–292 (1992)

  43. [43]

    D., Narasimhan, K., Saeedi, A

    Kulkarni, T. D., Narasimhan, K., Saeedi, A. & Tenenbaum, J. Hierarchical deep reinforce- ment learning: Integrating temporal abstraction and intrinsic motivationin Advances in neural information processing systems(2016), 3675–3683

  44. [44]

    Exploration by Random Network Distillation

    Burda, Y., Edwards, H., Storkey, A. & Klimov, O. Exploration by random network distillation. arXiv preprint arXiv:1810.12894(2018)

  45. [45]

    Ecoffet, A., Huizinga, J., Lehman, J., Stanley, K. O. & Clune, J. Montezuma’s revenge solved by go-explore, a new algorithm for hard-exploration problems (sets records on pitfall too). Uber Engineering Blog, Nov(2018)

  46. [46]

    Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y. & He, K. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour (June 2017)

  47. [47]

    & Ginsburg, B

    You, Y., Gitman, I. & Ginsburg, B. Scaling SGD Batch Size to 32K for ImageNet Training (Aug. 2017)

  48. [48]

    & Keutzer, K.ImageNet Training in Minutesin Proceedings of the 47th International Conference on Parallel Processing(ACM, Eugene, OR, USA, 2018), 1:1–1:10

    You, Y., Zhang, Z., Hsieh, C.-J., Demmel, J. & Keutzer, K.ImageNet Training in Minutesin Proceedings of the 47th International Conference on Parallel Processing(ACM, Eugene, OR, USA, 2018), 1:1–1:10. isbn: 978-1-4503-6510-9. doi:10.1145/3225058.3225069. <http: //doi.acm.org/10.1145/3225058.3225069>

  49. [49]

    Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D. & Kavukcuoglu, K. Asynchronous Methods for Deep Reinforcement Learning in Proceedings of The 33rd International Conference on Machine Learning (eds Balcan, M. F. & Weinberger, K. Q.) 48 (PMLR, New York, New York, USA, June 2016), 1928–1937. <http://proceedings.mlr.press/v48/mniha16.html>

  50. [50]

    Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dunning, I., et al. Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561 (2018)

  51. [51]

    Fahlman, S. E. & Lebiere, C. in (ed Touretzky, D. S.) 524–532 (Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1990). isbn: 1-55860-100-7. <http://dl.acm.org/citation.cfm?id=109230.107380>

  52. [52]

    Wang, Y., Ramanan, D. & Hebert, M. Growing a Brain: Fine-Tuning by Increasing Model Capacity in 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017 (2017), 3029–3038. doi:10.1109/CVPR.2017.323. <https://doi.org/10.1109/CVPR.2017.323>

  53. [53]

    Czarnecki, W. M., Jayakumar, S. M., Jaderberg, M., Hasenclever, L., Teh, Y. W., Osindero, S., Heess, N. & Pascanu, R. Mix&Match - Agent Curricula for Reinforcement Learning. CoRR abs/1806.01780. arXiv: 1806.01780. <http://arxiv.org/abs/1806.01780> (2018)

  54. [54]

    Li, Z. & Hoiem, D. Learning without Forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence 40, 2935–2947. issn: 0162-8828 (Dec. 2018)

  55. [55]

    Rusu, A. A., Rabinowitz, N. C., Desjardins, G., Soyer, H., Kirkpatrick, J., Kavukcuoglu, K., Pascanu, R. & Hadsell, R. Progressive Neural Networks. arXiv preprint arXiv:1606.04671 (2016)

  56. [56]

    Hinton, G., Vinyals, O. & Dean, J. Distilling the Knowledge in a Neural Network in NIPS Deep Learning and Representation Learning Workshop (2015). <http://arxiv.org/abs/1503.02531>

  57. [57]

    Ross, S., Gordon, G. & Bagnell, D. A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (eds Gordon, G., Dunson, D. & Dudík, M.) 15 (PMLR, Fort Lauderdale, FL, USA, Apr. 2011), 627–635. <http://proceedings.mlr.press/v15/ros...

  58. [58]

    Levine, S. & Koltun, V. Guided Policy Search in Proceedings of the 30th International Conference on International Conference on Machine Learning - Volume 28 (JMLR.org, Atlanta, GA, USA, 2013), III-1–III-9. <http://dl.acm.org/citation.cfm?id=3042817.3042937>

  59. [59]

    OpenAI. AI and Compute https://openai.com/blog/ai-and-compute/. [Online; accessed 9-Sept-2019]. 2018

  60. [60]

    Ng, A. Y., Harada, D. & Russell, S. Policy invariance under reward transformations: Theory and application to reward shaping in Proceedings of the Sixteenth International Conference on Machine Learning (Morgan Kaufmann, 1999), 278–287

  61. [61]

    Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J. & Zaremba, W. OpenAI Gym. CoRR abs/1606.01540. arXiv: 1606.01540. <http://arxiv.org/abs/1606.01540> (2016)

  62. [62]

    Balduzzi, D., Garnelo, M., Bachrach, Y., Czarnecki, W. M., Pérolat, J., Jaderberg, M. & Graepel, T. Open-ended Learning in Symmetric Zero-sum Games. CoRR abs/1901.08106. arXiv: 1901.08106. <http://arxiv.org/abs/1901.08106> (2019)

  63. [63]

    cheating

    Williams, R. J. & Peng, J. Function optimization using connectionist reinforcement learning algorithms. Connection Science 3, 241–268 (1991).

    Appendix Table of Contents: A Compute Usage; B Surgery; C Hyperparameters; D Evaluating agents’ understanding; D.1 Understanding OpenAI Five Finals; D...

  64. [64]

    Observation Space

    Predictions by different heroes differ because each hero specifically predicts whether it will participate in bringing a given building down. Predictions should not be read as calibrated probabilities, because they are trained with a discount factor. See Figure 11a and Figure 11b for descriptions of the events corresponding to two of these buildings. We also looked at ...

  65. [65]

    Ability Builds: Each hero has four spell abilities. Over the course of the game, a player can choose which of these to “level up,” making that particular skill more powerful. For these, in evaluation games we follow a fixed schedule (improve ability X at level 1, then Y at level 2, then Z at level 3, etc.). In training, we randomize around this fixed script s...

  66. [66]

    Item Purchasing: As a hero gains gold, they can purchase items. We divide items into consumables (items which are consumed for a one-time benefit such as healing) and everything else. For consumables, we use simple logic that ensures the agent always has a certain set of consumables; when the agent uses one up, we purchase a new one. After a...

  67. [67]

    Item Swap: Each player can choose 6 of the items they hold to keep in their “inventory” where they are actively usable, leaving up to 3 inactive items in their “backpack.” Instead of letting the model control this, we use a heuristic which approximately keeps the most valuable items in the inventory
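The item-swap heuristic described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the text does not say how item value is measured, so the numeric values, the item names, and the function `split_items` are all hypothetical. Only the slot counts (6 inventory, up to 3 backpack) come from the text.

```python
# Slot counts taken from the text; everything else here is an assumption.
INVENTORY_SLOTS = 6  # actively usable items
BACKPACK_SLOTS = 3   # inactive items


def split_items(items):
    """Keep the most valuable items in the inventory, the rest in the backpack.

    `items` is a list of (name, value) pairs with at most
    INVENTORY_SLOTS + BACKPACK_SLOTS entries. The value metric is hypothetical.
    """
    if len(items) > INVENTORY_SLOTS + BACKPACK_SLOTS:
        raise ValueError("a hero cannot hold more than 9 items")
    # Highest value first; sorted() is stable, so ties keep their input order.
    ranked = sorted(items, key=lambda item: item[1], reverse=True)
    return ranked[:INVENTORY_SLOTS], ranked[INVENTORY_SLOTS:]


# Placeholder item names and values, purely for illustration.
held = [("a", 500), ("b", 2200), ("c", 90), ("d", 4100),
        ("e", 1400), ("f", 300), ("g", 2750), ("h", 50)]
inventory, backpack = split_items(held)
```

With these placeholder values the two cheapest items end up in the backpack. A real heuristic would presumably also account for situational usefulness, which the word "approximately" in the text hints at.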

  68. [68]

    Team”) and some just to the hero who took the action “Solo

    Courier Control: Each side has a single “Courier” unit which cannot fight but can carry items from the shop to the player which purchased them. We use state-machine-based logic to control this character.

    G Reward Weights

    Our agent’s ultimate goal is to win the game. In order to simplify the credit assignment problem (the task of figuring out which of the ...

  69. [69]

    The primary action is chosen via a linear projection over the available actions. (Figure 18 diagram labels: FC; softmax sample/argmax; Chosen Action ID.)
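A minimal sketch of this action head follows, in plain Python with toy sizes. The function names, the weights, and the choice to mask unavailable actions by setting their logits to negative infinity are assumptions made for illustration; the text only states that a linear projection is followed by a softmax sample or argmax over the available actions.

```python
import math
import random


def linear(hidden, weights, bias):
    """Project a hidden state to one logit per action (a fully-connected layer)."""
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(weights, bias)]


def masked_softmax(logits, available):
    """Softmax in which unavailable actions get zero probability (assumed masking)."""
    masked = [l if ok else float("-inf") for l, ok in zip(logits, available)]
    peak = max(masked)
    if peak == float("-inf"):
        raise ValueError("at least one action must be available")
    exps = [math.exp(l - peak) for l in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


def choose_action(probs, sample=True, rng=random):
    """Sample an action ID from probs, or take the argmax when sample is False."""
    if sample:
        return rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return max(range(len(probs)), key=probs.__getitem__)


# Toy example: a 2-dimensional hidden state and 3 actions, the last unavailable.
hidden = [0.5, -1.0]
weights = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0]]
bias = [0.0, 0.0, 0.0]
available = [True, True, False]

probs = masked_softmax(linear(hidden, weights, bias), available)
greedy = choose_action(probs, sample=False)
sampled = choose_action(probs, sample=True, rng=random.Random(0))
```

The "sample/argmax" label in the figure suggests that one mode is stochastic and the other greedy; which one is used in which setting is not stated in this excerpt.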

  70. [70]

    Other action parameters are linear projections from the LSTM. (Figure 18 diagram labels: FC; softmax sample/argmax; Offset X; Offset Y; Embedding; multiply; sigmoid; dot product; Target Unit.)

  71. [71]

    The target unit is chosen via an attention mechanism over the available units. The unit keys are masked by a learned per-action mask based on the sampled action.

    Figure 18: The hidden state of the LSTM and unit embeddings are used to parameterize the actions.

    each responsible for the observations and actions of one of the heroes in the team. At a hi...

  72. [72]

    In practice the Observation Processing portion of the model is also cloned 5 times for the five different heroes. The weights are identical and the observations are nearly identical — but there are a handful of derived features which are different for each replica (such as “distance to me” for each unit; see Table 4 for the list of observations that vary). T...

  73. [73]

    The “Flattened Observation” and “Hero Embedding” are processed before being sent into the LSTM (see Figure 17b) by a fully-connected layer and a “cross-hero pool” operation, to ensure that the non-identical observations can be used by other members of the team if needed

  74. [74]

    The “Unit Embeddings” from the observation processing are carried along beside the LSTM, and used by the action heads to choose a unit to target (see Figure 18). In addition to the action logits, the value function is computed as another linear projection of the LSTM state. Thus our value function and action policy share a network and share gradients. ...

  75. [75]

    If the agent must take a long and very specific series of actions in order to randomly stumble on a reward, and any deviation from that sequence results in negative advantage, then the longer this series, the less likely the agent is to explore this skill thoroughly and learn to use it when necessary

  76. [76]

    If an environment is highly repetitive, then the agent is more likely to find and stay in a local minimum

  77. [77]

    designing skyscrapers

    In order to be robust to various strategies humans employ, our agents must have encountered a wide variety of situations in training. This parallels the success of domain randomization in transferring policies from simulation to real-world robotics [5]. We randomize many parts of the environment:

    • Initial State: In our rollout games, heroes start with ran...