arXiv: 1907.02057 · v1 · pith:CLHRNOYTnew · submitted 2019-07-03 · cs.LG · cs.AI · cs.RO · stat.ML

Benchmarking Model-Based Reinforcement Learning

Pith reviewed 2026-05-25 10:10 UTC · model grok-4.3

keywords: model-based reinforcement learning, benchmarking, sample efficiency, dynamics modeling, planning horizon, early termination, reinforcement learning environments, algorithm comparison

The pith

A unified benchmark of model-based RL algorithms reveals three recurring challenges: inaccurate dynamics, uncertain planning horizons, and early episode termination.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper collects multiple model-based reinforcement learning algorithms and introduces more than 18 environments built specifically to test them side by side. All algorithms run under identical settings that include added noise, allowing direct measurement of relative sample efficiency. The comparisons show that performance gaps arise from how each method learns and uses its dynamics model, decides how many steps to plan ahead, and manages episodes that stop early. A reader would care because model-based methods are expected to learn with far less data than model-free ones, yet these three issues seem to keep that advantage from materializing consistently. The authors release the full benchmark so later work can measure progress against the same baseline.

Core claim

When a broad set of model-based RL algorithms is evaluated inside the same collection of environments and problem settings, their differences reduce to three core difficulties: the dynamics bottleneck in which small model errors grow during planning, the planning horizon dilemma of choosing how far to look ahead without knowing the right length in advance, and the early-termination dilemma created by episodes that end before the planned horizon is reached.
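The dynamics bottleneck can be illustrated with a toy calculation. The sketch below is not taken from the paper: it assumes a scalar linear system `x_{t+1} = a * x_t` and a learned model whose coefficient is slightly wrong (both hypothetical), and it reports how the gap between true and predicted states widens as the imagined rollout gets longer.

```python
def rollout(a, x0, horizon):
    """Iterate the scalar linear system x_{t+1} = a * x_t for `horizon` steps."""
    xs = [x0]
    for _ in range(horizon):
        xs.append(a * xs[-1])
    return xs


def rollout_errors(a_true, a_model, x0, horizon):
    """Absolute gap between true and model-predicted states at each step."""
    true_xs = rollout(a_true, x0, horizon)
    model_xs = rollout(a_model, x0, horizon)
    return [abs(t - m) for t, m in zip(true_xs, model_xs)]


# Hypothetical numbers: the model's coefficient is off by 1%.
errors = rollout_errors(a_true=1.00, a_model=1.01, x0=1.0, horizon=30)
for step in (1, 10, 30):
    print(f"step {step:2d}: |error| = {errors[step]:.4f}")
```

In this toy setting the one-step error is small, but the error after 30 imagined steps is many times larger, which is the compounding effect the claim refers to. Real learned models are nonlinear and stochastic, so the growth pattern there may differ.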

What carries the argument

The suite of over 18 MBRL-specific environments together with the standardized evaluation protocol that isolates the effects of dynamics learning, planning length, and episode termination.

If this is right

  • Further gains require dynamics models whose prediction errors do not compound over multiple planning steps.
  • Algorithms must incorporate explicit mechanisms for selecting or adapting the number of steps they plan ahead.
  • Environments and algorithms need consistent rules for handling episodes that terminate before the planning horizon ends.
  • Releasing the environments and code makes it possible to measure whether new methods actually reduce the identified bottlenecks.
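The horizon and termination points above can be made concrete with a small planner. What follows is a minimal sketch of random shooting, assuming a hypothetical 1-D model, reward, and termination rule (`model_step`, `reward_fn`, `is_terminal` are illustrative names, not functions from the benchmark). The planning horizon is an explicit argument, and imagined rollouts stop accumulating reward once the model predicts that the episode has ended.

```python
import random


def model_step(state, action):
    """Hypothetical learned model: a 1-D point that moves by `action`."""
    return state + action


def reward_fn(state):
    """Hypothetical reward: positive everywhere reachable, highest at x = 3."""
    return 10.0 - abs(state - 3.0)


def is_terminal(state):
    """Hypothetical termination rule: the episode ends if |x| exceeds 5."""
    return abs(state) > 5.0


def plan_random_shooting(state, horizon, num_candidates, rng):
    """Return the first action of the best sampled action sequence.

    Rewards are accumulated only until the model predicts termination,
    so imagined steps after the episode would have ended contribute nothing.
    """
    best_return, best_action = float("-inf"), 0.0
    for _ in range(num_candidates):
        actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        s, total = state, 0.0
        for a in actions:
            s = model_step(s, a)
            total += reward_fn(s)
            if is_terminal(s):
                break  # stop the imagined rollout at predicted termination
        if total > best_return:
            best_return, best_action = total, actions[0]
    return best_action


rng = random.Random(0)
for horizon in (1, 5, 20):
    action = plan_random_shooting(0.0, horizon, num_candidates=200, rng=rng)
    print(f"horizon={horizon:2d} -> first action {action:+.3f}")
```

The reward is kept positive on purpose: with negative per-step rewards, cutting a rollout short at termination would make ending the episode look attractive, which is one way the termination rule and the planner can interact badly.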

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Extending the same benchmark to physical robot tasks would test whether the three dilemmas remain the dominant limits once simulation-to-reality gaps are added.
  • If the dynamics, horizon, and termination issues are resolved, model-based methods could become preferable for any setting where collecting new experience is costly.
  • The inclusion of noisy environments suggests that robustness to observation or transition noise should be a standard requirement in future RL algorithm comparisons.
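One way such noise could be injected is with a thin wrapper around an environment. This is a sketch under stated assumptions, not the benchmark's implementation: it assumes a hypothetical interface in which `reset()` returns an observation list and `step(action)` returns `(observation, reward, done)`, and `PointEnv` exists only to exercise the wrapper.

```python
import random


class PointEnv:
    """Hypothetical 1-D environment used only to exercise the wrapper."""

    def reset(self):
        self.x = 0.0
        return [self.x]

    def step(self, action):
        self.x += action
        reward = -abs(self.x)
        done = abs(self.x) > 5.0
        return [self.x], reward, done


class GaussianNoiseWrapper:
    """Add zero-mean Gaussian noise to actions and observations."""

    def __init__(self, env, obs_std=0.0, action_std=0.0, seed=None):
        self.env = env
        self.obs_std = obs_std
        self.action_std = action_std
        self.rng = random.Random(seed)

    def _noisy_obs(self, obs):
        return [o + self.rng.gauss(0.0, self.obs_std) for o in obs]

    def reset(self):
        return self._noisy_obs(self.env.reset())

    def step(self, action):
        # Perturb the action before it reaches the true dynamics,
        # then perturb what the agent gets to observe.
        noisy_action = action + self.rng.gauss(0.0, self.action_std)
        obs, reward, done = self.env.step(noisy_action)
        return self._noisy_obs(obs), reward, done


env = GaussianNoiseWrapper(PointEnv(), obs_std=0.1, action_std=0.03, seed=0)
obs = env.reset()
obs, reward, done = env.step(0.5)
print(obs, reward, done)
```

Seeding the wrapper separately from the agent keeps the noise sequence reproducible across algorithms, so each method would face the same perturbations.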

Load-bearing premise

The environments and noise conditions chosen are representative of the general difficulties in model-based RL rather than being special to this particular test collection.

What would settle it

Re-running the full set of algorithms on an independent collection of environments not constructed for MBRL and obtaining substantially different performance orderings or different explanations for the gaps would show the three dilemmas are not the main limiting factors.
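Whether two environment collections yield "substantially different performance orderings" can be quantified with a rank correlation. The sketch below computes Kendall's tau between two score tables; the algorithm names and scores are placeholders invented for illustration, not results from the paper, and the function assumes there are no tied scores.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall rank correlation between two {algorithm: score} tables.

    Returns +1.0 for identical orderings and -1.0 for fully reversed ones.
    Only algorithms present in both tables are compared; ties are not handled.
    """
    names = sorted(set(scores_a) & set(scores_b))
    if len(names) < 2:
        raise ValueError("need at least two shared algorithms")
    concordant = discordant = 0
    for x, y in combinations(names, 2):
        sign = (scores_a[x] - scores_a[y]) * (scores_b[x] - scores_b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    num_pairs = len(names) * (len(names) - 1) / 2
    return (concordant - discordant) / num_pairs


# Placeholder scores for four hypothetical algorithms on two suites.
suite_a = {"algo_A": 3.0, "algo_B": 2.0, "algo_C": 1.0, "algo_D": 0.0}
suite_b = {"algo_A": 3.0, "algo_B": 1.0, "algo_C": 2.0, "algo_D": 0.0}
tau = kendall_tau(suite_a, suite_b)
print(f"Kendall tau = {tau:.3f}")
```

Here one of the six algorithm pairs swaps order between the suites, giving a tau of about 0.67. A value near 1 on an independent suite would suggest the orderings transfer; a value near 0 or below would suggest they do not.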

Figures

Figures reproduced from arXiv: 1907.02057 by Eric Langlois, Guodong Zhang, Ignasi Clavera, Jerrick Hoang, Jimmy Ba, Pieter Abbeel, Shunshi Zhang, Tingwu Wang, Xuchan Bao, Yeming Wen.

Figure 1: A subset of all 18 performance curve figures of the benchmarked algorithms. All the algorithms are run for 200k time-steps and with 4 random seeds. The remaining figures are in appendix C. … computational resources, the estimated wall-clock time, and whether the algorithm is fast enough to run at real-time at test time, namely, if the action selection can be done faster than the default time-step of the env…

Figure 2: Performance curve for each algorithm trained for 1 million time-steps. The results show that MBRL algorithms plateau at a performance level well below their model-free counterparts and themselves with ground-truth dynamics. This points out that when learning models, more data does not result in better performance. For instance, PETS’s performance plateaus after 400k time-steps at a value much lower than th…

Figure 3: The relative performance with different planning horizon. One of the critical choices in shooting methods is the planning horizon. In …

Figure 4: Performance curve for MBRL algorithms. There are still 3 more figures in a continued …

Figure 5: (Continued) Performance curve for MBRL algorithms.

Figure 6: The performance curve for algorithms with noise. We represent the noise standard deviation …

Figure 7: (Continued) The performance curve for algorithms with noise. We represent the noise …

Figure 8: The performance grid using different planning horizon and depth.

Figure 9: The performance curve for different environment length and planning horizon in SLBO.
Original abstract

Model-based reinforcement learning (MBRL) is widely seen as having the potential to be significantly more sample efficient than model-free RL. However, research in model-based RL has not been very standardized. It is fairly common for authors to experiment with self-designed environments, and there are several separate lines of research, which are sometimes closed-sourced or not reproducible. Accordingly, it is an open question how these various existing MBRL algorithms perform relative to each other. To facilitate research in MBRL, in this paper we gather a wide collection of MBRL algorithms and propose over 18 benchmarking environments specially designed for MBRL. We benchmark these algorithms with unified problem settings, including noisy environments. Beyond cataloguing performance, we explore and unify the underlying algorithmic differences across MBRL algorithms. We characterize three key research challenges for future MBRL research: the dynamics bottleneck, the planning horizon dilemma, and the early-termination dilemma. Finally, to maximally facilitate future research on MBRL, we open-source our benchmark in http://www.cs.toronto.edu/~tingwuwang/mbrl.html.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript collects existing MBRL algorithms, introduces over 18 custom environments specially designed for MBRL, evaluates the algorithms under unified settings that include noise, unifies their algorithmic differences, characterizes three key challenges (dynamics bottleneck, planning horizon dilemma, early-termination dilemma), and open-sources the full benchmark.

Significance. If the observed patterns hold beyond the custom suite, the work supplies a much-needed standardized, reproducible benchmark for MBRL that directly tackles the field's fragmentation and lack of comparable results. The open-sourcing of code and environments, together with the inclusion of noisy dynamics, constitutes a concrete contribution to reproducibility that future papers can build upon.

major comments (2)
  1. [Abstract and introduction] The central claim that the benchmark 'characterize[s] three key research challenges for future MBRL research' rests on performance patterns observed exclusively in environments described as 'specially designed for MBRL'. No comparison against established continuous-control suites (e.g., standard MuJoCo tasks) is reported, leaving open the possibility that the dynamics bottleneck, planning-horizon dilemma, and early-termination dilemma are artifacts of the chosen environment construction rather than intrinsic MBRL properties.
  2. [Benchmark environments section] The weakest assumption identified in the reader's report—that the custom environments are representative enough to reveal general algorithmic differences—is load-bearing for the three-dilemma characterization. Without an explicit design protocol or hold-out validation set that decouples environment features from the targeted dilemmas, the empirical support for generality remains incomplete.
minor comments (1)
  1. [Abstract] The open-source link should be verified to remain accessible and to include exact environment specifications and random seeds used in the reported runs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below, proposing targeted revisions to clarify scope and strengthen the presentation of our benchmark's design rationale while preserving the paper's focus on MBRL-specific environments.

  1. Referee: [Abstract and introduction] The central claim that the benchmark 'characterize[s] three key research challenges for future MBRL research' rests on performance patterns observed exclusively in environments described as 'specially designed for MBRL'. No comparison against established continuous-control suites (e.g., standard MuJoCo tasks) is reported, leaving open the possibility that the dynamics bottleneck, planning-horizon dilemma, and early-termination dilemma are artifacts of the chosen environment construction rather than intrinsic MBRL properties.

    Authors: We agree that the absence of direct comparisons to standard MuJoCo tasks leaves open the question of whether the three dilemmas generalize beyond our suite. Our environments were intentionally constructed to isolate MBRL-specific issues (e.g., compounding model error under noise and variable termination) that are often obscured in well-tuned model-free benchmarks. We will revise the abstract and introduction to explicitly qualify the claim: the dilemmas are characterized within an MBRL-oriented benchmark designed to surface them, rather than asserted as universal without further validation. We will also add a short discussion noting that future work could test the same algorithms on MuJoCo to assess transfer of the observed patterns. No new MuJoCo experiments are planned for this revision, as they would require a substantially expanded scope. revision: partial

  2. Referee: [Benchmark environments section] The weakest assumption identified in the reader's report—that the custom environments are representative enough to reveal general algorithmic differences—is load-bearing for the three-dilemma characterization. Without an explicit design protocol or hold-out validation set that decouples environment features from the targeted dilemmas, the empirical support for generality remains incomplete.

    Authors: We accept that an explicit design protocol was not sufficiently detailed. In the revised manuscript we will insert a dedicated subsection under 'Benchmark Environments' that enumerates the design principles: (1) controlled injection of dynamics noise to probe the dynamics bottleneck, (2) tunable planning horizons and reward sparsity to expose the planning-horizon dilemma, and (3) variable early-termination conditions to study the early-termination dilemma. Each environment's parameters are listed with the specific dilemma it targets. A hold-out validation set was not part of the original design; we will acknowledge this limitation and note it as a recommended practice for future benchmark extensions rather than adding one retroactively, as constructing a statistically independent hold-out would require new environment families. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmarking with no derivations or self-referential reductions

The paper is an empirical study that collects existing MBRL algorithms, introduces new environments, runs unified benchmarks, and reports observed performance patterns to characterize challenges. No equations, parameter fits, or derivations are present that reduce to inputs by construction. The characterization of challenges follows directly from the benchmark results rather than any self-definitional or fitted-input mechanism. Self-citations, if any, are not load-bearing for a central claim that reduces to prior author work. This matches the default case of a self-contained empirical paper with score 0.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central contribution rests on the domain assumption that standardizing environments and problem settings produces meaningful relative rankings of MBRL algorithms; no free parameters or invented entities are introduced beyond the choice of benchmark tasks themselves.

free parameters (1)
  • environment design choices
    The over 18 environments are specially designed, requiring choices about dynamics, noise levels, and termination conditions that are not derived from first principles.
axioms (1)
  • domain assumption: MBRL algorithms can be fairly compared when run under identical environment and noise settings
    Invoked when the paper states it benchmarks algorithms 'with unified problem settings, including noisy environments' to address the lack of standardization.

pith-pipeline@v0.9.0 · 5756 in / 1359 out tokens · 41366 ms · 2026-05-25T10:10:41.915347+00:00 · methodology

Forward citations

Cited by 9 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Advantage-Guided Diffusion for Model-Based Reinforcement Learning

    cs.AI 2026-04 unverdicted novelty 7.0

    Advantage-guided diffusion (SAG and EAG) steers sampling in diffusion world models to higher-advantage trajectories, enabling policy improvement and better sample efficiency on MuJoCo tasks.

  2. Mastering Atari with Discrete World Models

    cs.LG 2020-10 accept novelty 7.0

    DreamerV2 reaches human-level performance on 55 Atari games by learning behaviors inside a separately trained discrete-latent world model.

  3. Dream to Control: Learning Behaviors by Latent Imagination

    cs.LG 2019-12 accept novelty 7.0

    Dreamer learns to control from images by imagining and optimizing behaviors in a learned latent world model, outperforming prior methods on 20 visual tasks in data efficiency and final performance.

  4. QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL

    cs.LG 2026-05 unverdicted novelty 6.0

    QHyer replaces return-to-go with a state-conditioned Q-estimator and adds a gated hybrid attention-mamba backbone to achieve state-of-the-art performance in offline goal-conditioned RL on both Markovian and non-Markov...

  5. QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL

    cs.LG 2026-05 unverdicted novelty 6.0

    QHyer achieves state-of-the-art results in offline goal-conditioned RL by replacing return-to-go with a state-conditioned Q-estimator and introducing a gated hybrid attention-mamba backbone for content-adaptive histor...

  6. Breaking the Computational Barrier: Provably Efficient Actor-Critic for Low-Rank MDPs

    cs.LG 2026-05 unverdicted novelty 6.0

    An actor-critic RL algorithm for low-rank MDPs achieves improved sample efficiency using solely a policy evaluation oracle.

  7. Is Conditional Generative Modeling all you need for Decision-Making?

    cs.LG 2022-11 unverdicted novelty 6.0

    Return-conditional diffusion models for policies outperform offline RL on benchmarks by circumventing dynamic programming and enable constraint or skill composition.

  8. What Matters in Learning from Offline Human Demonstrations for Robot Manipulation

    cs.RO 2021-08 accept novelty 6.0

    A comprehensive benchmark study of offline imitation learning methods on multi-stage robot manipulation tasks identifies key sensitivities to algorithm design, data quality, and stopping criteria while releasing all d...

  9. D2 Actor Critic: Diffusion Actor Meets Distributional Critic

    cs.LG 2025-10 unverdicted novelty 5.0

    D2AC combines a diffusion actor with a distributional critic via fused distributional RL and clipped double Q-learning to reach state-of-the-art results on 18 hard control benchmarks including Humanoid, Dog, and Shadow Hand.
