Benchmarking Model-Based Reinforcement Learning

Eric Langlois; Guodong Zhang; Ignasi Clavera; Jerrick Hoang; Jimmy Ba; Pieter Abbeel; Shunshi Zhang; Tingwu Wang; Xuchan Bao; Yeming Wen

arXiv: 1907.02057 · v1 · pith:CLHRNOYTnew · submitted 2019-07-03 · cs.LG · cs.AI · cs.RO · stat.ML

Pith reviewed 2026-05-25 10:10 UTC · model grok-4.3

classification: cs.LG · cs.AI · cs.RO · stat.ML

keywords: model-based reinforcement learning, benchmarking, sample efficiency, dynamics modeling, planning horizon, early termination, reinforcement learning environments, algorithm comparison

The pith

A unified benchmark of model-based RL algorithms reveals three recurring challenges: inaccurate dynamics, uncertain planning horizons, and early episode termination.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper collects multiple model-based reinforcement learning algorithms and introduces more than 18 environments built specifically to test them side by side. All algorithms run under identical settings that include added noise, allowing direct measurement of relative sample efficiency. The comparisons make visible that performance gaps arise from how each method learns and uses its dynamics model, decides how many steps to plan ahead, and manages episodes that stop early. A reader would care because model-based methods are expected to learn with far less data than model-free ones, yet these three issues appear to prevent that advantage from appearing consistently. The authors release the full benchmark so later work can measure progress against the same baseline.

Core claim

When a broad set of model-based RL algorithms is evaluated inside the same collection of environments and problem settings, their differences reduce to three core difficulties: the dynamics bottleneck in which small model errors grow during planning, the planning horizon dilemma of choosing how far to look ahead without knowing the right length in advance, and the early-termination dilemma created by episodes that end before the planned horizon is reached.
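
The dynamics bottleneck named in the claim above can be illustrated with a toy calculation. The sketch below is not taken from the paper: it assumes a hypothetical scalar system (`true_step`) and a hypothetical learned model (`model_step`) whose coefficient is off by 2%, then rolls both forward from the same state to show how the gap between predicted and real states widens with the number of planning steps.

```python
def rollout(step, s0, horizon):
    """Apply a one-step dynamics function `horizon` times; return all visited states."""
    states = [s0]
    for _ in range(horizon):
        states.append(step(states[-1]))
    return states


def true_step(s):
    # Hypothetical ground-truth dynamics (an assumption for illustration).
    return 1.0 * s + 0.1


def model_step(s):
    # Hypothetical learned model: same form, coefficient wrong by 2%.
    return 1.02 * s + 0.1


def rollout_errors(s0=1.0, horizon=20):
    """Absolute gap between model rollout and true rollout at every step."""
    real = rollout(true_step, s0, horizon)
    pred = rollout(model_step, s0, horizon)
    return [abs(p - r) for p, r in zip(pred, real)]


if __name__ == "__main__":
    for t, err in enumerate(rollout_errors()):
        print(f"step {t:2d}  |predicted - real| = {err:.4f}")
```

Under these assumed dynamics the one-step error is about 0.02, and the gap keeps widening at every later step. That is the qualitative pattern the core claim describes, although real learned models need not behave like this linear example.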

What carries the argument

The suite of over 18 MBRL-specific environments together with the standardized evaluation protocol that isolates the effects of dynamics learning, planning length, and episode termination.

If this is right

Further gains require dynamics models whose prediction errors do not compound over multiple planning steps.
Algorithms must incorporate explicit mechanisms for selecting or adapting the number of steps they plan ahead.
Environments and algorithms need consistent rules for handling episodes that terminate before the planning horizon ends.
Releasing the environments and code makes it possible to measure whether new methods actually reduce the identified bottlenecks.
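
The planning-horizon and early-termination points above can be made concrete with a generic random-shooting planner. This is a minimal sketch under stated assumptions, not any algorithm's implementation from the benchmark: `plan_random_shooting`, the toy 1-D task, and all parameter names are hypothetical. The `horizon` argument is the quantity the planner must choose in advance, and the `done_fn` check is one possible rule for imagined rollouts that terminate before the horizon ends.

```python
import random


def plan_random_shooting(state, model_step, reward_fn, done_fn,
                         horizon, n_candidates=200, rng=None):
    """Return (first_action, predicted_return) of the best sampled action sequence."""
    rng = rng or random.Random(0)  # fixed seed so the sketch is repeatable
    best_action, best_return = 0.0, float("-inf")
    for _ in range(n_candidates):
        # Sample one candidate action sequence of length `horizon`.
        actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        s, total = state, 0.0
        for a in actions:
            s = model_step(s, a)      # imagined transition under the model
            total += reward_fn(s, a)
            if done_fn(s):            # stop counting reward once the episode would end
                break
        if total > best_return:
            best_action, best_return = actions[0], total
    return best_action, best_return


# Hypothetical 1-D task: move a point toward a goal at position 3.
def toy_model_step(s, a):
    return s + a


def toy_reward(s, a):
    return -abs(s - 3.0)


def toy_done(s):
    return s < -2.0  # the imagined episode ends if the point leaves the left edge


if __name__ == "__main__":
    for h in (1, 5, 20):
        action, ret = plan_random_shooting(0.0, toy_model_step, toy_reward,
                                           toy_done, horizon=h)
        print(f"horizon={h:2d}  first action={action:+.3f}  predicted return={ret:.3f}")
```

Truncating the return at termination is only one design choice. Because per-step rewards here are negative, an early stop makes a trajectory look better than it is, which hints at why termination rules need to be consistent between environment and planner.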

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Extending the same benchmark to physical robot tasks would test whether the three dilemmas remain the dominant limits once simulation-to-reality gaps are added.
If the dynamics, horizon, and termination issues are resolved, model-based methods could become preferable for any setting where collecting new experience is costly.
The inclusion of noisy environments suggests that robustness to observation or transition noise should be a standard requirement in future RL algorithm comparisons.

Load-bearing premise

The environments and noise conditions chosen are representative of the general difficulties in model-based RL rather than being special to this particular test collection.

What would settle it

Re-running the full set of algorithms on an independent collection of environments not constructed for MBRL and obtaining substantially different performance orderings or different explanations for the gaps would show the three dilemmas are not the main limiting factors.

Figures

Figures reproduced from arXiv: 1907.02057 by Eric Langlois, Guodong Zhang, Ignasi Clavera, Jerrick Hoang, Jimmy Ba, Pieter Abbeel, Shunshi Zhang, Tingwu Wang, Xuchan Bao, Yeming Wen.

**Figure 2.** Performance curve for each algorithm trained for 1 million time-steps. The results show that MBRL algorithms plateau at a performance level well below that of their model-free counterparts and of the same algorithms run with ground-truth dynamics. This indicates that, when the dynamics model is learned, more data does not result in better performance. For instance, PETS’s performance plateaus after 400k time-steps at a value much lower than th…

**Figure 3.** The relative performance with different planning horizons. One of the critical choices in shooting methods is the planning horizon. In …

**Figure 4.** Performance curve for MBRL algorithms. There are still 3 more figures in a continued …

**Figure 5.** (Continued) Performance curve for MBRL algorithms.

**Figure 6.** The performance curve for algorithms with noise. We represent the noise standard deviation …

**Figure 7.** (Continued) The performance curve for algorithms with noise. We represent the noise …

**Figure 8.** The performance grid using different planning horizons and depths.

**Figure 9.** The performance curve for different environment lengths and planning horizons in SLBO.

Original abstract

Model-based reinforcement learning (MBRL) is widely seen as having the potential to be significantly more sample efficient than model-free RL. However, research in model-based RL has not been very standardized. It is fairly common for authors to experiment with self-designed environments, and there are several separate lines of research, which are sometimes closed-sourced or not reproducible. Accordingly, it is an open question how these various existing MBRL algorithms perform relative to each other. To facilitate research in MBRL, in this paper we gather a wide collection of MBRL algorithms and propose over 18 benchmarking environments specially designed for MBRL. We benchmark these algorithms with unified problem settings, including noisy environments. Beyond cataloguing performance, we explore and unify the underlying algorithmic differences across MBRL algorithms. We characterize three key research challenges for future MBRL research: the dynamics bottleneck, the planning horizon dilemma, and the early-termination dilemma. Finally, to maximally facilitate future research on MBRL, we open-source our benchmark in http://www.cs.toronto.edu/~tingwuwang/mbrl.html.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript collects existing MBRL algorithms, introduces over 18 custom environments specially designed for MBRL, evaluates the algorithms under unified settings that include noise, unifies their algorithmic differences, characterizes three key challenges (dynamics bottleneck, planning horizon dilemma, early-termination dilemma), and open-sources the full benchmark.

Significance. If the observed patterns hold beyond the custom suite, the work supplies a much-needed standardized, reproducible benchmark for MBRL that directly tackles the field's fragmentation and lack of comparable results. The open-sourcing of code and environments, together with the inclusion of noisy dynamics, constitutes a concrete contribution to reproducibility that future papers can build upon.

major comments (2)

[Abstract and introduction] The central claim that the benchmark 'characterize[s] three key research challenges for future MBRL research' rests on performance patterns observed exclusively in environments described as 'specially designed for MBRL'. No comparison against established continuous-control suites (e.g., standard MuJoCo tasks) is reported, leaving open the possibility that the dynamics bottleneck, planning-horizon dilemma, and early-termination dilemma are artifacts of the chosen environment construction rather than intrinsic MBRL properties.
[Benchmark environments section] The weakest assumption identified in the reader's report—that the custom environments are representative enough to reveal general algorithmic differences—is load-bearing for the three-dilemma characterization. Without an explicit design protocol or hold-out validation set that decouples environment features from the targeted dilemmas, the empirical support for generality remains incomplete.

minor comments (1)

[Abstract] The open-source link should be verified to remain accessible and to include exact environment specifications and random seeds used in the reported runs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below, proposing targeted revisions to clarify scope and strengthen the presentation of our benchmark's design rationale while preserving the paper's focus on MBRL-specific environments.

Referee: [Abstract and introduction] The central claim that the benchmark 'characterize[s] three key research challenges for future MBRL research' rests on performance patterns observed exclusively in environments described as 'specially designed for MBRL'. No comparison against established continuous-control suites (e.g., standard MuJoCo tasks) is reported, leaving open the possibility that the dynamics bottleneck, planning-horizon dilemma, and early-termination dilemma are artifacts of the chosen environment construction rather than intrinsic MBRL properties.

Authors: We agree that the absence of direct comparisons to standard MuJoCo tasks leaves open the question of whether the three dilemmas generalize beyond our suite. Our environments were intentionally constructed to isolate MBRL-specific issues (e.g., compounding model error under noise and variable termination) that are often obscured in well-tuned model-free benchmarks. We will revise the abstract and introduction to explicitly qualify the claim: the dilemmas are characterized within an MBRL-oriented benchmark designed to surface them, rather than asserted as universal without further validation. We will also add a short discussion noting that future work could test the same algorithms on MuJoCo to assess transfer of the observed patterns. No new MuJoCo experiments are planned for this revision, as they would require a substantially expanded scope. revision: partial
Referee: [Benchmark environments section] The weakest assumption identified in the reader's report—that the custom environments are representative enough to reveal general algorithmic differences—is load-bearing for the three-dilemma characterization. Without an explicit design protocol or hold-out validation set that decouples environment features from the targeted dilemmas, the empirical support for generality remains incomplete.

Authors: We accept that an explicit design protocol was not sufficiently detailed. In the revised manuscript we will insert a dedicated subsection under 'Benchmark Environments' that enumerates the design principles: (1) controlled injection of dynamics noise to probe the dynamics bottleneck, (2) tunable planning horizons and reward sparsity to expose the planning-horizon dilemma, and (3) variable early-termination conditions to study the early-termination dilemma. Each environment's parameters are listed with the specific dilemma it targets. A hold-out validation set was not part of the original design; we will acknowledge this limitation and note it as a recommended practice for future benchmark extensions rather than adding one retroactively, as constructing a statistically independent hold-out would require new environment families. revision: yes
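
Design principle (1), controlled noise injection, could be sketched as an environment wrapper. The code below is an illustration under assumptions rather than the benchmark's implementation: the `NoisyEnv` and `CountingEnv` classes, the minimal `reset()`/`step()` interface, and the choice of zero-mean Gaussian noise with a configurable standard deviation are all hypothetical.

```python
import random


class CountingEnv:
    """Hypothetical toy environment: the observation is a step counter."""

    def reset(self):
        self.t = 0
        return float(self.t)

    def step(self, action):
        self.t += 1
        done = self.t >= 5  # fixed episode length for the toy task
        return float(self.t), -abs(action), done


class NoisyEnv:
    """Adds zero-mean Gaussian noise to the actions and observations of a wrapped env."""

    def __init__(self, env, obs_noise_std=0.0, action_noise_std=0.0, seed=0):
        self.env = env
        self.obs_noise_std = obs_noise_std
        self.action_noise_std = action_noise_std
        self.rng = random.Random(seed)  # seeded so noisy runs are repeatable

    def _perturb(self, value, std):
        # Skip the draw when noise is disabled so the clean env is reproduced exactly.
        return value if std == 0.0 else value + self.rng.gauss(0.0, std)

    def reset(self):
        return self._perturb(self.env.reset(), self.obs_noise_std)

    def step(self, action):
        noisy_action = self._perturb(action, self.action_noise_std)
        obs, reward, done = self.env.step(noisy_action)
        return self._perturb(obs, self.obs_noise_std), reward, done


if __name__ == "__main__":
    env = NoisyEnv(CountingEnv(), obs_noise_std=0.1, action_noise_std=0.03)
    obs, done = env.reset(), False
    while not done:
        obs, reward, done = env.step(0.5)
        print(f"obs={obs:.3f}  reward={reward:.3f}  done={done}")
```

Keeping the noise in a wrapper means every algorithm sees the same perturbation settings, which fits the unified-settings goal; sweeping `obs_noise_std` and `action_noise_std` over a fixed grid would give a controlled probe of model robustness.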

Circularity Check

0 steps flagged

No circularity: empirical benchmarking with no derivations or self-referential reductions

The paper is an empirical study that collects existing MBRL algorithms, introduces new environments, runs unified benchmarks, and reports observed performance patterns to characterize challenges. No equations, parameter fits, or derivations are present that reduce to inputs by construction. The characterization of challenges follows directly from the benchmark results rather than any self-definitional or fitted-input mechanism. Self-citations, if any, are not load-bearing for a central claim that reduces to prior author work. This matches the default case of a self-contained empirical paper with score 0.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central contribution rests on the domain assumption that standardizing environments and problem settings produces meaningful relative rankings of MBRL algorithms; no free parameters or invented entities are introduced beyond the choice of benchmark tasks themselves.

free parameters (1)

environment design choices
The over 18 environments are specially designed, requiring choices about dynamics, noise levels, and termination conditions that are not derived from first principles.

axioms (1)

domain assumption: MBRL algorithms can be fairly compared when run under identical environment and noise settings
Invoked when the paper states it benchmarks algorithms 'with unified problem settings, including noisy environments' to address the lack of standardization.

Forward citations

Cited by 9 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Advantage-Guided Diffusion for Model-Based Reinforcement Learning
cs.AI 2026-04 unverdicted novelty 7.0

Advantage-guided diffusion (SAG and EAG) steers sampling in diffusion world models to higher-advantage trajectories, enabling policy improvement and better sample efficiency on MuJoCo tasks.

Mastering Atari with Discrete World Models
cs.LG 2020-10 accept novelty 7.0

DreamerV2 reaches human-level performance on 55 Atari games by learning behaviors inside a separately trained discrete-latent world model.

Dream to Control: Learning Behaviors by Latent Imagination
cs.LG 2019-12 accept novelty 7.0

Dreamer learns to control from images by imagining and optimizing behaviors in a learned latent world model, outperforming prior methods on 20 visual tasks in data efficiency and final performance.

QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL
cs.LG 2026-05 unverdicted novelty 6.0

QHyer replaces return-to-go with a state-conditioned Q-estimator and adds a gated hybrid attention-mamba backbone to achieve state-of-the-art performance in offline goal-conditioned RL on both Markovian and non-Markov...

QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL
cs.LG 2026-05 unverdicted novelty 6.0

QHyer achieves state-of-the-art results in offline goal-conditioned RL by replacing return-to-go with a state-conditioned Q-estimator and introducing a gated hybrid attention-mamba backbone for content-adaptive histor...

Breaking the Computational Barrier: Provably Efficient Actor-Critic for Low-Rank MDPs
cs.LG 2026-05 unverdicted novelty 6.0

An actor-critic RL algorithm for low-rank MDPs achieves improved sample efficiency using solely a policy evaluation oracle.

Is Conditional Generative Modeling all you need for Decision-Making?
cs.LG 2022-11 unverdicted novelty 6.0

Return-conditional diffusion models for policies outperform offline RL on benchmarks by circumventing dynamic programming and enable constraint or skill composition.

What Matters in Learning from Offline Human Demonstrations for Robot Manipulation
cs.RO 2021-08 accept novelty 6.0

A comprehensive benchmark study of offline imitation learning methods on multi-stage robot manipulation tasks identifies key sensitivities to algorithm design, data quality, and stopping criteria while releasing all d...

D2 Actor Critic: Diffusion Actor Meets Distributional Critic
cs.LG 2025-10 unverdicted novelty 5.0

D2AC combines a diffusion actor with a distributional critic via fused distributional RL and clipped double Q-learning to reach state-of-the-art results on 18 hard control benchmarks including Humanoid, Dog, and Shadow Hand.

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · cited by 8 Pith papers · 17 internal anchors

[1]

Learning Dexterous In-Hand Manipulation

Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Jozefowicz, Bob McGrew, Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, et al. Learning dexterous in-hand manipulation. arXiv preprint arXiv:1808.00177, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[2]

Unifying count-based exploration and intrinsic motivation

Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. Unifying count-based exploration and intrinsic motivation. In D. D. Lee, M. Sugiyama, U. V . Luxburg, I. Guyon, and R. Garnett, editors,Advances in Neural Information Processing Systems 29, pages 1471–1479. Curran Associates, Inc., 2016

work page 2016
[3]

The explicit linear quadratic regulator for constrained systems

Alberto Bemporad, Manfred Morari, Vivek Dua, and Efstratios N Pistikopoulos. The explicit linear quadratic regulator for constrained systems. Automatica, 38(1):3–20, 2002

work page 2002
[4]

The cross-entropy method for optimization

Zdravko I Botev, Dirk P Kroese, Reuven Y Rubinstein, and Pierre L’Ecuyer. The cross-entropy method for optimization. In Handbook of statistics, volume 31, pages 35–59. Elsevier, 2013

work page 2013
[5]

OpenAI Gym

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[6]

Path integral guided policy search

Yevgen Chebotar, Mrinal Kalakrishnan, Ali Yahya, Adrian Li, Stefan Schaal, and Sergey Levine. Path integral guided policy search. In 2017 IEEE international conference on robotics and automation (ICRA), pages 3381–3388. IEEE, 2017

work page 2017
[7]

Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models

Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforce- ment learning in a handful of trials using probabilistic dynamics models. arXiv preprint arXiv:1805.12114, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[8]

Model-Based Reinforcement Learning via Meta-Policy Optimization

Ignasi Clavera, Jonas Rothfuss, John Schulman, Yasuhiro Fujita, Tamim Asfour, and Pieter Abbeel. Model-based reinforcement learning via meta-policy optimization. CoRR, abs/1809.05214, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[9]

A tutorial on the cross-entropy method

Pieter-Tjerk De Boer, Dirk P Kroese, Shie Mannor, and Reuven Y Rubinstein. A tutorial on the cross-entropy method. Annals of operations research, 134(1):19–67, 2005

work page 2005
[10]

Model based bayesian exploration

Richard Dearden, Nir Friedman, and David Andre. Model based bayesian exploration. In Proceedings of the Fifteenth Conference on Uncertainty in Artiﬁcial Intelligence, UAI’99, pages 150–159, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc

work page 1999
[11]

Pilco: A model-based and data-efﬁcient approach to policy search

Marc Deisenroth and Carl E Rasmussen. Pilco: A model-based and data-efﬁcient approach to policy search. In Proceedings of the 28th International Conference on machine learning (ICML-11), pages 465–472, 2011

work page 2011
[12]

Gaussian processes for data-efﬁcient learning in robotics and control

Marc Peter Deisenroth, Dieter Fox, and Carl Edward Rasmussen. Gaussian processes for data-efﬁcient learning in robotics and control. IEEE transactions on pattern analysis and machine intelligence, 37(2):408–423, 2015

work page 2015
[13]

Imagenet: A large- scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large- scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009

work page 2009
[14]

Benchmarking deep reinforcement learning for continuous control

Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pages 1329–1338, 2016

work page 2016
[15]

C. Finn, M. Zhang, J. Fu, X. Tan, Z. McCarthy, E. Scharff, and S. Levine. Guided policy search code implementation, 2016. Software available from rll.berkeley.edu/gps

work page 2016
[16]

Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks

Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adapta- tion of deep networks. CoRR, abs/1703.03400, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[17]

Guided cost learning: Deep inverse optimal control via policy optimization

Chelsea Finn, Sergey Levine, and Pieter Abbeel. Guided cost learning: Deep inverse optimal control via policy optimization. In International Conference on Machine Learning, pages 49–58, 2016

work page 2016
[18]

Addressing Function Approximation Error in Actor-Critic Methods

Scott Fujimoto, Herke van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[19]

Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off- policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018. 9

work page internal anchor Pith review Pith/arXiv arXiv 2018
[20]

Emergence of Locomotion Behaviours in Rich Environments

Nicolas Heess, Srinivasan Sriram, Jay Lemmon, Josh Merel, Greg Wayne, Yuval Tassa, Tom Erez, Ziyu Wang, Ali Eslami, Martin Riedmiller, et al. Emergence of locomotion behaviours in rich environments. arXiv preprint arXiv:1707.02286, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[21]

Learning continuous control policies by stochastic value gradients

Nicolas Heess, Gregory Wayne, David Silver, Timothy Lillicrap, Tom Erez, and Yuval Tassa. Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems, pages 2944–2952, 2015

work page 2015
[22]

Deep reinforcement learning that matters

Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. Deep reinforcement learning that matters. In Thirty-Second AAAI Conference on Artiﬁcial Intelligence, 2018

work page 2018
[23]

Vime: Variational information maximizing exploration

Rein Houthooft, Xi Chen, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. Vime: Variational information maximizing exploration. In D. D. Lee, M. Sugiyama, U. V . Luxburg, I. Guyon, and R. Garnett, editors,Advances in Neural Information Processing Systems 29, pages 1109–1117. Curran Associates, Inc., 2016

work page 2016
[24]

Reproducibility of Benchmarked Deep Reinforcement Learning Tasks for Continuous Control

Riashat Islam, Peter Henderson, Maziar Gomrokchi, and Doina Precup. Reproducibility of benchmarked deep reinforcement learning tasks for continuous control. arXiv preprint arXiv:1708.04133, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[25]

Data-Efficient Reinforcement Learning with Probabilistic Model Predictive Control

Sanket Kamthe and Marc Peter Deisenroth. Data-efﬁcient reinforcement learning with proba- bilistic model predictive control. arXiv preprint arXiv:1706.06491, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[26]

Moses: Open source toolkit for statistical machine translation

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th annual meeting of the association for computational linguistics companion volume proceedings of the d...

work page 2007
[27]

Model-Ensemble Trust-Region Policy Optimization

Thanard Kurutach, Ignasi Clavera, Yan Duan, Aviv Tamar, and Pieter Abbeel. Model-ensemble trust-region policy optimization. arXiv preprint arXiv:1802.10592, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[28]

Learning neural network policies with guided policy search under unknown dynamics

Sergey Levine and Pieter Abbeel. Learning neural network policies with guided policy search under unknown dynamics. In Advances in Neural Information Processing Systems , pages 1071–1079, 2014

work page 2014
[29]

Learning contact-rich manipulation skills with guided policy search

Sergey Levine, Nolan Wagener, and Pieter Abbeel. Learning contact-rich manipulation skills with guided policy search. In 2015 IEEE international conference on robotics and automation (ICRA), pages 156–163. IEEE, 2015

work page 2015
[30]

Continuous control with deep reinforcement learning

Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[31]

Microsoft coco: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014

work page 2014
[32]

Algo- rithmic framework for model-based deep reinforcement learning with theoretical guarantees

Yuping Luo, Huazhe Xu, Yuanzhi Li, Yuandong Tian, Trevor Darrell, and Tengyu Ma. Algo- rithmic framework for model-based deep reinforcement learning with theoretical guarantees. ICLR, 2019

work page 2019
[33]

Asynchronous methods for deep rein- forcement learning

V olodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep rein- forcement learning. In International Conference on Machine Learning , pages 1928–1937, 2016

work page 1928
[34]

Playing Atari with Deep Reinforcement Learning

V olodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013
[35]

Guided policy search via approximate mirror descent

William H Montgomery and Sergey Levine. Guided policy search via approximate mirror descent. In Advances in Neural Information Processing Systems, pages 4008–4016, 2016

work page 2016
[36]

Neural Network Dynamics for Model-Based Deep Reinforcement Learning with Model-Free Fine-Tuning

Anusha Nagabandi, Gregory Kahn, Ronald S Fearing, and Sergey Levine. Neural network dynamics for model-based deep reinforcement learning with model-free ﬁne-tuning. arXiv preprint arXiv:1708.02596, 2017. 10

work page internal anchor Pith review Pith/arXiv arXiv 2017
[37]

Librispeech: an asr corpus based on public domain audio books

Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206–5210. IEEE, 2015

work page 2015
[38]

Deepmimic: Example- guided deep reinforcement learning of physics-based character skills

Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel van de Panne. Deepmimic: Example- guided deep reinforcement learning of physics-based character skills. ACM Transactions on Graphics (TOG), 37(4):143, 2018

work page 2018
[39]

A survey of numerical methods for optimal control

Anil V Rao. A survey of numerical methods for optimal control. Advances in the Astronautical Sciences, 135(1):497–528, 2009

work page 2009
[40]

Robust constrained model predictive control

Arthur George Richards. Robust constrained model predictive control . PhD thesis, Mas- sachusetts Institute of Technology, 2005

work page 2005
[41]

A reduction of imitation learning and structured prediction to no-regret online learning

Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artiﬁcial intelligence and statistics, pages 627–635, 2011

work page 2011
[42]

Ray Interference: a Source of Plateaus in Deep Reinforcement Learning

Tom Schaul, Diana Borsa, Joseph Modayil, and Razvan Pascanu. Ray interference: a source of plateaus in deep reinforcement learning. arXiv preprint arXiv:1904.11455, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1904
[43]

Trust region policy optimization

John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1889–1897, 2015

work page 2015
[44]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[45]

Integrated architectures for learning, planning, and reacting based on approximating dynamic programming

Richard S Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Machine Learning Proceedings 1990, pages 216–224. Elsevier, 1990

work page 1990
[46]

Dyna, an integrated architecture for learning, planning, and reacting

Richard S Sutton. Dyna, an integrated architecture for learning, planning, and reacting. ACM SIGART Bulletin, 2(4):160–163, 1991

work page 1991
[47]

Planning by incremental dynamic programming

Richard S Sutton. Planning by incremental dynamic programming. In Machine Learning Proceedings 1991, pages 353–357. Elsevier, 1991

work page 1991
[48]

DeepMind Control Suite

Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, et al. Deepmind control suite. arXiv preprint arXiv:1801.00690, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[49]

Synthesis and stabilization of complex behaviors through online trajectory optimization

Yuval Tassa, Tom Erez, and Emanuel Todorov. Synthesis and stabilization of complex behaviors through online trajectory optimization. InIntelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pages 4906–4913. IEEE, 2012

[50]

Seaborn: statistical data visualization

Michael Waskom. Seaborn: statistical data visualization. http://seaborn.pydata.org/. Accessed: 2010-09-30

[51]

Efficient model-based exploration

Marco Wiering and Jürgen Schmidhuber. Efficient model-based exploration. In Proceedings of the Sixth International Conference on Simulation of Adaptive Behavior: From Animals to Animats 6, pages 223–228. MIT Press/Bradford Books, 1998

[52]

Learning deep control policies for autonomous aerial vehicles with MPC-guided policy search

Tianhao Zhang, Gregory Kahn, Sergey Levine, and Pieter Abbeel. Learning deep control policies for autonomous aerial vehicles with mpc-guided policy search. In 2016 IEEE international conference on robotics and automation (ICRA), pages 528–535. IEEE, 2016

A Environment Overview

We provide an overview of the environments in this section. Table 6 shows t...

