Benchmarking Model-Based Reinforcement Learning
Pith reviewed 2026-05-25 10:10 UTC · model grok-4.3
The pith
A unified benchmark of model-based RL algorithms reveals three recurring challenges: inaccurate dynamics, uncertain planning horizons, and early episode termination.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When a broad set of model-based RL algorithms is evaluated inside the same collection of environments and problem settings, their differences reduce to three core difficulties: the dynamics bottleneck in which small model errors grow during planning, the planning horizon dilemma of choosing how far to look ahead without knowing the right length in advance, and the early-termination dilemma created by episodes that end before the planned horizon is reached.
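The dynamics bottleneck can be illustrated with a toy calculation. The sketch below is not taken from the paper: it assumes a hypothetical scalar linear system x_{t+1} = a * x_t and a learned model whose coefficient is off by 1%, then rolls both forward to compare the one-step error with the error at a longer planning horizon. All names and values are illustrative assumptions.

```python
def rollout(a, x0, horizon):
    """Roll the scalar system x_{t+1} = a * x_t forward from x0."""
    xs = [x0]
    for _ in range(horizon):
        xs.append(a * xs[-1])
    return xs


def compounding_error(a_true, a_model, x0, horizon):
    """Absolute gap between the true rollout and the model rollout at each step."""
    true_xs = rollout(a_true, x0, horizon)
    model_xs = rollout(a_model, x0, horizon)
    return [abs(t - m) for t, m in zip(true_xs, model_xs)]


# Hypothetical values chosen only for illustration (1% coefficient error).
errors = compounding_error(a_true=1.0, a_model=1.01, x0=1.0, horizon=50)
print(f"1-step error:  {errors[1]:.4f}")
print(f"50-step error: {errors[50]:.4f}")
```

In this toy setting the gap after 50 steps is larger than 50 times the one-step gap, which is the sense in which small model errors compound rather than merely add up.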
What carries the argument
The suite of over 18 MBRL-specific environments together with the standardized evaluation protocol that isolates the effects of dynamics learning, planning length, and episode termination.
If this is right
- Further gains require dynamics models whose prediction errors do not compound over multiple planning steps.
- Algorithms must incorporate explicit mechanisms for selecting or adapting the number of steps they plan ahead.
- Environments and algorithms need consistent rules for handling episodes that terminate before the planning horizon ends.
- Releasing the environments and code makes it possible to measure whether new methods actually reduce the identified bottlenecks.
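The interaction between the planning horizon and early termination can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the benchmark's implementation: `imagined_return`, `model_step`, `reward_fn`, and `is_terminal` are hypothetical names, and the 1-D task is invented for the example. It compares a rollout that stops at the first terminal state with one that ignores termination and keeps accumulating reward over the full planned horizon.

```python
def imagined_return(state, actions, model_step, reward_fn, is_terminal):
    """Sum rewards along an imagined rollout, stopping at the first terminal state.

    Returns (total_reward, steps_used).
    """
    total, steps = 0.0, 0
    for action in actions:
        state = model_step(state, action)
        total += reward_fn(state, action)
        steps += 1
        if is_terminal(state):
            break  # the real episode would end here
    return total, steps


# Hypothetical 1-D task: the state is a height, the action shifts it,
# each step earns an alive bonus, and the episode ends when height < 0.
def model_step(s, a):
    return s + a


def reward_fn(s, a):
    return 1.0


def is_terminal(s):
    return s < 0.0


plan = [-0.6, -0.6, 0.5, 0.5, 0.5]  # planned horizon of 5 steps
masked = imagined_return(1.0, plan, model_step, reward_fn, is_terminal)
naive = imagined_return(1.0, plan, model_step, reward_fn, lambda s: False)
print("with termination check:", masked)
print("ignoring termination:  ", naive)
```

The planner that ignores termination credits reward to steps the real episode would never reach, so the two evaluations of the same plan disagree.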
Where Pith is reading between the lines
- Extending the same benchmark to physical robot tasks would test whether the three dilemmas remain the dominant limits once simulation-to-reality gaps are added.
- If the dynamics, horizon, and termination issues are resolved, model-based methods could become preferable for any setting where collecting new experience is costly.
- The inclusion of noisy environments suggests that robustness to observation or transition noise should be a standard requirement in future RL algorithm comparisons.
Load-bearing premise
The environments and noise conditions chosen are representative of the general difficulties in model-based RL rather than being special to this particular test collection.
What would settle it
Re-running the full set of algorithms on an independent collection of environments not constructed for MBRL and obtaining substantially different performance orderings or different explanations for the gaps would show the three dilemmas are not the main limiting factors.
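One way to quantify "substantially different performance orderings" is a rank correlation between the two suites. The sketch below is an assumption about how such a check might be run, not a procedure from the paper: it computes Kendall's tau-a in plain Python, and the algorithm labels and scores are placeholders rather than reported results.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall rank correlation (tau-a) between two score dicts keyed by algorithm name."""
    names = sorted(set(scores_a) & set(scores_b))
    concordant = discordant = 0
    for x, y in combinations(names, 2):
        sign = (scores_a[x] - scores_a[y]) * (scores_b[x] - scores_b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(names) * (len(names) - 1) // 2
    return (concordant - discordant) / pairs if pairs else 0.0


# Placeholder labels and numbers, NOT results from the paper.
suite_original = {"algo_a": 300.0, "algo_b": 250.0, "algo_c": 100.0, "algo_d": 50.0}
suite_independent = {"algo_a": 280.0, "algo_b": 90.0, "algo_c": 120.0, "algo_d": 40.0}

tau = kendall_tau(suite_original, suite_independent)
print(f"Kendall tau between suites: {tau:.3f}")
```

A tau near 1 would mean the ordering of algorithms carries over to the independent suite; a value near 0 or below would point toward the orderings being specific to the original environments.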
Original abstract
Model-based reinforcement learning (MBRL) is widely seen as having the potential to be significantly more sample efficient than model-free RL. However, research in model-based RL has not been very standardized. It is fairly common for authors to experiment with self-designed environments, and there are several separate lines of research, which are sometimes closed-sourced or not reproducible. Accordingly, it is an open question how these various existing MBRL algorithms perform relative to each other. To facilitate research in MBRL, in this paper we gather a wide collection of MBRL algorithms and propose over 18 benchmarking environments specially designed for MBRL. We benchmark these algorithms with unified problem settings, including noisy environments. Beyond cataloguing performance, we explore and unify the underlying algorithmic differences across MBRL algorithms. We characterize three key research challenges for future MBRL research: the dynamics bottleneck, the planning horizon dilemma, and the early-termination dilemma. Finally, to maximally facilitate future research on MBRL, we open-source our benchmark in http://www.cs.toronto.edu/~tingwuwang/mbrl.html.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript collects existing MBRL algorithms, introduces over 18 custom environments specially designed for MBRL, evaluates the algorithms under unified settings that include noise, unifies their algorithmic differences, characterizes three key challenges (dynamics bottleneck, planning horizon dilemma, early-termination dilemma), and open-sources the full benchmark.
Significance. If the observed patterns hold beyond the custom suite, the work supplies a much-needed standardized, reproducible benchmark for MBRL that directly tackles the field's fragmentation and lack of comparable results. The open-sourcing of code and environments, together with the inclusion of noisy dynamics, constitutes a concrete contribution to reproducibility that future papers can build upon.
major comments (2)
- [Abstract and introduction] The central claim that the benchmark 'characterize[s] three key research challenges for future MBRL research' rests on performance patterns observed exclusively in environments described as 'specially designed for MBRL'. No comparison against established continuous-control suites (e.g., standard MuJoCo tasks) is reported, leaving open the possibility that the dynamics bottleneck, planning-horizon dilemma, and early-termination dilemma are artifacts of the chosen environment construction rather than intrinsic MBRL properties.
- [Benchmark environments section] The weakest assumption identified in the reader's report, that the custom environments are representative enough to reveal general algorithmic differences, is load-bearing for the three-dilemma characterization. Without an explicit design protocol or hold-out validation set that decouples environment features from the targeted dilemmas, the empirical support for generality remains incomplete.
minor comments (1)
- [Abstract] The open-source link should be verified to remain accessible and to include exact environment specifications and random seeds used in the reported runs.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. We address each major comment below, proposing targeted revisions to clarify scope and strengthen the presentation of our benchmark's design rationale while preserving the paper's focus on MBRL-specific environments.
- Referee: [Abstract and introduction] The central claim that the benchmark 'characterize[s] three key research challenges for future MBRL research' rests on performance patterns observed exclusively in environments described as 'specially designed for MBRL'. No comparison against established continuous-control suites (e.g., standard MuJoCo tasks) is reported, leaving open the possibility that the dynamics bottleneck, planning-horizon dilemma, and early-termination dilemma are artifacts of the chosen environment construction rather than intrinsic MBRL properties.
Authors: We agree that the absence of direct comparisons to standard MuJoCo tasks leaves open the question of whether the three dilemmas generalize beyond our suite. Our environments were intentionally constructed to isolate MBRL-specific issues (e.g., compounding model error under noise and variable termination) that are often obscured in well-tuned model-free benchmarks. We will revise the abstract and introduction to explicitly qualify the claim: the dilemmas are characterized within an MBRL-oriented benchmark designed to surface them, rather than asserted as universal without further validation. We will also add a short discussion noting that future work could test the same algorithms on MuJoCo to assess transfer of the observed patterns. No new MuJoCo experiments are planned for this revision, as they would require a substantially expanded scope.
revision: partial
- Referee: [Benchmark environments section] The weakest assumption identified in the reader's report, that the custom environments are representative enough to reveal general algorithmic differences, is load-bearing for the three-dilemma characterization. Without an explicit design protocol or hold-out validation set that decouples environment features from the targeted dilemmas, the empirical support for generality remains incomplete.
Authors: We accept that an explicit design protocol was not sufficiently detailed. In the revised manuscript we will insert a dedicated subsection under 'Benchmark Environments' that enumerates the design principles: (1) controlled injection of dynamics noise to probe the dynamics bottleneck, (2) tunable planning horizons and reward sparsity to expose the planning-horizon dilemma, and (3) variable early-termination conditions to study the early-termination dilemma. Each environment's parameters are listed with the specific dilemma it targets. A hold-out validation set was not part of the original design; we will acknowledge this limitation and note it as a recommended practice for future benchmark extensions rather than adding one retroactively, as constructing a statistically independent hold-out would require new environment families.
revision: yes
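Design principle (1), controlled injection of dynamics noise, could be realized with a thin wrapper around a deterministic transition function. The following is a hypothetical sketch and not the benchmark's code: `NoisyDynamics`, `point_mass`, the time step, and the noise levels are all assumptions made for illustration.

```python
import random


class NoisyDynamics:
    """Wrap a deterministic step function and add zero-mean Gaussian noise to the next state."""

    def __init__(self, step_fn, noise_std, seed=0):
        self.step_fn = step_fn
        self.noise_std = noise_std
        self.rng = random.Random(seed)  # private generator so runs are reproducible

    def step(self, state, action):
        next_state = self.step_fn(state, action)
        return [s + self.rng.gauss(0.0, self.noise_std) for s in next_state]


def point_mass(state, action):
    """Hypothetical 1-D point mass with state [position, velocity] and time step 0.1."""
    pos, vel = state
    vel = vel + 0.1 * action
    return [pos + 0.1 * vel, vel]


clean = NoisyDynamics(point_mass, noise_std=0.0)
noisy = NoisyDynamics(point_mass, noise_std=0.05, seed=1)
print("noise-free step:", clean.step([0.0, 0.0], 1.0))
```

Keeping the noise scale as a single constructor argument makes it straightforward to sweep the noise level per environment while holding the underlying dynamics fixed.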
Circularity Check
No circularity: empirical benchmarking with no derivations or self-referential reductions
The paper is an empirical study that collects existing MBRL algorithms, introduces new environments, runs unified benchmarks, and reports observed performance patterns to characterize challenges. No equations, parameter fits, or derivations are present that reduce to inputs by construction. The characterization of challenges follows directly from the benchmark results rather than any self-definitional or fitted-input mechanism. Self-citations, if any, are not load-bearing for a central claim that reduces to prior author work. This matches the default case of a self-contained empirical paper with score 0.
Axiom & Free-Parameter Ledger
free parameters (1)
- environment design choices
axioms (1)
- domain assumption: MBRL algorithms can be fairly compared when run under identical environment and noise settings
Forward citations
Cited by 9 Pith papers
- Advantage-Guided Diffusion for Model-Based Reinforcement Learning
  Advantage-guided diffusion (SAG and EAG) steers sampling in diffusion world models to higher-advantage trajectories, enabling policy improvement and better sample efficiency on MuJoCo tasks.
- Mastering Atari with Discrete World Models
  DreamerV2 reaches human-level performance on 55 Atari games by learning behaviors inside a separately trained discrete-latent world model.
- Dream to Control: Learning Behaviors by Latent Imagination
  Dreamer learns to control from images by imagining and optimizing behaviors in a learned latent world model, outperforming prior methods on 20 visual tasks in data efficiency and final performance.
- QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL
  QHyer replaces return-to-go with a state-conditioned Q-estimator and adds a gated hybrid attention-mamba backbone to achieve state-of-the-art performance in offline goal-conditioned RL on both Markovian and non-Markov...
- QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL
  QHyer achieves state-of-the-art results in offline goal-conditioned RL by replacing return-to-go with a state-conditioned Q-estimator and introducing a gated hybrid attention-mamba backbone for content-adaptive histor...
- Breaking the Computational Barrier: Provably Efficient Actor-Critic for Low-Rank MDPs
  An actor-critic RL algorithm for low-rank MDPs achieves improved sample efficiency using solely a policy evaluation oracle.
- Is Conditional Generative Modeling all you need for Decision-Making?
  Return-conditional diffusion models for policies outperform offline RL on benchmarks by circumventing dynamic programming and enable constraint or skill composition.
- What Matters in Learning from Offline Human Demonstrations for Robot Manipulation
  A comprehensive benchmark study of offline imitation learning methods on multi-stage robot manipulation tasks identifies key sensitivities to algorithm design, data quality, and stopping criteria while releasing all d...
- D2 Actor Critic: Diffusion Actor Meets Distributional Critic
  D2AC combines a diffusion actor with a distributional critic via fused distributional RL and clipped double Q-learning to reach state-of-the-art results on 18 hard control benchmarks including Humanoid, Dog, and Shadow Hand.