Biased Dreams: Limitations to Epistemic Uncertainty Quantification in Latent Space Models
Pith reviewed 2026-05-07 16:52 UTC · model grok-4.3
The pith
Latent dynamics models in reinforcement learning bias transitions toward well-represented regions, which can hide environment discrepancies and overestimate rewards.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Latent transitions are biased toward well-represented regions of latent space and exhibit an attractor behavior that can deviate from true environment dynamics. As a result, discrepancies in environment dynamics may not manifest as high epistemic uncertainty in latent space. Because these attractors often lie in high-reward regions, latent rollouts systematically overestimate predicted rewards. The findings indicate that epistemic uncertainty quantification, although well established for physical dynamics models, has key limitations when applied to the latent dynamics models used in methods such as Dreamer.
What carries the argument
Attractor behavior of latent transitions in Recurrent State Space Models, which draws next-state predictions toward densely sampled regions irrespective of actual dynamics.
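The claimed mechanism can be caricatured in a few lines of NumPy (a toy sketch under stated assumptions, not the paper's experiment): every ensemble member contracts toward the same densely sampled latent point, so members agree with one another, keeping the standard disagreement-based uncertainty proxy low, while their shared prediction drifts away from the true dynamics.

```python
# Toy illustration (not from the paper): an "ensemble" of latent transition
# models that are all contractive toward the same well-represented region.
# Ensemble disagreement, a standard epistemic-uncertainty proxy, stays small
# even while the prediction error against the true dynamics grows.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical densely sampled latent region acting as the attractor.
attractor = np.array([1.0, 1.0])

def make_member(noise_scale=0.02):
    """One ensemble member: a contraction toward the attractor, with a small
    member-specific offset standing in for training stochasticity."""
    eps = rng.normal(scale=noise_scale, size=2)
    def step(z):
        return attractor + 0.5 * (z - attractor) + eps
    return step

ensemble = [make_member() for _ in range(5)]

def true_step(z):
    """Hypothetical true dynamics (latent-equivalent): drifts away from the attractor."""
    return z + np.array([0.3, -0.2])

z_true = np.array([0.0, 0.0])
z_model = np.array([0.0, 0.0])
for t in range(10):
    preds = np.stack([m(z_model) for m in ensemble])
    disagreement = float(preds.std(axis=0).mean())  # epistemic-uncertainty proxy
    z_model = preds.mean(axis=0)
    z_true = true_step(z_true)

error = float(np.linalg.norm(z_model - z_true))
```

After the rollout, `disagreement` stays near the member-noise scale while `error` grows with the horizon: low epistemic uncertainty despite large model error, which is the failure mode the paper describes.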
If this is right
- Discrepancies between the model and the true environment may remain invisible to epistemic uncertainty measures in latent space.
- Reward predictions from latent rollouts are biased upward when attractors coincide with high-reward areas.
- Exploration driven by epistemic uncertainty can be directed toward the wrong states because uncertainty does not reflect true model error.
- Standard techniques for mitigating model exploitation in model-based reinforcement learning lose effectiveness when the dynamics model operates in latent space.
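The last point can be made concrete with an uncertainty-penalized reward of the MOPO flavor, r_pess = r_hat - lam * u(z) (a minimal sketch with hypothetical numbers, not the paper's or MOPO's exact formulation):

```python
# Sketch (hypothetical numbers): a pessimistic reward that subtracts an
# epistemic-uncertainty penalty. If uncertainty u(z) collapses near an
# attractor, the penalty vanishes exactly where the predicted reward is
# inflated, so the safeguard fails to discount the overestimate.

def pessimistic_reward(r_hat: float, uncertainty: float, lam: float = 1.0) -> float:
    """MOPO-style pessimism: predicted reward minus scaled uncertainty."""
    return r_hat - lam * uncertainty

# An attractor state with an inflated predicted reward but near-zero
# measured ensemble disagreement is barely penalized...
r_attractor = pessimistic_reward(r_hat=2.0, uncertainty=0.01)
# ...and still dominates an honestly estimated in-distribution state.
r_honest = pessimistic_reward(r_hat=1.0, uncertainty=0.01)
```

Because the penalty scales with the (unreliable) uncertainty estimate rather than the true model error, the planner still prefers the attractor state.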
Where Pith is reading between the lines
- Agents trained with these latent models may appear competent in simulation but encounter unexpected failures on transfer because hidden dynamics errors are not reflected in uncertainty signals.
- Hybrid approaches that combine latent representations with occasional physical-state checks could reduce the impact of attractor bias.
- Long-horizon planning in latent space may require additional consistency constraints beyond current uncertainty quantification to avoid compounding errors along attractor trajectories.
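One hypothetical shape such a consistency check could take (all names, stand-ins, and thresholds below are assumptions, not the paper's proposal): periodically decode the imagined latent state back to observation space and compare it against what the real environment shows, flagging rollout steps whose reconstruction has drifted.

```python
# Sketch of a hybrid latent/physical consistency check. `decoder` and
# `env_observation` are placeholders for components an RSSM-style agent
# already has; the threshold is illustrative.
import numpy as np

def consistency_gap(decoder, env_observation, z, threshold=0.5):
    """Return the reconstruction gap for latent state z and whether it
    exceeds the drift threshold."""
    recon = decoder(z)
    gap = float(np.linalg.norm(recon - env_observation))
    return gap, gap > threshold

# Toy stand-ins for the placeholder components:
decoder = lambda z: z * 2.0          # pretend decode back to observation space
env_obs = np.array([0.0, 0.0])       # what the real environment actually shows
gap, flagged = consistency_gap(decoder, env_obs, z=np.array([1.0, 1.0]))
```

A flagged step would signal a dynamics mismatch that, per the paper's argument, the latent uncertainty estimate alone would miss.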
Load-bearing premise
That the observed attractor behavior and its effects on uncertainty and reward estimates are general properties of latent dynamics models rather than limited to the environments and architectures tested.
What would settle it
Train a latent dynamics model on an environment that contains a clear dynamics change only in sparsely sampled regions, then measure whether epistemic uncertainty rises in those regions or whether the model still collapses predictions to high-density attractors without flagging the mismatch.
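A measurement harness for that experiment might look like the following toy sketch (hypothetical 1D environment and bootstrap ensemble, not the paper's setup). In the proprioceptive setting shown here, ensemble disagreement is expected to rise in the sparsely sampled region; the open question is whether a latent model measured the same way does too.

```python
# Toy harness: a 1D environment whose dynamics change only for |s| > 2,
# while training data is concentrated in |s| < 1. A bootstrap ensemble of
# cubic polynomial models serves as a cheap epistemic-uncertainty proxy.
import numpy as np

rng = np.random.default_rng(1)

def true_dynamics(s):
    # The dynamics change lives only in the sparsely sampled region.
    return np.where(np.abs(s) > 2.0, -s, 0.9 * s)

# Training data from the dense region, with observation noise.
s_train = rng.uniform(-1.0, 1.0, size=200)
y_train = true_dynamics(s_train) + rng.normal(scale=0.05, size=200)

# Bootstrap ensemble: refit on resampled data to estimate disagreement.
ensemble = []
for _ in range(10):
    idx = rng.integers(0, len(s_train), size=len(s_train))
    ensemble.append(np.polyfit(s_train[idx], y_train[idx], deg=3))

def disagreement(s):
    preds = np.stack([np.polyval(c, s) for c in ensemble])
    return preds.std(axis=0)

u_dense = float(disagreement(np.array([0.5]))[0])   # well-covered state
u_sparse = float(disagreement(np.array([3.0]))[0])  # dynamics-change region
```

For this physical-state ensemble, `u_sparse` exceeds `u_dense` by a wide margin; running the analogous measurement through a latent model and finding no such rise would confirm the paper's claim.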
Original abstract
Model-Based Reinforcement Learning distinguishes between physical dynamics models operating on proprioceptive inputs and latent dynamics models operating on high-dimensional image observations. A prominent latent approach is the Recurrent State Space Model used in the Dreamer family. While epistemic uncertainty quantification to inform exploration and mitigate model exploitation is well established for physical dynamics models, its transfer to latent dynamics models has received limited scrutiny. We empirically demonstrate that latent transitions are biased toward well-represented regions of latent space, exhibiting an attractor behavior that can deviate from true environment dynamics. As a result, discrepancies in environment dynamics may not manifest in latent space, undermining the reliability of epistemic uncertainty estimates. Because these attractors often lie in high-reward regions, latent rollouts systematically overestimate predicted rewards. Our findings highlight key limitations of epistemic uncertainty estimation in latent dynamics models and motivate more critical evaluation of this method.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper empirically demonstrates that in latent dynamics models such as the Recurrent State Space Model from the Dreamer family, transitions are biased toward well-represented regions of latent space. This produces attractor behavior that deviates from true environment dynamics, so that discrepancies in the environment may not appear in latent space. Consequently, epistemic uncertainty estimates become unreliable, and latent rollouts systematically overestimate rewards because the attractors often lie in high-reward regions.
Significance. If the observed bias and its consequences hold beyond the tested setups, the result is significant for model-based RL. It identifies a concrete mechanism by which epistemic uncertainty quantification can fail when transferred from proprioceptive to high-dimensional latent models, directly affecting exploration and model-exploitation safeguards. The work therefore supplies a practical caution that can guide the design of more robust uncertainty methods in latent-space MBRL.
Major comments (2)
- [Empirical Evaluation] The central claim that attractor bias is a general property of latent dynamics models rests on empirical demonstrations in the Dreamer family and selected environments. No theoretical derivation is supplied showing why any latent transition model must develop such attractors; the argument therefore inherits the risk that the effect is architecture- or training-specific.
- [Empirical demonstrations] Support for the claim that epistemic uncertainty estimates are undermined and rewards overestimated is described as moderate because full details on controls, quantitative metrics, and statistical significance are not provided in the abstract-level summary of the experiments. This leaves open the possibility of example selection and weakens the load-bearing assertion that the bias is systematic.
Simulated Author's Rebuttal
We thank the referee for the constructive and insightful comments. We address each major comment below, clarifying the empirical scope of our work and expanding experimental details where appropriate. Revisions have been made to improve precision and completeness.
Point-by-point responses
-
Referee: [Empirical Evaluation] The central claim that attractor bias is a general property of latent dynamics models rests on empirical demonstrations in the Dreamer family and selected environments. No theoretical derivation is supplied showing why any latent transition model must develop such attractors; the argument therefore inherits the risk that the effect is architecture- or training-specific.
Authors: We agree that the manuscript does not supply a theoretical derivation establishing that attractor bias must arise in every possible latent dynamics model. Our contribution is an empirical demonstration within the Recurrent State Space Models of the Dreamer family and the environments evaluated. The revised manuscript now explicitly states this scope in the abstract, introduction, and discussion, and lists the lack of a general theoretical guarantee as a limitation. We believe the empirical evidence remains valuable for highlighting a concrete failure mode in widely deployed latent MBRL methods. Revision: yes.
-
Referee: [Empirical demonstrations] Support for the claim that epistemic uncertainty estimates are undermined and rewards overestimated is described as moderate because full details on controls, quantitative metrics, and statistical significance are not provided in the abstract-level summary of the experiments. This leaves open the possibility of example selection and weakens the load-bearing assertion that the bias is systematic.
Authors: We acknowledge that the original presentation summarized experiments at a high level and omitted some controls and metrics. The revised manuscript expands the experimental section with full hyperparameter tables, descriptions of control experiments (including ablations on representation learning), the precise quantitative metrics used to quantify attractor bias and reward overestimation, and statistical significance results across multiple random seeds and environments. Additional environments and runs have been added to reduce the risk of example selection. Revision: yes.
- Remaining limitation: absence of a theoretical derivation proving that attractor bias must occur in arbitrary latent transition models.
Circularity Check
No circularity: purely empirical observations with no derivations or self-referential steps
Full rationale
The paper's central claims rest on direct empirical demonstrations of attractor behavior in latent transitions within the Dreamer family of models, using experimental rollouts and uncertainty estimates across selected environments. No mathematical derivations, fitted parameters renamed as predictions, ansatzes, or load-bearing self-citations are present in the provided text or abstract. The analysis does not invoke uniqueness theorems, prior author work for core premises, or any chain that reduces a result to its own inputs by construction. This is a standard empirical limitation study whose validity hinges on experimental reproducibility rather than internal logical closure.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: latent space models can be trained to approximate environment dynamics from image observations
Reference graph
Works this paper leans on
-
[1]
On uncertainty in deep state space models for model-based reinforcement learning
Philipp Becker and Gerhard Neumann. On uncertainty in deep state space models for model-based reinforcement learning. Transactions on Machine Learning Research (TMLR), 2022
2022
-
[2]
Combining reconstruction and contrastive methods for multimodal representations in rl
Philipp Becker, Sebastian Mossburger, Fabian Otto, and Gerhard Neumann. Combining reconstruction and contrastive methods for multimodal representations in rl. Reinforcement Learning Conference (RLC), 2024
2024
-
[3]
Dota 2 with Large Scale Deep Reinforcement Learning
Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław Dębiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, et al. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019
2019
-
[4]
Sample-efficient reinforcement learning with stochastic ensemble value expansion
Jacob Buckman, Danijar Hafner, George Tucker, Eugene Brevdo, and Honglak Lee. Sample-efficient reinforcement learning with stochastic ensemble value expansion. Advances in Neural Information Processing Systems (NeurIPS), 31, 2018
2018
-
[5]
Deep reinforcement learning in a handful of trials using probabilistic dynamics models
Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. Advances in Neural Information Processing Systems (NeurIPS), 31, 2018
2018
-
[6]
Magnetic control of tokamak plasmas through deep reinforcement learning
Jonas Degrave, Federico Felici, Jonas Buchli, Michael Neunert, Brendan Tracey, Francesco Carpanese, Timo Ewalds, Roland Hafner, Abbas Abdolmaleki, Diego de Las Casas, et al. Magnetic control of tokamak plasmas through deep reinforcement learning. Nature, 602(7897): 414--419, 2022
2022
-
[7]
Pilco: A model-based and data-efficient approach to policy search
Marc Deisenroth and Carl E Rasmussen. Pilco: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 465--472, 2011
2011
-
[8]
Model-value inconsistency as a signal for epistemic uncertainty
Angelos Filos, Eszter Vértes, Zita Marinho, Gregory Farquhar, Diana Borsa, Abram Friesen, Feryal Behbahani, Tom Schaul, André Barreto, and Simon Osindero. Model-value inconsistency as a signal for epistemic uncertainty. International Conference on Learning Representations (ICLR), 2022
2022
-
[9]
Trust the model where it trusts itself--model-based actor-critic with uncertainty-aware rollout adaption
Bernd Frauenknecht, Artur Eisele, Devdutt Subhasish, Friedrich Solowjow, and Sebastian Trimpe. Trust the model where it trusts itself--model-based actor-critic with uncertainty-aware rollout adaption. International Conference on Machine Learning (ICML), 2024
2024
-
[10]
On rollouts in model-based reinforcement learning
Bernd Frauenknecht, Devdutt Subhasish, Friedrich Solowjow, and Sebastian Trimpe. On rollouts in model-based reinforcement learning. International Conference on Learning Representations (ICLR), 2025
2025
-
[11]
Dynamical variational autoencoders: A comprehensive review
Laurent Girin, Simon Leglaive, Xiaoyu Bie, Julien Diard, Thomas Hueber, and Xavier Alameda-Pineda. Dynamical variational autoencoders: A comprehensive review. Foundations and Trends in Machine Learning, 15(1-2): 1--175, 2022
2022
-
[12]
Towards an interpretable latent space in structured models for video prediction
Rushil Gupta, Vishal Sharma, Yash Jain, Yitao Liang, Guy Van den Broeck, and Parag Singla. Towards an interpretable latent space in structured models for video prediction. arXiv preprint arXiv:2107.07713, 2021
-
[13]
World models
David Ha and Jürgen Schmidhuber. World models. Conference on Neural Information Processing Systems (NeurIPS), 2(3): 440, 2018
2018
-
[14]
Dream to Control: Learning Behaviors by Latent Imagination
Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. arXiv preprint arXiv:1912.01603, 2019a
2019
-
[15]
Learning latent dynamics for planning from pixels
Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. In International Conference on Machine Learning (ICML), pp. 2555--2565. PMLR, 2019b
2019
-
[16]
Mastering Atari with Discrete World Models
Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with discrete world models. arXiv preprint arXiv:2010.02193, 2020
2020
-
[17]
Mastering Diverse Domains through World Models
Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023
2023
-
[18]
Temporal difference learning for model predictive control
Nicklas Hansen, Xiaolong Wang, and Hao Su. Temporal difference learning for model predictive control. International Conference on Machine Learning (ICML), 2022
2022
-
[19]
Td-mpc2: Scalable, robust world models for continuous control
Nicklas Hansen, Hao Su, and Xiaolong Wang. Td-mpc2: Scalable, robust world models for continuous control. International Conference on Learning Representations (ICLR), 2024
2024
-
[20]
When to trust your model: Model-based policy optimization
Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization. Advances in Neural Information Processing Systems (NeurIPS), 32, 2019
2019
-
[21]
Auto-Encoding Variational Bayes
Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013
2013
-
[22]
Bidirectional model-based policy optimization
Hang Lai, Jian Shen, Weinan Zhang, and Yong Yu. Bidirectional model-based policy optimization. In International Conference on Machine Learning (ICML), pp. 5618--5627. PMLR, 2020
2020
-
[23]
Simple and scalable predictive uncertainty estimation using deep ensembles
Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in Neural Information Processing Systems (NeurIPS), 30, 2017
2017
-
[24]
Guided policy search
Sergey Levine and Vladlen Koltun. Guided policy search. In International Conference on Machine Learning (ICML), pp. 1--9. PMLR, 2013
2013
-
[25]
Fld: Fourier latent dynamics for structured motion representation and learning
Chenhao Li, Elijah Stanger-Jones, Steve Heim, and Sangbae Kim. Fld: Fourier latent dynamics for structured motion representation and learning. International Conference on Learning Representations (ICLR), 2024
2024
-
[26]
Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning
Anusha Nagabandi, Gregory Kahn, Ronald S Fearing, and Sergey Levine. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 7559--7566. IEEE, 2018
2018
-
[27]
On the jensen--shannon symmetrization of distances relying on abstract means
Frank Nielsen. On the jensen--shannon symmetrization of distances relying on abstract means. Entropy, 21(5): 485, 2019
2019
-
[28]
Four principles for physically interpretable world models
Jordan Peper, Zhenjiang Mao, Yuang Geng, Siyuan Pan, and Ivan Ruchkin. Four principles for physically interpretable world models. IEEE International Conference on Robotics and Automation (ICRA), 2025
2025
-
[29]
Transformer-based world models are happy with 100k interactions
Jan Robine, Marc Höftmann, Tobias Uelwer, and Stefan Harmeling. Transformer-based world models are happy with 100k interactions. International Conference on Learning Representations (ICLR), 2023
2023
-
[30]
Model-based reinforcement learning via latent-space collocation
Oleh Rybkin, Chuning Zhu, Anusha Nagabandi, Kostas Daniilidis, Igor Mordatch, and Sergey Levine. Model-based reinforcement learning via latent-space collocation. In International Conference on Machine Learning (ICML), pp. 9190--9201. PMLR, 2021
2021
-
[31]
Curious exploration via structured world models yields zero-shot object manipulation
Cansu Sancaktar, Sebastian Blaes, and Georg Martius. Curious exploration via structured world models yields zero-shot object manipulation. Advances in Neural Information Processing Systems (NeurIPS), 35: 24170--24183, 2022
2022
-
[32]
Sensei: Semantic exploration guided by foundation models to learn versatile world models
Cansu Sancaktar, Christian Gumbsch, Andrii Zadaianchuk, Pavel Kolev, and Georg Martius. Sensei: Semantic exploration guided by foundation models to learn versatile world models. arXiv preprint arXiv:2503.01584, 2025
-
[33]
Planning to explore via self-supervised world models
Ramanan Sekar, Oleh Rybkin, Kostas Daniilidis, Pieter Abbeel, Danijar Hafner, and Deepak Pathak. Planning to explore via self-supervised world models. In International Conference on Machine Learning (ICML), pp. 8583--8592. PMLR, 2020
2020
-
[34]
Uncertainty-aware latent safety filters for avoiding out-of-distribution failures
Junwon Seo, Kensuke Nakamura, and Andrea Bajcsy. Uncertainty-aware latent safety filters for avoiding out-of-distribution failures. Conference on Robot Learning (CoRL), 2025
2025
-
[35]
Learning to plan optimistically: Uncertainty-guided deep exploration via latent model ensembles
Tim Seyde, Wilko Schwarting, Sertac Karaman, and Daniela Rus. Learning to plan optimistically: Uncertainty-guided deep exploration via latent model ensembles. Computing Research Repository (CoRR), 2020
2020
-
[36]
Optimistic active exploration of dynamical systems
Bhavya Sukhija, Lenart Treven, Cansu Sancaktar, Sebastian Blaes, Stelian Coros, and Andreas Krause. Optimistic active exploration of dynamical systems. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Advances in Neural Information Processing Systems, volume 36, pp. 38122--38153. Curran Associates, Inc., 2023
2023
-
[37]
Reinforcement learning: An introduction
Richard S Sutton and Andrew G. Barto. Reinforcement learning: An introduction. 2018
2018
-
[38]
DeepMind Control Suite
Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, Timothy Lillicrap, and Martin Riedmiller. DeepMind control suite. arXiv preprint arXiv:1801.00690, 2018
2018
-
[39]
Mujoco: A physics engine for model-based control
Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026--5033. IEEE, 2012
2012
-
[40]
Dynamic horizon value estimation for model-based reinforcement learning
Junjie Wang, Qichao Zhang, Dongbin Zhao, Mengchen Zhao, and Jianye Hao. Dynamic horizon value estimation for model-based reinforcement learning. arXiv preprint arXiv:2009.09593, 2020
-
[41]
Information theoretic mpc for model-based reinforcement learning
Grady Williams, Nolan Wagener, Brian Goldfain, Paul Drews, James M Rehg, Byron Boots, and Evangelos A Theodorou. Information theoretic mpc for model-based reinforcement learning. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 1714--1721. IEEE, 2017
2017
-
[42]
Mopo: Model-based offline policy optimization
Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Y Zou, Sergey Levine, Chelsea Finn, and Tengyu Ma. Mopo: Model-based offline policy optimization. Advances in Neural Information Processing Systems (NeurIPS), 33: 14129--14142, 2020
2020
-
[43]
Dino-wm: World models on pre-trained visual features enable zero-shot planning
Gaoyue Zhou, Hengkai Pan, Yann LeCun, and Lerrel Pinto. Dino-wm: World models on pre-trained visual features enable zero-shot planning. International Conference on Machine Learning (ICML), 2024
2024
-
[44]
Bridging imagination and reality for model-based deep reinforcement learning
Guangxiang Zhu, Minghao Zhang, Honglak Lee, and Chongjie Zhang. Bridging imagination and reality for model-based deep reinforcement learning. Advances in Neural Information Processing Systems (NeurIPS), 33: 8993--9006, 2020
2020