pith. machine review for the scientific record.

arxiv: 2604.25416 · v1 · submitted 2026-04-28 · 💻 cs.LG

Recognition: unknown

Biased Dreams: Limitations to Epistemic Uncertainty Quantification in Latent Space Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 16:52 UTC · model grok-4.3

classification 💻 cs.LG
keywords latent dynamics models · epistemic uncertainty · model-based reinforcement learning · Recurrent State Space Model · Dreamer · attractor behavior · reward overestimation · exploration

The pith

Latent dynamics models in reinforcement learning bias transitions toward well-represented regions, which can hide environment discrepancies and overestimate rewards.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper shows that latent transitions in models such as the Recurrent State Space Model pull predictions toward areas of latent space that were seen often during training. These attractors can differ from the true environment dynamics, so mismatches between model and reality may not register as elevated epistemic uncertainty. Because the attractors frequently sit in high-reward regions, imagined rollouts produce reward estimates that are systematically higher than the environment actually delivers. Readers should care because many image-based model-based reinforcement learning methods rely on these latent models for planning and exploration; undetected biases could produce policies that fail when transferred to reality. The work therefore questions whether standard epistemic uncertainty techniques transfer reliably from physical dynamics models to latent ones.

Core claim

Latent transitions are biased toward well-represented regions of latent space and exhibit an attractor behavior that can deviate from true environment dynamics. As a result, discrepancies in environment dynamics may not manifest as high epistemic uncertainty in latent space. Because these attractors often lie in high-reward regions, latent rollouts systematically overestimate predicted rewards. The findings indicate that epistemic uncertainty quantification, which works for physical dynamics models, has key limitations when applied to latent dynamics models used in methods such as Dreamer.
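
The overestimation claim is directly checkable in code. A minimal editorial sketch (not the paper's implementation; predict_rewards and simulate_rewards are hypothetical callables standing in for the latent model's reward head and the real environment) computes the signed per-step discrepancy that Figure 5 reports:

    import numpy as np

    def signed_reward_discrepancy(predict_rewards, simulate_rewards, actions):
        """Per-step signed gap r_pred - r_sim for one action sequence.

        predict_rewards(actions)  -> rewards imagined by the latent model
        simulate_rewards(actions) -> rewards returned by the environment
        Both are hypothetical interfaces; positive values mean overestimation.
        """
        r_pred = np.asarray(predict_rewards(actions), dtype=float)
        r_sim = np.asarray(simulate_rewards(actions), dtype=float)
        return r_pred - r_sim

    # Toy stand-ins: a "model" whose reward head carries a small optimistic bias.
    rng = np.random.default_rng(1)
    true_rewards = rng.uniform(0.0, 1.0, size=20)
    gap = signed_reward_discrepancy(
        predict_rewards=lambda a: true_rewards + 0.1,  # optimistic latent model
        simulate_rewards=lambda a: true_rewards,
        actions=np.zeros(20),
    )
    print("mean signed discrepancy:", gap.mean())  # > 0 means overestimation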

What carries the argument

Attractor behavior of latent transitions in Recurrent State Space Models, which draws next-state predictions toward densely sampled regions irrespective of actual dynamics.
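
To make the mechanism concrete, a self-contained editorial sketch (the toy latent_step below is a stand-in for a trained RSSM prior, not the paper's model) rolls a transition forward from in-distribution and out-of-distribution starts and tracks the distance to the densest region of training latents:

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical stand-ins: latents visited during training, and a toy
    # "learned" transition that interpolates toward their mean.
    train_latents = rng.normal(loc=1.0, scale=0.3, size=(10_000, 8))
    attractor = train_latents.mean(axis=0)

    def latent_step(z, pull=0.15):
        """Toy transition with attractor behavior: each step moves partly
        toward the well-represented region of latent space."""
        return z + pull * (attractor - z) + 0.01 * rng.normal(size=z.shape)

    def drift(z0, horizon=50):
        """Distance to the training-latent mean at each rollout step."""
        z, dists = z0, []
        for _ in range(horizon):
            z = latent_step(z)
            dists.append(np.linalg.norm(z - attractor))
        return np.array(dists)

    id_start = train_latents[0]          # in-distribution start state
    ood_start = id_start + 5.0           # far outside the training data
    print("ID drift: ", drift(id_start)[[0, 10, 49]])
    print("OOD drift:", drift(ood_start)[[0, 10, 49]])
    # If the OOD distances shrink to the ID level, the rollout has been
    # pulled into the well-represented region regardless of where it began.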

If this is right

  • Discrepancies between the model and the true environment may remain invisible to epistemic uncertainty measures in latent space.
  • Reward predictions from latent rollouts are biased upward when attractors coincide with high-reward areas.
  • Exploration driven by epistemic uncertainty can be directed toward the wrong states because the uncertainty signal does not reflect true model error (a generic ensemble-disagreement sketch follows this list).
  • Standard techniques for mitigating model exploitation in model-based reinforcement learning lose effectiveness when the dynamics model operates in latent space.
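
For reference, the standard recipe the paper stress-tests reads epistemic uncertainty off as disagreement among ensemble members. A generic sketch with linear stand-in heads (not Dreamer's networks): the paper's warning is that when every member is drawn to the same attractor, this disagreement can stay low exactly where the model is wrong.

    import numpy as np

    def make_head(seed, dim=8):
        """Hypothetical ensemble member: a random linear latent transition."""
        w = np.random.default_rng(seed).normal(scale=0.1, size=(dim, dim))
        return lambda z: z @ w

    heads = [make_head(s) for s in range(5)]

    def ensemble_disagreement(z):
        """Epistemic-uncertainty proxy: variance of next-latent predictions
        across ensemble members, averaged over latent dimensions."""
        preds = np.stack([head(z) for head in heads])  # (members, dim)
        return preds.var(axis=0).mean()

    z = np.random.default_rng(2).normal(size=8)
    print("disagreement:", ensemble_disagreement(z))
    # If all heads share one attractor, this number can remain small even
    # in regions where the latent dynamics are badly wrong.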

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Agents trained with these latent models may appear competent in simulation but encounter unexpected failures on transfer because hidden dynamics errors are not reflected in uncertainty signals.
  • Hybrid approaches that combine latent representations with occasional physical-state checks could reduce the impact of attractor bias.
  • Long-horizon planning in latent space may require additional consistency constraints beyond current uncertainty quantification to avoid compounding errors along attractor trajectories.

Load-bearing premise

That the observed attractor behavior and its effects on uncertainty and reward estimates are general properties of latent dynamics models rather than limited to the environments and architectures tested.

What would settle it

Train a latent dynamics model on an environment that contains a clear dynamics change only in sparsely sampled regions, then measure whether epistemic uncertainty rises in those regions or whether the model still collapses predictions to high-density attractors without flagging the mismatch.
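
A toy version of that protocol, as an editorial sketch (1-D stand-in dynamics rather than a trained latent model; the kink at x > 2 plays the role of a dynamics change confined to a sparsely sampled region):

    import numpy as np

    rng = np.random.default_rng(3)

    def true_dynamics(x):
        """Toy environment whose dynamics change only for x > 2, a region
        deliberately kept sparse in the training data."""
        return 0.9 * x if x <= 2.0 else 0.9 * x + 1.5  # hidden kink

    # Training data comes almost entirely from the dense region x <= 2.
    xs = rng.uniform(-2.0, 2.0, size=5000)
    ys = np.array([true_dynamics(x) for x in xs])

    # Bootstrap ensemble of linear fits, standing in for transition heads.
    def fit_linear(x, y):
        slope, intercept = np.polyfit(x, y, deg=1)
        return lambda q: slope * q + intercept

    heads = []
    for _ in range(5):
        idx = rng.integers(0, len(xs), size=len(xs))
        heads.append(fit_linear(xs[idx], ys[idx]))

    def check(x):
        preds = np.array([head(x) for head in heads])
        error = abs(preds.mean() - true_dynamics(x))  # true model error
        uncertainty = preds.std()                     # epistemic proxy
        print(f"x={x:+.1f}  error={error:.3f}  uncertainty={uncertainty:.3f}")

    check(0.0)  # dense region: small error, small uncertainty
    check(3.0)  # sparse region with the hidden change: the test is whether
                # uncertainty rises with the error or stays flat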

Figures

Figures reproduced from arXiv: 2604.25416 by Bastian Leibe, Bernd Frauenknecht, Julia Berger, Sebastian Trimpe.

Figure 1. Illustrative visualization of the paper's key findings.
Figure 2. Attractor evaluation for RSSM and PE under ID and OOD start states for the Cheetah Run and Halfcheetah tasks, respectively. While the PE trajectory shows unstable OOD predictions, the RSSM trajectory is unexpectedly drawn to the dominant transition flow, revealing an attractor that guides it toward well-represented latent regions. ID and OOD Start States: rollouts are initialized from ID and OOD physical start…
Figure 3. A Dreamer agent's performance is largely unchanged by the use of a physical state decoder.
Figure 4. Physical discrepancy versus prior uncertainty…
Figure 5. Signed reward discrepancies r_pred,t − r_sim,t in prior and posterior rollouts in RSSM and Cat-RSSM. Solid lines are averaged over 5 seeds per task; shaded areas show standard deviation. The gray dashed line indicates the end of warm-up steps. Prior rollouts systematically overestimate rewards, suggesting the attractors bias toward well-represented, high-reward latent regions.
Figure 6. Attractor analysis of RSSM and Cat-RSSM across the Cartpole Swingup, Hopper Hop, and Walker Run environments, shown for both ID (left subplots) and OOD (right subplots) settings.
Figure 7. RL performance is largely unchanged by the use of the physical state decoder (PD)…
Figure 8. Comparison of predicted physical trajectories in…
Figure 9. Comparison of reconstructed image trajectories in…
Figure 10. Comparison of reconstructed image trajectories in…
Figure 11. Comparison of reconstructed physical trajectories in…
Figure 12. Comparison of reconstructed physical trajectories in…
read the original abstract

Model-Based Reinforcement Learning distinguishes between physical dynamics models operating on proprioceptive inputs and latent dynamics models operating on high-dimensional image observations. A prominent latent approach is the Recurrent State Space Model used in the Dreamer family. While epistemic uncertainty quantification to inform exploration and mitigate model exploitation is well established for physical dynamics models, its transfer to latent dynamics models has received limited scrutiny. We empirically demonstrate that latent transitions are biased toward well-represented regions of latent space, exhibiting an attractor behavior that can deviate from true environment dynamics. As a result, discrepancies in environment dynamics may not manifest in latent space, undermining the reliability of epistemic uncertainty estimates. Because these attractors often lie in high-reward regions, latent rollouts systematically overestimate predicted rewards. Our findings highlight key limitations of epistemic uncertainty estimation in latent dynamics models and motivate more critical evaluation of this method.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper empirically demonstrates that in latent dynamics models such as the Recurrent State Space Models from the Dreamer family, transitions are biased toward well-represented regions of latent space. This produces attractor behavior that deviates from true environment dynamics, so that discrepancies in the environment may not appear in latent space. Consequently, epistemic uncertainty estimates become unreliable and latent rollouts systematically overestimate rewards because the attractors often lie in high-reward regions.

Significance. If the observed bias and its consequences hold beyond the tested setups, the result is significant for model-based RL. It identifies a concrete mechanism by which epistemic uncertainty quantification can fail when transferred from proprioceptive to high-dimensional latent models, directly affecting exploration and model-exploitation safeguards. The work therefore supplies a practical caution that can guide the design of more robust uncertainty methods in latent-space MBRL.

major comments (2)
  1. [Empirical Evaluation] The central claim that attractor bias is a general property of latent dynamics models rests on empirical demonstrations in the Dreamer family and selected environments. No theoretical derivation is supplied showing why any latent transition model must develop such attractors; the argument therefore inherits the risk that the effect is architecture- or training-specific.
  2. [Empirical demonstrations] Support for the claim that epistemic uncertainty estimates are undermined and rewards overestimated is only moderate: full details on controls, quantitative metrics, and statistical significance are not provided in the abstract-level summary of the experiments. This leaves open the possibility of example selection and weakens the load-bearing assertion that the bias is systematic.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive and insightful comments. We address each major comment below, clarifying the empirical scope of our work and expanding experimental details where appropriate. Revisions have been made to improve precision and completeness.

read point-by-point responses
  1. Referee: [Empirical Evaluation] The central claim that attractor bias is a general property of latent dynamics models rests on empirical demonstrations in the Dreamer family and selected environments. No theoretical derivation is supplied showing why any latent transition model must develop such attractors; the argument therefore inherits the risk that the effect is architecture- or training-specific.

    Authors: We agree that the manuscript does not supply a theoretical derivation establishing that attractor bias must arise in every possible latent dynamics model. Our contribution is an empirical demonstration within the Recurrent State Space Models of the Dreamer family and the environments evaluated. The revised manuscript now explicitly states this scope in the abstract, introduction, and discussion, and lists the lack of a general theoretical guarantee as a limitation. We believe the empirical evidence remains valuable for highlighting a concrete failure mode in widely deployed latent MBRL methods. revision: yes

  2. Referee: [Empirical demonstrations] Support for the claim that epistemic uncertainty estimates are undermined and rewards overestimated is only moderate: full details on controls, quantitative metrics, and statistical significance are not provided in the abstract-level summary of the experiments. This leaves open the possibility of example selection and weakens the load-bearing assertion that the bias is systematic.

    Authors: We acknowledge that the original presentation summarized experiments at a high level and omitted some controls and metrics. The revised manuscript expands the experimental section with full hyperparameter tables, descriptions of control experiments (including ablations on representation learning), the precise quantitative metrics used to quantify attractor bias and reward overestimation, and statistical significance results across multiple random seeds and environments. Additional environments and runs have been added to reduce the risk of example selection. revision: yes

standing simulated objections not resolved
  • Absence of a theoretical derivation proving that attractor bias must occur in arbitrary latent transition models.

Circularity Check

0 steps flagged

No circularity: purely empirical observations with no derivations or self-referential steps

full rationale

The paper's central claims rest on direct empirical demonstrations of attractor behavior in latent transitions within the Dreamer family of models, using experimental rollouts and uncertainty estimates across selected environments. No mathematical derivations, fitted parameters renamed as predictions, ansatzes, or load-bearing self-citations are present in the provided text or abstract. The analysis does not invoke uniqueness theorems, prior author work for core premises, or any chain that reduces a result to its own inputs by construction. This is a standard empirical limitation study whose validity hinges on experimental reproducibility rather than internal logical closure.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work is an empirical study; no free parameters, new entities, or non-standard axioms are introduced or fitted in the abstract. Relies on standard domain assumptions about latent space models in RL.

axioms (1)
  • Domain assumption: Latent space models can be trained to approximate environment dynamics from image observations.
    Implicit in the use of RSSM and Dreamer architectures for the empirical tests.

pith-pipeline@v0.9.0 · 5448 in / 1188 out tokens · 59662 ms · 2026-05-07T16:52:41.455421+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

44 extracted references · 8 canonical work pages · 5 internal anchors

  1. [1]

    On uncertainty in deep state space models for model-based reinforcement learning

    Philipp Becker and Gerhard Neumann. On uncertainty in deep state space models for model-based reinforcement learning. Transactions on Machine Learning Research (TMLR), 2022

  2. [2]

    Combining reconstruction and contrastive methods for multimodal representations in RL

    Philipp Becker, Sebastian Mossburger, Fabian Otto, and Gerhard Neumann. Combining reconstruction and contrastive methods for multimodal representations in RL. Reinforcement Learning Conference (RLC), 2024

  3. [3]

    Dota 2 with Large Scale Deep Reinforcement Learning

    Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław Dębiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, et al. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019

  4. [4]

    Sample-efficient reinforcement learning with stochastic ensemble value expansion

    Jacob Buckman, Danijar Hafner, George Tucker, Eugene Brevdo, and Honglak Lee. Sample-efficient reinforcement learning with stochastic ensemble value expansion. Advances in Neural Information Processing Systems (NeurIPS), 31, 2018

  5. [5]

    Deep reinforcement learning in a handful of trials using probabilistic dynamics models

    Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. Advances in Neural Information Processing Systems (NeurIPS), 31, 2018

  6. [6]

    Magnetic control of tokamak plasmas through deep reinforcement learning

    Jonas Degrave, Federico Felici, Jonas Buchli, Michael Neunert, Brendan Tracey, Francesco Carpanese, Timo Ewalds, Roland Hafner, Abbas Abdolmaleki, Diego de Las Casas, et al. Magnetic control of tokamak plasmas through deep reinforcement learning. Nature, 602(7897): 414–419, 2022

  7. [7]

    PILCO: A model-based and data-efficient approach to policy search

    Marc Deisenroth and Carl E Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 465–472, 2011

  8. [8]

    Model-value inconsistency as a signal for epistemic uncertainty

    Angelos Filos, Eszter Vértes, Zita Marinho, Gregory Farquhar, Diana Borsa, Abram Friesen, Feryal Behbahani, Tom Schaul, André Barreto, and Simon Osindero. Model-value inconsistency as a signal for epistemic uncertainty. International Conference on Learning Representations (ICLR), 2022

  9. [9]

    Trust the model where it trusts itself--model-based actor-critic with uncertainty-aware rollout adaption

    Bernd Frauenknecht, Artur Eisele, Devdutt Subhasish, Friedrich Solowjow, and Sebastian Trimpe. Trust the model where it trusts itself--model-based actor-critic with uncertainty-aware rollout adaption. International Conference on Machine Learning (ICML), 2024

  10. [10]

    On rollouts in model-based reinforcement learning

    Bernd Frauenknecht, Devdutt Subhasish, Friedrich Solowjow, and Sebastian Trimpe. On rollouts in model-based reinforcement learning. International Conference on Learning Representations (ICLR), 2025

  11. [11]

    Dynamical variational autoencoders: A comprehensive review

    Laurent Girin, Simon Leglaive, Xiaoyu Bie, Julien Diard, Thomas Hueber, and Xavier Alameda-Pineda. Dynamical variational autoencoders: A comprehensive review. Foundations and Trends in Machine Learning, 15(1–2): 1–175, 2022

  12. [12]

    Towards an interpretable latent space in structured models for video prediction

    Rushil Gupta, Vishal Sharma, Yash Jain, Yitao Liang, Guy Van den Broeck, and Parag Singla. Towards an interpretable latent space in structured models for video prediction. arXiv preprint arXiv:2107.07713, 2021

  13. [13]

    World models

    David Ha and Jürgen Schmidhuber. World models. Conference on Neural Information Processing Systems (NeurIPS), 2(3): 440, 2018

  14. [14]

    Dream to Control: Learning Behaviors by Latent Imagination

    Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. arXiv preprint arXiv:1912.01603, 2019a

  15. [15]

    Learning latent dynamics for planning from pixels

    Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. In International Conference on Machine Learning (ICML), pp. 2555–2565. PMLR, 2019b

  16. [16]

    Mastering Atari with Discrete World Models

    Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with discrete world models. arXiv preprint arXiv:2010.02193, 2020

  17. [17]

    Mastering Diverse Domains through World Models

    Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023

  18. [18]

    Temporal difference learning for model predictive control

    Nicklas Hansen, Xiaolong Wang, and Hao Su. Temporal difference learning for model predictive control. International Conference on Machine Learning (ICML), 2022

  19. [19]

    TD-MPC2: Scalable, robust world models for continuous control

    Nicklas Hansen, Hao Su, and Xiaolong Wang. TD-MPC2: Scalable, robust world models for continuous control. International Conference on Learning Representations (ICLR), 2024

  20. [20]

    When to trust your model: Model-based policy optimization

    Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization. Advances in Neural Information Processing Systems (NeurIPS), 32, 2019

  21. [21]

    Auto-Encoding Variational Bayes

    Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013

  22. [22]

    Bidirectional model-based policy optimization

    Hang Lai, Jian Shen, Weinan Zhang, and Yong Yu. Bidirectional model-based policy optimization. In International Conference on Machine Learning (ICML), pp. 5618–5627. PMLR, 2020

  23. [23]

    Simple and scalable predictive uncertainty estimation using deep ensembles

    Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in Neural Information Processing Systems (NeurIPS), 30, 2017

  24. [24]

    Guided policy search

    Sergey Levine and Vladlen Koltun. Guided policy search. In International Conference on Machine Learning (ICML), pp. 1–9. PMLR, 2013

  25. [25]

    FLD: Fourier latent dynamics for structured motion representation and learning

    Chenhao Li, Elijah Stanger-Jones, Steve Heim, and Sangbae Kim. FLD: Fourier latent dynamics for structured motion representation and learning. International Conference on Learning Representations (ICLR), 2024

  26. [26]

    Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning

    Anusha Nagabandi, Gregory Kahn, Ronald S Fearing, and Sergey Levine. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 7559–7566. IEEE, 2018

  27. [27]

    On the Jensen–Shannon symmetrization of distances relying on abstract means

    Frank Nielsen. On the Jensen–Shannon symmetrization of distances relying on abstract means. Entropy, 21(5): 485, 2019

  28. [28]

    Four principles for physically interpretable world models

    Jordan Peper, Zhenjiang Mao, Yuang Geng, Siyuan Pan, and Ivan Ruchkin. Four principles for physically interpretable world models. IEEE International Conference on Robotics and Automation (ICRA), 2025

  29. [29]

    Transformer-based world models are happy with 100k interactions

    Jan Robine, Marc Höftmann, Tobias Uelwer, and Stefan Harmeling. Transformer-based world models are happy with 100k interactions. International Conference on Learning Representations (ICLR), 2023

  30. [30]

    Model-based reinforcement learning via latent-space collocation

    Oleh Rybkin, Chuning Zhu, Anusha Nagabandi, Kostas Daniilidis, Igor Mordatch, and Sergey Levine. Model-based reinforcement learning via latent-space collocation. In International Conference on Machine Learning (ICML), pp. 9190–9201. PMLR, 2021

  31. [31]

    Curious exploration via structured world models yields zero-shot object manipulation

    Cansu Sancaktar, Sebastian Blaes, and Georg Martius. Curious exploration via structured world models yields zero-shot object manipulation. Advances in Neural Information Processing Systems (NeurIPS), 35: 24170–24183, 2022

  32. [32]

    Sensei: Semantic exploration guided by foundation models to learn versatile world models

    Cansu Sancaktar, Christian Gumbsch, Andrii Zadaianchuk, Pavel Kolev, and Georg Martius. Sensei: Semantic exploration guided by foundation models to learn versatile world models. arXiv preprint arXiv:2503.01584, 2025

  33. [33]

    Planning to explore via self-supervised world models

    Ramanan Sekar, Oleh Rybkin, Kostas Daniilidis, Pieter Abbeel, Danijar Hafner, and Deepak Pathak. Planning to explore via self-supervised world models. In International Conference on Machine Learning (ICML), pp. 8583–8592. PMLR, 2020

  34. [34]

    Uncertainty-aware latent safety filters for avoiding out-of-distribution failures

    Junwon Seo, Kensuke Nakamura, and Andrea Bajcsy. Uncertainty-aware latent safety filters for avoiding out-of-distribution failures. Conference on Robot Learning (CoRL), 2025

  35. [35]

    Learning to plan optimistically: Uncertainty-guided deep exploration via latent model ensembles

    Tim Seyde, Wilko Schwarting, Sertac Karaman, and Daniela Rus. Learning to plan optimistically: Uncertainty-guided deep exploration via latent model ensembles. Computing Research Repository (CoRR), 2020

  36. [36]

    Optimistic active exploration of dynamical systems

    Bhavya Sukhija, Lenart Treven, Cansu Sancaktar, Sebastian Blaes, Stelian Coros, and Andreas Krause. Optimistic active exploration of dynamical systems. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Advances in Neural Information Processing Systems, volume 36, pp. 38122–38153. Curran Associates, Inc., 2023

  37. [37]

    Reinforcement learning: An introduction

    Richard S Sutton and Andrew G. Barto. Reinforcement learning: An introduction. 2018

  38. [38]

    DeepMind Control Suite

    Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, Timothy Lillicrap, and Martin Riedmiller. DeepMind Control Suite, 2018

  39. [39]

    MuJoCo: A physics engine for model-based control

    Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. IEEE, 2012

  40. [40]

    Dynamic horizon value estimation for model-based reinforcement learning

    Junjie Wang, Qichao Zhang, Dongbin Zhao, Mengchen Zhao, and Jianye Hao. Dynamic horizon value estimation for model-based reinforcement learning. arXiv preprint arXiv:2009.09593, 2020

  41. [41]

    Information theoretic MPC for model-based reinforcement learning

    Grady Williams, Nolan Wagener, Brian Goldfain, Paul Drews, James M Rehg, Byron Boots, and Evangelos A Theodorou. Information theoretic MPC for model-based reinforcement learning. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 1714–1721. IEEE, 2017

  42. [42]

    MOPO: Model-based offline policy optimization

    Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Y Zou, Sergey Levine, Chelsea Finn, and Tengyu Ma. MOPO: Model-based offline policy optimization. Advances in Neural Information Processing Systems (NeurIPS), 33: 14129–14142, 2020

  43. [43]

    DINO-WM: World models on pre-trained visual features enable zero-shot planning

    Gaoyue Zhou, Hengkai Pan, Yann LeCun, and Lerrel Pinto. DINO-WM: World models on pre-trained visual features enable zero-shot planning. International Conference on Machine Learning (ICML), 2024

  44. [44]

    Bridging imagination and reality for model-based deep reinforcement learning

    Guangxiang Zhu, Minghao Zhang, Honglak Lee, and Chongjie Zhang. Bridging imagination and reality for model-based deep reinforcement learning. Advances in Neural Information Processing Systems (NeurIPS), 33: 8993–9006, 2020
