pith. machine review for the scientific record.

arxiv: 2604.25416 · v1 · submitted 2026-04-28 · 💻 cs.LG

Recognition: unknown

Biased Dreams: Limitations to Epistemic Uncertainty Quantification in Latent Space Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 16:52 UTC · model grok-4.3

classification 💻 cs.LG
keywords latent dynamics models · epistemic uncertainty · model-based reinforcement learning · Recurrent State Space Model · Dreamer · attractor behavior · reward overestimation · exploration

The pith

Latent dynamics models in reinforcement learning bias transitions toward well-represented regions, which can hide environment discrepancies and overestimate rewards.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper shows that latent transitions in models such as the Recurrent State Space Model pull predictions toward areas of latent space that were seen often during training. These attractors can differ from the true environment dynamics, so mismatches between model and reality may not register as elevated epistemic uncertainty. Because the attractors frequently sit in high-reward regions, imagined rollouts produce reward estimates that are systematically higher than the environment actually delivers. Readers should care because many image-based model-based reinforcement learning methods rely on these latent models for planning and exploration; undetected biases could produce policies that fail when transferred to reality. The work therefore questions whether standard epistemic uncertainty techniques transfer reliably from physical dynamics models to latent ones.

Core claim

Latent transitions are biased toward well-represented regions of latent space and exhibit an attractor behavior that can deviate from true environment dynamics. As a result, discrepancies in environment dynamics may not manifest as high epistemic uncertainty in latent space. Because these attractors often lie in high-reward regions, latent rollouts systematically overestimate predicted rewards. The findings indicate that epistemic uncertainty quantification, which works for physical dynamics models, has key limitations when applied to latent dynamics models used in methods such as Dreamer.
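
The overestimation claim is directly checkable in code. A minimal editorial sketch (not the paper's implementation; predict_rewards and simulate_rewards are hypothetical callables standing in for the latent model's reward head and the real environment) computes the signed per-step discrepancy that Figure 5 reports:

    import numpy as np

    def signed_reward_discrepancy(predict_rewards, simulate_rewards, actions):
        """Per-step signed gap r_pred - r_sim for one action sequence.

        predict_rewards(actions)  -> rewards imagined by the latent model
        simulate_rewards(actions) -> rewards returned by the environment
        Both are hypothetical interfaces; positive values mean overestimation.
        """
        r_pred = np.asarray(predict_rewards(actions), dtype=float)
        r_sim = np.asarray(simulate_rewards(actions), dtype=float)
        return r_pred - r_sim

    # Toy stand-ins: a "model" whose reward head carries a small optimistic bias.
    rng = np.random.default_rng(1)
    true_rewards = rng.uniform(0.0, 1.0, size=20)
    gap = signed_reward_discrepancy(
        predict_rewards=lambda a: true_rewards + 0.1,  # optimistic latent model
        simulate_rewards=lambda a: true_rewards,
        actions=np.zeros(20),
    )
    print("mean signed discrepancy:", gap.mean())  # > 0 means overestimation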

What carries the argument

Attractor behavior of latent transitions in Recurrent State Space Models, which draws next-state predictions toward densely sampled regions irrespective of actual dynamics.
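
To make the mechanism concrete, a self-contained editorial sketch (the toy latent_step below is a stand-in for a trained RSSM prior, not the paper's model) rolls a transition forward from in-distribution and out-of-distribution starts and tracks the distance to the densest region of training latents:

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical stand-ins: latents visited during training, and a toy
    # "learned" transition that interpolates toward their mean.
    train_latents = rng.normal(loc=1.0, scale=0.3, size=(10_000, 8))
    attractor = train_latents.mean(axis=0)

    def latent_step(z, pull=0.15):
        """Toy transition with attractor behavior: each step moves partly
        toward the well-represented region of latent space."""
        return z + pull * (attractor - z) + 0.01 * rng.normal(size=z.shape)

    def drift(z0, horizon=50):
        """Distance to the training-latent mean at each rollout step."""
        z, dists = z0, []
        for _ in range(horizon):
            z = latent_step(z)
            dists.append(np.linalg.norm(z - attractor))
        return np.array(dists)

    id_start = train_latents[0]          # in-distribution start state
    ood_start = id_start + 5.0           # far outside the training data
    print("ID drift: ", drift(id_start)[[0, 10, 49]])
    print("OOD drift:", drift(ood_start)[[0, 10, 49]])
    # If the OOD distances shrink to the ID level, the rollout has been
    # pulled into the well-represented region regardless of where it began.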

If this is right

  • Discrepancies between the model and the true environment may remain invisible to epistemic uncertainty measures in latent space.
  • Reward predictions from latent rollouts are biased upward when attractors coincide with high-reward areas.
  • Exploration driven by epistemic uncertainty can be directed toward the wrong states because the uncertainty signal does not reflect true model error (a generic ensemble-disagreement sketch follows this list).
  • Standard techniques for mitigating model exploitation in model-based reinforcement learning lose effectiveness when the dynamics model operates in latent space.
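
For reference, the standard recipe the paper stress-tests reads epistemic uncertainty off as disagreement among ensemble members. A generic sketch with linear stand-in heads (not Dreamer's networks): the paper's warning is that when every member is drawn to the same attractor, this disagreement can stay low exactly where the model is wrong.

    import numpy as np

    def make_head(seed, dim=8):
        """Hypothetical ensemble member: a random linear latent transition."""
        w = np.random.default_rng(seed).normal(scale=0.1, size=(dim, dim))
        return lambda z: z @ w

    heads = [make_head(s) for s in range(5)]

    def ensemble_disagreement(z):
        """Epistemic-uncertainty proxy: variance of next-latent predictions
        across ensemble members, averaged over latent dimensions."""
        preds = np.stack([head(z) for head in heads])  # (members, dim)
        return preds.var(axis=0).mean()

    z = np.random.default_rng(2).normal(size=8)
    print("disagreement:", ensemble_disagreement(z))
    # If all heads share one attractor, this number can remain small even
    # in regions where the latent dynamics are badly wrong.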

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Agents trained with these latent models may appear competent in simulation but encounter unexpected failures on transfer because hidden dynamics errors are not reflected in uncertainty signals.
  • Hybrid approaches that combine latent representations with occasional physical-state checks could reduce the impact of attractor bias.
  • Long-horizon planning in latent space may require additional consistency constraints beyond current uncertainty quantification to avoid compounding errors along attractor trajectories.

Load-bearing premise

That the observed attractor behavior and its effects on uncertainty and reward estimates are general properties of latent dynamics models rather than limited to the environments and architectures tested.

What would settle it

Train a latent dynamics model on an environment that contains a clear dynamics change only in sparsely sampled regions, then measure whether epistemic uncertainty rises in those regions or whether the model still collapses predictions to high-density attractors without flagging the mismatch.
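
A toy version of that protocol, as an editorial sketch (1-D stand-in dynamics rather than a trained latent model; the kink at x > 2 plays the role of a dynamics change confined to a sparsely sampled region):

    import numpy as np

    rng = np.random.default_rng(3)

    def true_dynamics(x):
        """Toy environment whose dynamics change only for x > 2, a region
        deliberately kept sparse in the training data."""
        return 0.9 * x if x <= 2.0 else 0.9 * x + 1.5  # hidden kink

    # Training data comes almost entirely from the dense region x <= 2.
    xs = rng.uniform(-2.0, 2.0, size=5000)
    ys = np.array([true_dynamics(x) for x in xs])

    # Bootstrap ensemble of linear fits, standing in for transition heads.
    def fit_linear(x, y):
        slope, intercept = np.polyfit(x, y, deg=1)
        return lambda q: slope * q + intercept

    heads = []
    for _ in range(5):
        idx = rng.integers(0, len(xs), size=len(xs))
        heads.append(fit_linear(xs[idx], ys[idx]))

    def check(x):
        preds = np.array([head(x) for head in heads])
        error = abs(preds.mean() - true_dynamics(x))  # true model error
        uncertainty = preds.std()                     # epistemic proxy
        print(f"x={x:+.1f}  error={error:.3f}  uncertainty={uncertainty:.3f}")

    check(0.0)  # dense region: small error, small uncertainty
    check(3.0)  # sparse region with the hidden change: the test is whether
                # uncertainty rises with the error or stays flat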

Figures

Figures reproduced from arXiv: 2604.25416 by Bastian Leibe, Bernd Frauenknecht, Julia Berger, Sebastian Trimpe.

Figure 1. Illustrative visualization of the paper's key findings.
Figure 2. Attractor evaluation for RSSM and PE under ID and OOD start states for the Cheetah Run and Halfcheetah tasks, respectively. While the PE trajectory shows unstable OOD predictions, the RSSM trajectory is unexpectedly drawn to the dominant transition flow, revealing an attractor that guides it toward well-represented latent regions. ID and OOD Start States: rollouts are initialized from ID and OOD physical start…
Figure 3. A Dreamer agent's performance is largely unchanged by the use of a physical state decoder.
Figure 4. Physical discrepancy versus prior uncertainty…
Figure 5. Signed reward discrepancies r_pred,t − r_sim,t in prior and posterior rollouts in RSSM and Cat-RSSM. Solid lines are averaged over 5 seeds per task; shaded areas show standard deviation. The gray dashed line indicates the end of warm-up steps. Prior rollouts systematically overestimate rewards, suggesting the attractors bias toward well-represented, high-reward latent regions.
Figure 6. Attractor analysis of RSSM and Cat-RSSM across the Cartpole Swingup, Hopper Hop, and Walker Run environments, shown for both ID (left subplots) and OOD (right subplots) settings.
Figure 7. RL performance is largely unchanged by the use of the physical state decoder (PD)…
Figure 8. Comparison of predicted physical trajectories in…
Figure 9. Comparison of reconstructed image trajectories in…
Figure 10. Comparison of reconstructed image trajectories in…
Figure 11. Comparison of reconstructed physical trajectories in…
Figure 12. Comparison of reconstructed physical trajectories in…
read the original abstract

Model-Based Reinforcement Learning distinguishes between physical dynamics models operating on proprioceptive inputs and latent dynamics models operating on high-dimensional image observations. A prominent latent approach is the Recurrent State Space Model used in the Dreamer family. While epistemic uncertainty quantification to inform exploration and mitigate model exploitation is well established for physical dynamics models, its transfer to latent dynamics models has received limited scrutiny. We empirically demonstrate that latent transitions are biased toward well-represented regions of latent space, exhibiting an attractor behavior that can deviate from true environment dynamics. As a result, discrepancies in environment dynamics may not manifest in latent space, undermining the reliability of epistemic uncertainty estimates. Because these attractors often lie in high-reward regions, latent rollouts systematically overestimate predicted rewards. Our findings highlight key limitations of epistemic uncertainty estimation in latent dynamics models and motivate more critical evaluation of this method.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper empirically demonstrates that in latent dynamics models such as the Recurrent State Space Models from the Dreamer family, transitions are biased toward well-represented regions of latent space. This produces attractor behavior that deviates from true environment dynamics, so that discrepancies in the environment may not appear in latent space. Consequently, epistemic uncertainty estimates become unreliable and latent rollouts systematically overestimate rewards because the attractors often lie in high-reward regions.

Significance. If the observed bias and its consequences hold beyond the tested setups, the result is significant for model-based RL. It identifies a concrete mechanism by which epistemic uncertainty quantification can fail when transferred from proprioceptive to high-dimensional latent models, directly affecting exploration and model-exploitation safeguards. The work therefore supplies a practical caution that can guide the design of more robust uncertainty methods in latent-space MBRL.

major comments (2)
  1. [Empirical Evaluation] The central claim that attractor bias is a general property of latent dynamics models rests on empirical demonstrations in the Dreamer family and selected environments. No theoretical derivation is supplied showing why any latent transition model must develop such attractors; the argument therefore inherits the risk that the effect is architecture- or training-specific.
  2. [Empirical demonstrations] Support for the claim that epistemic uncertainty estimates are undermined and rewards overestimated is only moderate: full details on controls, quantitative metrics, and statistical significance are not provided in the abstract-level summary of the experiments. This leaves open the possibility of example selection and weakens the load-bearing assertion that the bias is systematic.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive and insightful comments. We address each major comment below, clarifying the empirical scope of our work and expanding experimental details where appropriate. Revisions have been made to improve precision and completeness.

read point-by-point responses
  1. Referee: [Empirical Evaluation] The central claim that attractor bias is a general property of latent dynamics models rests on empirical demonstrations in the Dreamer family and selected environments. No theoretical derivation is supplied showing why any latent transition model must develop such attractors; the argument therefore inherits the risk that the effect is architecture- or training-specific.

    Authors: We agree that the manuscript does not supply a theoretical derivation establishing that attractor bias must arise in every possible latent dynamics model. Our contribution is an empirical demonstration within the Recurrent State Space Models of the Dreamer family and the environments evaluated. The revised manuscript now explicitly states this scope in the abstract, introduction, and discussion, and lists the lack of a general theoretical guarantee as a limitation. We believe the empirical evidence remains valuable for highlighting a concrete failure mode in widely deployed latent MBRL methods. revision: yes

  2. Referee: [Empirical demonstrations] Support for the claim that epistemic uncertainty estimates are undermined and rewards overestimated is only moderate: full details on controls, quantitative metrics, and statistical significance are not provided in the abstract-level summary of the experiments. This leaves open the possibility of example selection and weakens the load-bearing assertion that the bias is systematic.

    Authors: We acknowledge that the original presentation summarized experiments at a high level and omitted some controls and metrics. The revised manuscript expands the experimental section with full hyperparameter tables, descriptions of control experiments (including ablations on representation learning), the precise quantitative metrics used to quantify attractor bias and reward overestimation, and statistical significance results across multiple random seeds and environments. Additional environments and runs have been added to reduce the risk of example selection. revision: yes

standing simulated objections not resolved
  • Absence of a theoretical derivation proving that attractor bias must occur in arbitrary latent transition models.

Circularity Check

0 steps flagged

No circularity: purely empirical observations with no derivations or self-referential steps

full rationale

The paper's central claims rest on direct empirical demonstrations of attractor behavior in latent transitions within the Dreamer family of models, using experimental rollouts and uncertainty estimates across selected environments. No mathematical derivations, fitted parameters renamed as predictions, ansatzes, or load-bearing self-citations are present in the provided text or abstract. The analysis does not invoke uniqueness theorems, prior author work for core premises, or any chain that reduces a result to its own inputs by construction. This is a standard empirical limitation study whose validity hinges on experimental reproducibility rather than internal logical closure.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work is an empirical study; no free parameters, new entities, or non-standard axioms are introduced or fitted in the abstract. Relies on standard domain assumptions about latent space models in RL.

axioms (1)
  • Domain assumption: Latent space models can be trained to approximate environment dynamics from image observations.
    Implicit in the use of RSSM and Dreamer architectures for the empirical tests.

pith-pipeline@v0.9.0 · 5448 in / 1188 out tokens · 59662 ms · 2026-05-07T16:52:41.455421+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

44 extracted references · 8 canonical work pages · 5 internal anchors

  1. [1]

    On uncertainty in deep state space models for model-based reinforcement learning

    Philipp Becker and Gerhard Neumann. On uncertainty in deep state space models for model-based reinforcement learning. Transactions on Machine Learning Research (TMLR), 2022

  2. [2]

    Combining reconstruction and contrastive methods for multimodal representations in RL

    Philipp Becker, Sebastian Mossburger, Fabian Otto, and Gerhard Neumann. Combining reconstruction and contrastive methods for multimodal representations in RL. Reinforcement Learning Conference (RLC), 2024

  3. [3]

    Dota 2 with Large Scale Deep Reinforcement Learning

    Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław Dębiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, et al. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019

  4. [4]

    Sample-efficient reinforcement learning with stochastic ensemble value expansion

    Jacob Buckman, Danijar Hafner, George Tucker, Eugene Brevdo, and Honglak Lee. Sample-efficient reinforcement learning with stochastic ensemble value expansion. Advances in Neural Information Processing Systems (NeurIPS), 31, 2018

  5. [5]

    Deep reinforcement learning in a handful of trials using probabilistic dynamics models

    Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. Advances in Neural Information Processing Systems (NeurIPS), 31, 2018

  6. [6]

    Magnetic control of tokamak plasmas through deep reinforcement learning

    Jonas Degrave, Federico Felici, Jonas Buchli, Michael Neunert, Brendan Tracey, Francesco Carpanese, Timo Ewalds, Roland Hafner, Abbas Abdolmaleki, Diego de Las Casas, et al. Magnetic control of tokamak plasmas through deep reinforcement learning. Nature, 602(7897): 414–419, 2022

  7. [7]

    PILCO: A model-based and data-efficient approach to policy search

    Marc Deisenroth and Carl E Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 465–472, 2011

  8. [8]

    Model-value inconsistency as a signal for epistemic uncertainty

    Angelos Filos, Eszter Vértes, Zita Marinho, Gregory Farquhar, Diana Borsa, Abram Friesen, Feryal Behbahani, Tom Schaul, André Barreto, and Simon Osindero. Model-value inconsistency as a signal for epistemic uncertainty. International Conference on Learning Representations (ICLR), 2022

  9. [9]

    Trust the model where it trusts itself--model-based actor-critic with uncertainty-aware rollout adaption

    Bernd Frauenknecht, Artur Eisele, Devdutt Subhasish, Friedrich Solowjow, and Sebastian Trimpe. Trust the model where it trusts itself--model-based actor-critic with uncertainty-aware rollout adaption. International Conference on Machine Learning (ICML), 2024

  10. [10]

    On rollouts in model-based reinforcement learning

    Bernd Frauenknecht, Devdutt Subhasish, Friedrich Solowjow, and Sebastian Trimpe. On rollouts in model-based reinforcement learning. International Conference on Learning Representations (ICLR), 2025

  11. [11]

    Dynamical variational autoencoders: A comprehensive review

    Laurent Girin, Simon Leglaive, Xiaoyu Bie, Julien Diard, Thomas Hueber, and Xavier Alameda-Pineda. Dynamical variational autoencoders: A comprehensive review. Foundations and Trends in Machine Learning, 15(1–2): 1–175, 2022

  12. [12]

    Towards an interpretable latent space in structured models for video prediction

    Rushil Gupta, Vishal Sharma, Yash Jain, Yitao Liang, Guy Van den Broeck, and Parag Singla. Towards an interpretable latent space in structured models for video prediction. arXiv preprint arXiv:2107.07713, 2021

  13. [13]

    World models

    David Ha and Jürgen Schmidhuber. World models. Conference on Neural Information Processing Systems (NeurIPS), 2(3): 440, 2018

  14. [14]

    Dream to Control: Learning Behaviors by Latent Imagination

    Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. arXiv preprint arXiv:1912.01603, 2019a

  15. [15]

    Learning latent dynamics for planning from pixels

    Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. In International Conference on Machine Learning (ICML), pp. 2555–2565. PMLR, 2019b

  16. [16]

    Mastering Atari with Discrete World Models

    Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with discrete world models. arXiv preprint arXiv:2010.02193, 2020

  17. [17]

    Mastering Diverse Domains through World Models

    Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023

  18. [18]

    Temporal difference learning for model predictive control

    Nicklas Hansen, Xiaolong Wang, and Hao Su. Temporal difference learning for model predictive control. International Conference on Machine Learning (ICML), 2022

  19. [19]

    TD-MPC2: Scalable, robust world models for continuous control

    Nicklas Hansen, Hao Su, and Xiaolong Wang. TD-MPC2: Scalable, robust world models for continuous control. International Conference on Learning Representations (ICLR), 2024

  20. [20]

    When to trust your model: Model-based policy optimization

    Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization. Advances in Neural Information Processing Systems (NeurIPS), 32, 2019

  21. [21]

    Auto-Encoding Variational Bayes

    Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013

  22. [22]

    Bidirectional model-based policy optimization

    Hang Lai, Jian Shen, Weinan Zhang, and Yong Yu. Bidirectional model-based policy optimization. In International Conference on Machine Learning (ICML), pp. 5618–5627. PMLR, 2020

  23. [23]

    Simple and scalable predictive uncertainty estimation using deep ensembles

    Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in Neural Information Processing Systems (NeurIPS), 30, 2017

  24. [24]

    Guided policy search

    Sergey Levine and Vladlen Koltun. Guided policy search. In International Conference on Machine Learning (ICML), pp. 1–9. PMLR, 2013

  25. [25]

    FLD: Fourier latent dynamics for structured motion representation and learning

    Chenhao Li, Elijah Stanger-Jones, Steve Heim, and Sangbae Kim. FLD: Fourier latent dynamics for structured motion representation and learning. International Conference on Learning Representations (ICLR), 2024

  26. [26]

    Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning

    Anusha Nagabandi, Gregory Kahn, Ronald S Fearing, and Sergey Levine. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 7559–7566. IEEE, 2018

  27. [27]

    On the Jensen–Shannon symmetrization of distances relying on abstract means

    Frank Nielsen. On the Jensen–Shannon symmetrization of distances relying on abstract means. Entropy, 21(5): 485, 2019

  28. [28]

    Four principles for physically interpretable world models

    Jordan Peper, Zhenjiang Mao, Yuang Geng, Siyuan Pan, and Ivan Ruchkin. Four principles for physically interpretable world models. IEEE International Conference on Robotics and Automation (ICRA), 2025

  29. [29]

    Transformer-based world models are happy with 100k interactions

    Jan Robine, Marc Höftmann, Tobias Uelwer, and Stefan Harmeling. Transformer-based world models are happy with 100k interactions. International Conference on Learning Representations (ICLR), 2023

  30. [30]

    Model-based reinforcement learning via latent-space collocation

    Oleh Rybkin, Chuning Zhu, Anusha Nagabandi, Kostas Daniilidis, Igor Mordatch, and Sergey Levine. Model-based reinforcement learning via latent-space collocation. In International Conference on Machine Learning (ICML), pp. 9190–9201. PMLR, 2021

  31. [31]

    Curious exploration via structured world models yields zero-shot object manipulation

    Cansu Sancaktar, Sebastian Blaes, and Georg Martius. Curious exploration via structured world models yields zero-shot object manipulation. Advances in Neural Information Processing Systems (NeurIPS), 35: 24170–24183, 2022

  32. [32]

    Sensei: Semantic exploration guided by foundation models to learn versatile world models

    Cansu Sancaktar, Christian Gumbsch, Andrii Zadaianchuk, Pavel Kolev, and Georg Martius. Sensei: Semantic exploration guided by foundation models to learn versatile world models. arXiv preprint arXiv:2503.01584, 2025

  33. [33]

    Planning to explore via self-supervised world models

    Ramanan Sekar, Oleh Rybkin, Kostas Daniilidis, Pieter Abbeel, Danijar Hafner, and Deepak Pathak. Planning to explore via self-supervised world models. In International Conference on Machine Learning (ICML), pp. 8583–8592. PMLR, 2020

  34. [34]

    Uncertainty-aware latent safety filters for avoiding out-of-distribution failures

    Junwon Seo, Kensuke Nakamura, and Andrea Bajcsy. Uncertainty-aware latent safety filters for avoiding out-of-distribution failures. Conference on Robot Learning (CoRL), 2025

  35. [35]

    Learning to plan optimistically: Uncertainty-guided deep exploration via latent model ensembles

    Tim Seyde, Wilko Schwarting, Sertac Karaman, and Daniela Rus. Learning to plan optimistically: Uncertainty-guided deep exploration via latent model ensembles. Computing Research Repository (CoRR), 2020

  36. [36]

    Optimistic active exploration of dynamical systems

    Bhavya Sukhija, Lenart Treven, Cansu Sancaktar, Sebastian Blaes, Stelian Coros, and Andreas Krause. Optimistic active exploration of dynamical systems. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Advances in Neural Information Processing Systems, volume 36, pp. 38122–38153. Curran Associates, Inc., 2023

  37. [37]

    Reinforcement learning: An introduction

    Richard S Sutton and Andrew G. Barto. Reinforcement learning: An introduction. 2018

  38. [38]

    DeepMind Control Suite

    Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, Timothy Lillicrap, and Martin Riedmiller. DeepMind Control Suite, 2018

  39. [39]

    MuJoCo: A physics engine for model-based control

    Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. IEEE, 2012

  40. [40]

    Dynamic horizon value estimation for model-based reinforcement learning

    Junjie Wang, Qichao Zhang, Dongbin Zhao, Mengchen Zhao, and Jianye Hao. Dynamic horizon value estimation for model-based reinforcement learning. arXiv preprint arXiv:2009.09593, 2020

  41. [41]

    Information theoretic MPC for model-based reinforcement learning

    Grady Williams, Nolan Wagener, Brian Goldfain, Paul Drews, James M Rehg, Byron Boots, and Evangelos A Theodorou. Information theoretic MPC for model-based reinforcement learning. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 1714–1721. IEEE, 2017

  42. [42]

    MOPO: Model-based offline policy optimization

    Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Y Zou, Sergey Levine, Chelsea Finn, and Tengyu Ma. MOPO: Model-based offline policy optimization. Advances in Neural Information Processing Systems (NeurIPS), 33: 14129–14142, 2020

  43. [43]

    DINO-WM: World models on pre-trained visual features enable zero-shot planning

    Gaoyue Zhou, Hengkai Pan, Yann LeCun, and Lerrel Pinto. DINO-WM: World models on pre-trained visual features enable zero-shot planning. International Conference on Machine Learning (ICML), 2024

  44. [44]

    Bridging imagination and reality for model-based deep reinforcement learning

    Guangxiang Zhu, Minghao Zhang, Honglak Lee, and Chongjie Zhang. Bridging imagination and reality for model-based deep reinforcement learning. Advances in Neural Information Processing Systems (NeurIPS), 33: 8993–9006, 2020
