Mitigating Value Hallucination in Dyna Planning via Multistep Predecessor Models
Dyna-style reinforcement learning (RL) agents improve sample efficiency over model-free RL agents by updating the value function with simulated experience generated by an environment model. However, it is often difficult to learn accurate models of environment dynamics, and even small errors may cause Dyna agents to fail. In this paper, we highlight that one potential cause of that failure is bootstrapping off the values of simulated states, and we introduce a new Dyna algorithm to avoid this failure. We discuss a design space of Dyna algorithms, based on using successor or predecessor models (simulating forwards or backwards) and using one-step or multi-step updates. Three of the variants have been explored, but surprisingly the fourth has not: using predecessor models with multi-step updates. We present the Hallucinated Value Hypothesis (HVH): updating the values of real states towards the values of simulated states can result in misleading action values that adversely affect the control policy. We discuss and evaluate all four variants of Dyna. Three of them update real states toward simulated states, and so potentially toward hallucinated values; our proposed approach does not. The experimental results provide evidence for the HVH, and suggest that using predecessor models with multi-step updates is a promising direction toward developing Dyna algorithms that are more robust to model error.
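The contrast the abstract draws between the two update directions can be illustrated with a small tabular sketch. This is not the paper's algorithm: the function names, the tabular `Q` array, the greedy bootstrap, and the step size and discount values are all assumptions made for illustration. The first function moves a real state-action value toward the value of a simulated next state; the second only updates simulated predecessors, using a multi-step return that bootstraps off the real state.

```python
import numpy as np

def successor_update(Q, s_real, a, r_sim, s_sim, alpha=0.5, gamma=0.9):
    """One-step successor-model update (hypothetical helper).

    The real pair (s_real, a) bootstraps off the value of a simulated
    next state s_sim, so a wrong s_sim can leak into a real state's value.
    """
    target = r_sim + gamma * Q[s_sim].max()
    Q[s_real, a] += alpha * (target - Q[s_real, a])

def multistep_predecessor_update(Q, s_real, backward_rollout, alpha=0.5, gamma=0.9):
    """Multi-step predecessor-model update (hypothetical helper).

    backward_rollout lists simulated (state, action, reward) triples,
    nearest to s_real first. Every target is a multi-step return that
    ends in the value of the real state; Q[s_real] itself is never changed.
    """
    G = Q[s_real].max()  # bootstrap only off the real state
    for s_sim, a_sim, r_sim in backward_rollout:
        G = r_sim + gamma * G
        Q[s_sim, a_sim] += alpha * (G - Q[s_sim, a_sim])

# Toy usage: 4 states, 2 actions, real state 3 has a known value.
Q = np.zeros((4, 2))
Q[3, 0] = 1.0
multistep_predecessor_update(Q, s_real=3, backward_rollout=[(2, 0, 0.0), (1, 1, 0.0)])
```

In the toy usage, only the simulated predecessors (states 2 and 1) change, while the row for the real state 3 is left untouched. If the model's predecessors are wrong, the error lands on states the agent may never visit rather than on the value of a state it actually occupies.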
Forward citations
Cited by 2 Pith papers
- Advantage-Guided Diffusion for Model-Based Reinforcement Learning
  Advantage-guided diffusion (SAG and EAG) steers sampling in diffusion world models to higher-advantage trajectories, enabling policy improvement and better sample efficiency on MuJoCo tasks.
- Safety, Security, and Cognitive Risks in World Models
  World models enable efficient AI planning but create risks from adversarial corruption, goal misgeneralization, and human bias, demonstrated via attacks that amplify errors and reduce rewards on models like RSSM and D...