pith. machine review for the scientific record.

arxiv: 1912.01603 · v3 · submitted 2019-12-03 · 💻 cs.LG · cs.AI · cs.RO

Recognition: 2 theorem links

· Lean Theorem

Dream to Control: Learning Behaviors by Latent Imagination

Pith reviewed 2026-05-12 01:11 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.RO
keywords reinforcement learning · world models · latent space · visual control · model-based planning · gradient optimization

The pith

Dreamer learns behaviors for visual control tasks by propagating gradients through imagined trajectories in a learned latent world model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Dreamer, an agent that learns a world model from high-dimensional image inputs and then derives behaviors by backpropagating analytic gradients of learned state values through trajectories imagined entirely within the model's compact latent space. Behavior learning therefore requires no direct environment interaction, even for long-horizon tasks. On 20 challenging visual control tasks, Dreamer exceeds prior methods in data efficiency, computation time, and final performance.

Core claim

Dreamer is a reinforcement learning agent that solves long-horizon tasks from images purely by latent imagination. Behaviors are learned efficiently by propagating analytic gradients of learned state values back through trajectories imagined in the compact state space of a learned world model.
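The idea of propagating analytic gradients back through an imagined trajectory can be illustrated with a deliberately tiny sketch. It is not Dreamer's implementation: the scalar linear latent dynamics, the linear policy `u = theta * s`, the quadratic reward, the quadratic stand-in for a learned value function, and every name and constant below are assumptions chosen so that the backward pass can be written out by hand. The sketch rolls the policy forward inside the toy model, differentiates the imagined return through every step, and compares the result with a finite-difference estimate.

```python
# Toy illustration only: scalar latent state, linear dynamics and policy.
# All names and constants are hypothetical, not taken from the paper.

A, B = 0.9, 0.5      # assumed latent dynamics: s' = A*s + B*u
GAMMA = 0.99         # discount factor
C = 1.0              # assumed value function: v(s) = -C * s**2


def imagine(theta, s0, horizon):
    """Roll the policy u = theta*s forward in the model; return (objective, states)."""
    states = [s0]
    objective = 0.0
    for t in range(horizon):
        s = states[-1]
        u = theta * s                          # policy action
        s_next = A * s + B * u                 # imagined transition
        objective -= GAMMA ** t * s_next ** 2  # reward r(s) = -s**2
        states.append(s_next)
    objective -= GAMMA ** horizon * C * states[-1] ** 2  # bootstrap with the value
    return objective, states


def analytic_grad(theta, s0, horizon):
    """d(objective)/d(theta) by reverse-mode differentiation through the rollout."""
    _, states = imagine(theta, s0, horizon)
    s_last = states[horizon]
    # Gradient w.r.t. the final state: last reward term plus value term.
    g = GAMMA ** (horizon - 1) * (-2.0 * s_last) + GAMMA ** horizon * (-2.0 * C * s_last)
    d_theta = 0.0
    for t in range(horizon - 1, -1, -1):
        d_theta += g * B * states[t]           # direct effect of theta on s_{t+1}
        g *= A + B * theta                     # chain rule through s_t -> s_{t+1}
        if t >= 1:
            g += GAMMA ** (t - 1) * (-2.0 * states[t])  # reward received at s_t
    return d_theta


def finite_diff_grad(theta, s0, horizon, eps=1e-6):
    """Central finite-difference check of the analytic gradient."""
    up, _ = imagine(theta + eps, s0, horizon)
    down, _ = imagine(theta - eps, s0, horizon)
    return (up - down) / (2.0 * eps)


theta, s0, horizon = -0.3, 1.0, 15
print("analytic:", analytic_grad(theta, s0, horizon))
print("finite difference:", finite_diff_grad(theta, s0, horizon))
```

In this toy setting a single gradient-ascent step on `theta` raises the imagined objective without any call to a real environment. Whether that improvement carries over to the real system depends on how accurate the model is, which is the premise the rest of this page examines.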

What carries the argument

Latent imagination, the process of generating and optimizing trajectories inside the learned world model's state space to derive control policies.

If this is right

  • Learning in latent space reduces the need for real environment interactions, improving data efficiency.
  • Gradient propagation through imagined rollouts enables faster optimization compared to sampling-based methods.
  • The method achieves higher final performance on visual control tasks.
  • Computation time is reduced because planning happens in a compact latent representation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • World models that support long-horizon accuracy could enable planning in even more complex domains like robotics with high-dimensional sensors.
  • If the latent space captures dynamics well, this could reduce the sample complexity of reinforcement learning in general.
  • Extending the imagination horizon might require better uncertainty handling in the world model to prevent error accumulation.

Load-bearing premise

The learned world model must stay accurate enough over long imagined horizons for the optimized policies to work when executed in the actual environment.

What would settle it

Measure the world model's multi-step prediction error over the planning horizon, then inflate that error artificially and test whether policies learned via latent imagination still perform as expected in the real environment.
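One way to make the measurement half of this test concrete is an open-loop, multi-step prediction error computed on held-out trajectories. The sketch below is an assumption-laden toy, not the paper's evaluation: `true_step` and `learned_step` are hypothetical scalar systems standing in for the environment and the world model, and `collect` and `open_loop_mse` are illustrative helper names. The model is started from the true initial state, fed only the recorded actions, and scored against the recorded states at each prediction depth.

```python
# Toy illustration only: the "true" and "learned" dynamics are hypothetical
# scalar linear systems standing in for the environment and the world model.

def true_step(s, u):
    return 0.9 * s + 0.5 * u


def learned_step(s, u):
    return 0.88 * s + 0.5 * u     # deliberately slightly wrong


def collect(step, s0, actions):
    """Generate one held-out trajectory as (states, actions)."""
    states = [s0]
    for u in actions:
        states.append(step(states[-1], u))
    return states, list(actions)


def open_loop_mse(model_step, trajectories, horizon):
    """Mean squared error of k-step open-loop predictions, for k = 1..horizon."""
    sq_err = [0.0] * horizon
    counts = [0] * horizon
    for states, actions in trajectories:
        pred = states[0]                          # start from the true initial state
        for k in range(min(horizon, len(actions))):
            pred = model_step(pred, actions[k])   # feed predictions back in
            sq_err[k] += (pred - states[k + 1]) ** 2
            counts[k] += 1
    return [e / n if n else float("nan") for e, n in zip(sq_err, counts)]


held_out = [collect(true_step, 1.0, [0.2] * 5), collect(true_step, 0.5, [0.0] * 5)]
for k, err in enumerate(open_loop_mse(learned_step, held_out, horizon=5), start=1):
    print(f"{k}-step MSE: {err:.6f}")
```

Reporting the error per prediction depth, rather than as a single average, is what separates short-horizon fidelity from long-horizon fidelity. The same harness could be rerun with a deliberately degraded `learned_step` to probe how much model error a learned policy tolerates.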

Original abstract

Learned world models summarize an agent's experience to facilitate learning complex behaviors. While learning world models from high-dimensional sensory inputs is becoming feasible through deep learning, there are many potential ways for deriving behaviors from them. We present Dreamer, a reinforcement learning agent that solves long-horizon tasks from images purely by latent imagination. We efficiently learn behaviors by propagating analytic gradients of learned state values back through trajectories imagined in the compact state space of a learned world model. On 20 challenging visual control tasks, Dreamer exceeds existing approaches in data-efficiency, computation time, and final performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces Dreamer, a model-based RL agent that learns a recurrent state-space model (RSSM) from high-dimensional image observations and derives policies by propagating analytic gradients of learned state values through imagined trajectories in the compact latent space, without requiring real-environment rollouts during planning. It reports that this latent imagination approach yields better data efficiency, lower computation time, and higher final performance than prior methods across 20 visual control tasks.

Significance. If the central performance claims hold, the work provides strong empirical evidence that gradient-based optimization over long-horizon latent trajectories can produce transferable behaviors, advancing sample-efficient model-based RL for visual domains. Credit is due for the breadth of evaluation (20 diverse tasks, multiple baselines, ablation studies) and for supplying implementation details that support reproducibility of the world model and imagination procedure.

major comments (1)
  1. [§4 and Appendix] §4 (Experiments) and Appendix: the central claim that analytic gradients through long-horizon imagined trajectories produce policies that transfer to the real environment rests on the RSSM remaining sufficiently accurate; however, no separate quantitative evaluation of multi-step prediction MSE or horizon-length sensitivity is reported on held-out real trajectories independent of task success. This leaves open whether gains derive primarily from short-horizon fidelity plus the actor-critic rather than reliable long-horizon latent imagination.
minor comments (2)
  1. [§3.1] §3.1: the RSSM transition and observation model equations would benefit from an explicit statement of the exact loss terms used for each component to improve clarity for readers implementing the method.
  2. [Figure 4] Figure 4: the caption should specify the precise imagination horizon length and number of gradient steps used for the reported curves to allow direct comparison with the ablation results.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive review and positive recommendation of minor revision. The feedback helps strengthen the presentation of the latent imagination approach. We address the single major comment below.

Point-by-point responses
  1. Referee: [§4 and Appendix] §4 (Experiments) and Appendix: the central claim that analytic gradients through long-horizon imagined trajectories produce policies that transfer to the real environment rests on the RSSM remaining sufficiently accurate; however, no separate quantitative evaluation of multi-step prediction MSE or horizon-length sensitivity is reported on held-out real trajectories independent of task success. This leaves open whether gains derive primarily from short-horizon fidelity plus the actor-critic rather than reliable long-horizon latent imagination.

    Authors: We appreciate the referee's emphasis on isolating the contribution of long-horizon model accuracy. The empirical results across 20 tasks show Dreamer outperforming both model-free agents and prior model-based methods that lack comparable long-horizon latent planning; such gains would be difficult to achieve if the RSSM were limited to short-horizon fidelity. That said, we agree that explicit quantitative metrics would provide additional clarity. In the revised manuscript we will add multi-step prediction MSE evaluated on held-out real trajectories (independent of the RL objective) in the appendix, together with an expanded analysis of performance as a function of imagination horizon length. These additions will be presented separately from task success to directly address the concern.

    Revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation separates model learning from policy optimization via independent empirical validation.

Full rationale

The paper's core chain learns an RSSM world model from real experience via variational inference, then optimizes actor-critic parameters by back-propagating value gradients through finite-horizon imagined latent trajectories. Neither the model parameters nor the policy objective reduce to a fitted input by construction; the imagined trajectories are generated from the learned dynamics and the final performance is measured on held-out real-environment rollouts across 20 tasks. Self-citations to prior RSSM work supply the model architecture but do not bear the load of the behavior-learning claim, which is tested externally rather than being tautological. No self-definitional equations, fitted-input predictions, or uniqueness theorems imported from overlapping authors appear in the derivation.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard MDP assumptions and the ability of the RSSM to learn useful dynamics; several architecture and optimization hyperparameters are tuned but are not load-bearing for the conceptual contribution.

free parameters (2)
  • imagination horizon length
    Chosen to trade off planning depth against computation; affects how far gradients are propagated.
  • RSSM and actor-critic network sizes and learning rates
    Standard deep RL hyperparameters tuned on validation tasks.
axioms (2)
  • domain assumption: The environment dynamics can be captured by a latent state-space model that generalizes to imagined trajectories.
    Invoked throughout the world model training and imagination procedure.
  • domain assumption: Gradients through the imagined model provide a useful learning signal for the policy.
    Core justification for backpropagating through latent rollouts instead of model-free updates.

pith-pipeline@v0.9.0 · 5394 in / 1376 out tokens · 44745 ms · 2026-05-12T01:11:22.310464+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 44 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. JEDI: Joint Embedding Diffusion World Model for Online Model-Based Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    JEDI is the first online end-to-end latent diffusion world model that trains latents from denoising loss rather than reconstruction, achieving competitive Atari100k results with 43% less VRAM and over 3x faster sampli...

  2. One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

    cs.CV 2026-05 conditional novelty 7.0

    Reducing visual input to one token per frame in VLA world models maintains or improves long-horizon performance on MetaWorld, LIBERO, and real-robot tasks.

  3. Operator-Guided Invariance Learning for Continuous Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    VPSD-RL discovers exact and approximate value-preserving Lie-group operators in continuous RL to stabilize learning via transition augmentation and consistency regularization.

  4. Latent State Design for World Models under Sufficiency Constraints

    cs.AI 2026-05 unverdicted novelty 7.0

    World models succeed when their latent states are built to meet task-specific sufficiency constraints rather than preserving the maximum amount of information.

  5. RopeDreamer: A Kinematic Recurrent State Space Model for Dynamics of Flexible Deformable Linear Objects

    cs.RO 2026-04 unverdicted novelty 7.0

    RopeDreamer uses quaternionic kinematic chains in a recurrent state space model with a dual decoder to cut open-loop prediction error by 40.52% over 50 steps on simulated DLO trajectories while preserving physical con...

  6. Mask World Model: Predicting What Matters for Robust Robot Policy Learning

    cs.RO 2026-04 unverdicted novelty 7.0

    Mask World Model predicts semantic mask dynamics with video diffusion and integrates it with a diffusion policy head, outperforming RGB world models on LIBERO and RLBench while showing better real-world generalization...

  7. Beyond Static Forecasting: Unleashing the Power of World Models for Mobile Traffic Extrapolation

    cs.NI 2026-04 unverdicted novelty 7.0

    MobiWM is a multimodal world model for mobile networks that learns state-action dynamics to enable unlimited-horizon counterfactual traffic simulations and optimization.

  8. MoRight: Motion Control Done Right

    cs.CV 2026-04 unverdicted novelty 7.0

    MoRight disentangles object and camera motion via canonical-view specification and temporal cross-view attention, while decomposing motion into active user-driven and passive consequence components to learn and apply ...

  9. Training Agents Inside of Scalable World Models

    cs.AI 2025-09 conditional novelty 7.0

    Dreamer 4 is the first agent to obtain diamonds in Minecraft from only offline data by reinforcement learning inside a scalable world model that accurately predicts game mechanics.

  10. Mastering Diverse Domains through World Models

    cs.AI 2023-01 unverdicted novelty 7.0

    DreamerV3 uses world models and robustness techniques to solve over 150 tasks across domains with a single configuration, including Minecraft diamond collection from scratch.

  11. Mastering Atari with Discrete World Models

    cs.LG 2020-10 accept novelty 7.0

    DreamerV2 reaches human-level performance on 55 Atari games by learning behaviors inside a separately trained discrete-latent world model.

  12. Zero-Shot Sim-to-Real Robot Learning: A Dexterous Manipulation Study on Reactive Catching

    cs.RO 2026-05 unverdicted novelty 6.0

    DRIS improves zero-shot sim-to-real transfer for reactive catching by maintaining and acting on sets of randomized dynamics instances instead of single instances per episode.

  13. One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

    cs.CV 2026-05 unverdicted novelty 6.0

    Reducing visual input to one token per frame in world models for vision-language-action policies maintains long-horizon performance while improving success rates on MetaWorld, LIBERO, and real-robot tasks.

  14. One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

    cs.CV 2026-05 unverdicted novelty 6.0

    Reducing visual input to one token per frame via adaptive attention pooling and a unified flow-matching objective improves long-horizon performance in VLA policies on MetaWorld, LIBERO, and real-robot tasks.

  15. LaWM: Least Action World Models for Long-Horizon Physical Consistency from Visual Observations

    cs.LG 2026-05 unverdicted novelty 6.0

    LaWM induces latent transitions from a learned discrete variational principle rather than an unconstrained neural predictor, yielding improved physical consistency on synthetic dynamics and robot benchmarks.

  16. Predictive but Not Plannable: RC-aux for Latent World Models

    cs.LG 2026-05 unverdicted novelty 6.0

    RC-aux corrects spatiotemporal mismatch in reconstruction-free latent world models by adding multi-horizon prediction and reachability supervision, improving planning performance on goal-conditioned pixel-control tasks.

  17. Learning to Theorize the World from Observation

    cs.LG 2026-05 unverdicted novelty 6.0

    NEO induces compositional latent programs as world theories from observations and executes them to enable explanation-driven generalization.

  18. TRAP: Tail-aware Ranking Attack for World-Model Planning

    cs.LG 2026-05 unverdicted novelty 6.0

    TRAP is a tail-aware ranking attack that plants a backdoor in world models so that a trigger causes the model to reorder a few critical imagined trajectories and redirect planning while preserving normal behavior on c...

  19. Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    Odysseus adapts PPO with a turn-level critic and leverages pretrained VLM action priors to train agents achieving at least 3x average game progress over frontier models in long-horizon Super Mario Land.

  20. Biased Dreams: Limitations to Epistemic Uncertainty Quantification in Latent Space Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Latent transitions in models like Dreamer are biased toward dense regions, creating attractors that hide true dynamics discrepancies and cause epistemic uncertainty to be unreliable while overestimating rewards.

  21. Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

    cs.CV 2026-04 unverdicted novelty 6.0

    OneVL achieves superior accuracy to explicit chain-of-thought reasoning at answer-only latency by supervising latent tokens with a visual world model decoder that predicts future frames.

  22. Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

    cs.CV 2026-04 unverdicted novelty 6.0

    OneVL is the first latent CoT method to exceed explicit CoT accuracy on four driving benchmarks while running at answer-only speed, by supervising latent tokens with a visual world model decoder.

  23. Learning Ad Hoc Network Dynamics via Graph-Structured World Models

    cs.LG 2026-04 unverdicted novelty 6.0

    G-RSSM learns per-node dynamics in wireless ad hoc networks via graph attention and trains clustering policies through imagined rollouts, generalizing from N=50 training to larger networks.

  24. Structured State-Space Regularization for Compact and Generation-Friendly Image Tokenization

    cs.CV 2026-04 unverdicted novelty 6.0

    A new regularizer transfers frequency awareness from state-space models into image tokenizers, yielding more compact latents that improve diffusion-model generation quality with little reconstruction penalty.

  25. Simple but Stable, Fast and Safe: Achieve End-to-end Control by High-Fidelity Differentiable Simulation

    cs.RO 2026-04 conditional novelty 6.0

    An end-to-end RL policy trained via high-fidelity differentiable simulation maps depth images straight to bodyrate commands, achieving top success rates, low jerk, and zero-shot real-world generalization up to 7.5 m/s...

  26. Zero-shot World Models Are Developmentally Efficient Learners

    cs.AI 2026-04 unverdicted novelty 6.0

    A zero-shot visual world model trained on one child's experience achieves broad competence on physical understanding benchmarks while matching developmental behavioral patterns.

  27. Behavior-Constrained Reinforcement Learning with Receding-Horizon Credit Assignment for High-Performance Control

    cs.RO 2026-04 unverdicted novelty 6.0

    A behavior-constrained RL framework with receding-horizon credit assignment learns high-performance control policies that stay aligned with expert behavior in race car simulation.

  28. Safety, Security, and Cognitive Risks in World Models

    cs.CR 2026-04 unverdicted novelty 6.0

    World models enable efficient AI planning but create risks from adversarial corruption, goal misgeneralization, and human bias, demonstrated via attacks that amplify errors and reduce rewards on models like RSSM and D...

  29. Dreamer-CDP: Improving Reconstruction-free World Models Via Continuous Deterministic Representation Prediction

    cs.LG 2026-03 unverdicted novelty 6.0

    Dreamer-CDP achieves reconstruction-free world modeling via a JEPA-style predictor on continuous deterministic representations and matches Dreamer's performance on Crafter.

  30. World Action Models are Zero-shot Policies

    cs.RO 2026-02 unverdicted novelty 6.0

    DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment ...

  31. Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning

    cs.AI 2026-01 conditional novelty 6.0

    Single-stage fine-tuning of a video model to generate actions as latent frames plus future states and values yields state-of-the-art robot policy performance on LIBERO, RoboCasa, and bimanual tasks.

  32. V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    cs.AI 2025-06 unverdicted novelty 6.0

    V-JEPA 2 pre-trained on massive unlabeled video achieves strong results on motion understanding and action anticipation, SOTA video QA at 8B scale, and enables zero-shot robotic planning on Franka arms using only 62 h...

  33. R3M: A Universal Visual Representation for Robot Manipulation

    cs.RO 2022-03 unverdicted novelty 6.0

    A visual encoder pre-trained on diverse human videos with contrastive and language objectives improves simulated robot manipulation success by over 20% versus training from scratch and enables real Franka arm tasks fr...

  34. Transferable Delay-Aware Reinforcement Learning via Implicit Causal Graph Modeling

    cs.LG 2026-05 unverdicted novelty 5.0

    A delay-aware RL approach learns transferable structured representations and dynamics via implicit causal graphs, outperforming baselines on delayed DMC tasks and accelerating adaptation to new tasks.

  35. Nautilus: From One Prompt to Plug-and-Play Robot Learning

    cs.RO 2026-05 unverdicted novelty 5.0

    NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.

  36. Neural Control: Adjoint Learning Through Equilibrium Constraints

    cs.RO 2026-05 unverdicted novelty 5.0

    Neural Control introduces adjoint-based differentiation through implicit equilibrium constraints to enable memory-efficient gradient computation and robust receding-horizon MPC for multi-stable deformable object manip...

  37. World-Value-Action Model: Implicit Planning for Vision-Language-Action Systems

    cs.RO 2026-04 unverdicted novelty 5.0

    The World-Value-Action model enables implicit planning for VLA systems by performing inference over a learned latent representation of high-value future trajectories instead of direct action prediction.

  38. CausalVAE as a Plug-in for World Models: Towards Reliable Counterfactual Dynamics

    cs.LG 2026-04 unverdicted novelty 5.0

    CausalVAE plug-in for world models preserves factual prediction and boosts counterfactual retrieval, with large gains on physics benchmarks and recovered physical interaction trends.

  39. Neural Computers

    cs.LG 2026-04 unverdicted novelty 5.0

    Neural Computers are introduced as a new machine form where computation, memory, and I/O are unified in a learned runtime state, with initial video-model experiments showing acquisition of basic interface primitives f...

  40. World Action Models: The Next Frontier in Embodied AI

    cs.RO 2026-05 unverdicted novelty 4.0

    The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

  41. OpenWorldLib: A Unified Codebase and Definition of Advanced World Models

    cs.CV 2026-04 unverdicted novelty 4.0

    OpenWorldLib offers a standardized codebase and definition for world models that combine perception, interaction, and memory to understand and predict the world.

  42. World Simulation with Video Foundation Models for Physical AI

    cs.CV 2025-10 unverdicted novelty 4.0

    Cosmos-Predict2.5 unifies text-to-world, image-to-world, and video-to-world generation in one model trained on 200M clips with RL post-training, delivering improved quality and control for physical AI.

  43. Cosmos World Foundation Model Platform for Physical AI

    cs.CV 2025-01 unverdicted novelty 3.0

    The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.

  44. Redefining End-of-Life: Intelligent Automation for Electronics Remanufacturing Systems

    eess.SY 2026-04 unverdicted novelty 2.0

    A literature review of intelligent automation approaches using robotics, AI, and control for disassembly, inspection, sorting, and reprocessing of end-of-life electronics.

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · cited by 41 Pith papers · 10 internal anchors

  1. [1]

    A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy. Deep variational information bottleneck. arXiv preprint arXiv:1612.00410,

  2. [2]

    E. Banijamali, R. Shu, M. Ghavamzadeh, H. Bui, and A. Ghodsi. Robust locally-linear controllable embedding. arXiv preprint arXiv:1710.05373,

  3. [3]

    Distributed Distributional Deterministic Policy Gradients

    G. Barth-Maron, M. W. Hoffman, D. Budden, W. Dabney, D. Horgan, A. Muldal, N. Heess, and T. Lillicrap. Distributed distributional deterministic policy gradients. arXiv preprint arXiv:1804.08617,

  4. [4]

    DeepMind Lab

    C. Beattie, J. Z. Leibo, D. Teplyashin, T. Ward, M. Wainwright, H. Küttler, A. Lefrancq, S. Green, V. Valdés, A. Sadik, et al. Deepmind lab. arXiv preprint arXiv:1612.03801,

  5. [5]

    Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

    Y. Bengio, N. Léonard, and A. Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432,

  6. [6]

    Learning and Querying Fast Generative Models for Reinforcement Learning

    L. Buesing, T. Weber, S. Racaniere, S. Eslami, D. Rezende, D. P. Reichert, F. Viola, F. Besse, K. Gregor, D. Hassabis, et al. Learning and querying fast generative models for reinforcement learning. arXiv preprint arXiv:1802.03006,

  7. [7]

    Imagined Value Gradients: Model-Based Policy Optimization With Transferable Latent Dynamics Models

    A. Byravan, J. T. Springenberg, A. Abdolmaleki, R. Hafner, M. Neunert, T. Lampe, N. Siegel, N. Heess, and M. Riedmiller. Imagined value gradients: Model-based policy optimization with transferable latent dynamics models. arXiv preprint arXiv:1910.04142,

  8. [8]

    P. S. Castro, S. Moitra, C. Gelada, S. Kumar, and M. G. Bellemare. Dopamine: A research framework for deep reinforcement learning. arXiv preprint arXiv:1812.06110,

  9. [9]

    Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)

    D.-A. Clevert, T. Unterthiner, and S. Hochreiter. Fast and accurate deep network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289,

  10. [10]

    J. V. Dillon, I. Langmore, D. Tran, E. Brevdo, S. Vasudevan, D. Moore, B. Patton, A. Alemi, M. Hoffman, and R. A. Saurous. Tensorflow distributions. arXiv preprint arXiv:1711.10604,

  11. [11]

    Probabilistic Recurrent State-Space Models

    A. Doerr, C. Daniel, M. Schiegg, D. Nguyen-Tuong, S. Schaal, M. Toussaint, and S. Trimpe. Probabilistic recurrent state-space models. arXiv preprint arXiv:1801.10395,

  12. [12]

    Self-Supervised Visual Planning with Temporal Skip Connections

    F. Ebert, C. Finn, A. X. Lee, and S. Levine. Self-supervised visual planning with temporal skip connections. arXiv preprint arXiv:1710.05268,

  13. [13]

    L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, et al. Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561,

  14. [14]

    Model-Based Value Estimation for Efficient Model-Free Reinforcement Learning

    V. Feinberg, A. Wan, I. Stoica, M. I. Jordan, J. E. Gonzalez, and S. Levine. Model-based value estimation for efficient model-free reinforcement learning. arXiv preprint arXiv:1803.00101, 2018.

  15. [15]

    DeepMDP: Learning Continuous Latent Space Models for Representation Learning

    10 Published as a conference paper at ICLR 2020 C. Gelada, S. Kumar, J. Buckman, O. Nachum, and M. G. Bellemare. Deepmdp: Learning continuous latent space models for representation learning. arXiv preprint arXiv:1906.02736,

  16. [16]

    K. Gregor, D. J. Rezende, F. Besse, Y. Wu, H. Merzic, and A. v. d. Oord. Shaping belief states with generative environment models for rl. arXiv preprint arXiv:1906.09237,

  17. [17]

    Z. D. Guo, M. G. Azar, B. Piot, B. A. Pires, T. Pohlen, and R. Munos. Neural predictive belief representations. arXiv preprint arXiv:1811.06407,

  18. [18]

    World Models

    D. Ha and J. Schmidhuber. World models. arXiv preprint arXiv:1803.10122,

  19. [19]

    Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor

    T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290,

  20. [20]

    Learning Latent Dynamics for Planning from Pixels

    D. Hafner, T. Lillicrap, I. Fischer, R. Villegas, D. Ha, H. Lee, and J. Davidson. Learning latent dynamics for planning from pixels. arXiv preprint arXiv:1811.04551,

  21. [22]

    Model-Based Planning with Discrete and Continuous Actions

    M. Henaff, W. F. Whitney, and Y. LeCun. Model-based planning with discrete and continuous actions. arXiv preprint arXiv:1705.07177,

  22. [23]

    M. Henaff, A. Canziani, and Y. LeCun. Model-predictive policy learning with uncertainty regularization for driving in dense traffic. arXiv preprint arXiv:1901.02705,

  23. [24]

    Reinforcement Learning with Unsupervised Auxiliary Tasks

    M. Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D. Silver, and K. Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397,

  24. [25]

    Model-Based Reinforcement Learning for Atari

    L. Kaiser, M. Babaeizadeh, P. Milos, B. Osinski, R. H. Campbell, K. Czechowski, D. Erhan, C. Finn, P. Kozakowski, S. Levine, et al. Model-based reinforcement learning for atari. arXiv preprint arXiv:1903.00374,

  25. [26]

    M. Karl, M. Soelch, J. Bayer, and P. van der Smagt. Deep variational bayes filters: Unsupervised learning of state space models from raw data. arXiv preprint arXiv:1605.06432,

  26. [27]

    D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980,

  27. [28]

    D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114,

  28. [29]

    R. G. Krishnan, U. Shalit, and D. Sontag. Deep kalman filters. arXiv preprint arXiv:1511.05121,

  29. [30]

    Model-Ensemble Trust-Region Policy Optimization

    T. Kurutach, I. Clavera, Y. Duan, A. Tamar, and P. Abbeel. Model-ensemble trust-region policy optimization. arXiv preprint arXiv:1802.10592,

  30. [31]

    Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural computation, 1(4):541–551,

  31. [32]

    A. X. Lee, A. Nagabandi, P. Abbeel, and S. Levine. Stochastic latent actor-critic: Deep reinforcement learning with a latent variable model. arXiv preprint arXiv:1907.00953,

  32. [33]

    T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971,

  33. [34]

    Plan Online, Learn Offline: Efficient Learning and Exploration via Model-Based Control

    K. Lowrey, A. Rajeswaran, S. Kakade, E. Todorov, and I. Mordatch. Plan online, learn offline: Efficient learning and exploration via model-based control. arXiv preprint arXiv:1811.01848,

  34. [35]

    Formal Limitations on the Measurement of Mutual Information

    D. McAllester and K. Stratos. Formal limitations on the measurement of mutual information. arXiv preprint arXiv:1811.04251,

  35. [36]

    V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937,

  36. [37]

    A. v. d. Oord, Y. Li, and O. Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748,

  37. [38]

    PIPPS: Flexible Model-Based Policy Search Robust to the Curse of Chaos

    P. Parmas, C. E. Rasmussen, J. Peters, and K. Doya. Pipps: Flexible model-based policy search robust to the curse of chaos. arXiv preprint arXiv:1902.01240,

  38. [39]

    Learning Real-World Robot Policies by Dreaming

    A. Piergiovanni, A. Wu, and M. S. Ryoo. Learning real-world robot policies by dreaming. arXiv preprint arXiv:1805.07813,

  39. [40]

    On Variational Bounds of Mutual Information

    B. Poole, S. Ozair, A. v. d. Oord, A. A. Alemi, and G. Tucker. On variational bounds of mutual information. arXiv preprint arXiv:1905.06922,

  40. [41]

    D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082,

  41. [42]

    Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model

    J. Schrittwieser, I. Antonoglou, T. Hubert, K. Simonyan, L. Sifre, S. Schmitt, A. Guez, E. Lockhart, D. Hassabis, T. Graepel, et al. Mastering atari, go, chess and shogi by planning with a learned model. arXiv preprint arXiv:1911.08265,

  42. [43]

    Proximal Policy Optimization Algorithms

    J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347,

  43. [44]

    Universal Planning Networks

    A. Srinivas, A. Jabri, P. Abbeel, S. Levine, and C. Finn. Universal planning networks. arXiv preprint arXiv:1804.00645,

  44. [45]

    DeepMind Control Suite

    Y. Tassa, Y. Doron, A. Muldal, T. Erez, Y. Li, D. d. L. Casas, D. Budden, A. Abdolmaleki, J. Merel, A. Lefrancq, et al. Deepmind control suite. arXiv preprint arXiv:1801.00690,

  45. [46]

    The information bottleneck method

    N. Tishby, F. C. Pereira, and W. Bialek. The information bottleneck method. arXiv preprint physics/0004057,

  46. [47]

    Exploring Model-based Planning with Policy Networks

    T. Wang and J. Ba. Exploring model-based planning with policy networks. arXiv preprint arXiv:1906.08649,

  47. [48]

    T. Wang, X. Bao, I. Clavera, J. Hoang, Y. Wen, E. Langlois, S. Zhang, G. Zhang, P. Abbeel, and J. Ba. Benchmarking model-based reinforcement learning. CoRR, abs/1907.02057,

  48. [49]

    Imagination-Augmented Agents for Deep Reinforcement Learning

    T. Weber, S. Racanière, D. P. Reichert, L. Buesing, A. Guez, D. J. Rezende, A. P. Badia, O. Vinyals, N. Heess, Y. Li, et al. Imagination-augmented agents for deep reinforcement learning. arXiv preprint arXiv:1707.06203,

  49. [50]

    (2018), and implement all other functions as three dense layers of size 300 with ELU activations (Clevert et al., 2015)

    Appendix A, Hyper Parameters. Model components: We use the convolutional encoder and decoder networks from Ha and Schmidhuber (2018), the RSSM of Hafner et al. (2018), and implement all other functions as three dense layers of size 300 with ELU activations (Clevert et al., 2015). Distributions in latent space are 30-...

  50. [51]

    The imagination horizon is H = 15 and the same trajectories are used to update both action and value models

    but clip them below 3 free nats as in PlaNet. The imagination horizon is H = 15 and the same trajectories are used to update both action and value models. We compute the Vλ targets with γ = 0.99 and λ = 0.95. We did not find latent overshooting for learning the model, an entropy bonus for the action model, or target networks for the value model necessary. ...

  51. [52]

    for latent dynamics models, $\max\; I(s_{1:T}; (o_{1:T}, r_{1:T}) \mid a_{1:T}) - \beta\, I(s_{1:T}, i_{1:T} \mid a_{1:T})$ (13), where $\beta$ is a scalar and $i_t$ are dataset indices that determine the observations $p(o_t \mid i_t) \doteq \delta(o_t - \bar{o}_t)$ as in Alemi et al. (2016). Maximizing the objective leads to model states that can predict the sequence of observations and rewards while limiting the amoun...

  52. [53]

    and DeepMind Lab (Beattie et al., 2016). While agents that purely learn through world models are not yet competitive in these domains (Kaiser et al., 2019), the tasks offer a diverse test bed with visual complexity, sparse rewards, and early termination. Agents observe 64 × 64 × 3 images and select one of between 3 and 18 actions. For Atari, we follow the...
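
The hyperparameter excerpt in entry [51] states an imagination horizon of H = 15 and Vλ targets computed with γ = 0.99 and λ = 0.95. As a rough illustration of how such targets can be computed, the sketch below uses the standard backward recursion for λ-returns over one imagined trajectory. The exact formulation used by the authors is not shown in this excerpt, so the recursion, the function name `lambda_returns`, and the argument layout are assumptions made for illustration only.

```python
import math
from typing import List


def lambda_returns(
    rewards: List[float],
    next_values: List[float],
    gamma: float = 0.99,
    lam: float = 0.95,
) -> List[float]:
    """Compute lambda-return targets for one imagined trajectory.

    Hypothetical helper, not the authors' code.
    rewards[t]     : predicted reward at imagined step t
    next_values[t] : value estimate of the state reached after step t
    The final entry of next_values bootstraps the return at the horizon.
    """
    if len(rewards) != len(next_values):
        raise ValueError("rewards and next_values must have equal length")
    horizon = len(rewards)
    targets = [0.0] * horizon
    # At the horizon there is no further rollout, so start from the value estimate.
    next_return = next_values[-1]
    for t in reversed(range(horizon)):
        # Mix the one-step bootstrap with the longer return, weighted by lam.
        blended = (1.0 - lam) * next_values[t] + lam * next_return
        next_return = rewards[t] + gamma * blended
        targets[t] = next_return
    return targets


# Example with the horizon and discount settings quoted in the excerpt.
H = 15
example_targets = lambda_returns([1.0] * H, [0.0] * H, gamma=0.99, lam=0.95)
print(len(example_targets))
```

Setting `lam=0.0` reduces the recursion to one-step temporal-difference targets, while `lam=1.0` gives the discounted sum of imagined rewards bootstrapped with the value at the horizon; intermediate values trade off between the two.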