
arXiv: 2605.04709 · v1 · submitted 2026-05-06 · cs.LG · cs.RO · cs.SY · eess.SY

ELVIS: Ensemble-Calibrated Latent Imagination for Long-Horizon Visual MPC

Pith reviewed 2026-05-08 16:31 UTC · model grok-4.3

classification cs.LG · cs.RO · cs.SY · eess.SY
keywords visual model predictive control · latent imagination · ensemble critics · model-based reinforcement learning · long-horizon planning · Gaussian mixture MPPI · recurrent state space model · uncertainty-aware returns

The pith

ELVIS stabilizes long-horizon visual planning by maintaining multiple coherent hypotheses with Gaussian-mixture MPPI and calibrating returns via an ensemble of latent critics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ELVIS to solve the problem of brittle long rollouts in visual model-based reinforcement learning, where learned latent dynamics produce branching futures and compounding errors that grow quickly under visual occlusions. It replaces ordinary unimodal trajectory sampling with a Gaussian-mixture version of model predictive path integral control that keeps several distinct action sequences alive across many time steps instead of averaging them together. In parallel, an ensemble of critics supplies an uncertainty-aware score that dynamically adjusts the balance between bootstrapped values and full-horizon returns, using the same signal both to train the actor-critic and to rank candidate plans. The method is demonstrated on fourteen standard visual control benchmarks and on a physical robot task involving heavy occlusions. A sympathetic reader would care because reliable multi-step imagination is required before model-based agents can be trusted in real environments where observations are incomplete and horizons are long.

Core claim

ELVIS plans inside a Dreamer-style recurrent state space model by replacing standard unimodal MPPI with Gaussian-mixture MPPI that maintains multiple coherent hypotheses over long horizons and thereby avoids mode averaging under branching rollouts. In parallel it stabilizes deep imagination with a shared uncertainty-aware lambda-return: an ensemble of latent critics supplies an upper-confidence-bound score that gates a time-varying lambda, adaptively trading bootstrapping against look-ahead to limit compounding model error. The identical return is used both to train the actor-critic prior on imagined data and to score trajectories inside the mixture planner, aligning the reinforcement-learn-

What carries the argument

Gaussian-mixture MPPI guided by an ensemble-derived uncertainty-aware lambda-return inside a recurrent state space model
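
To make the planner side of this concrete, here is a minimal sketch of a mixture-style MPPI update on a toy problem with two equally good outcomes. It is not the paper's implementation. The function names, the toy cost `two_goal_cost`, the fixed per-component noise `std`, and the rule for reweighting components are all assumptions made for illustration. In particular, the paper describes fitting a Gaussian mixture to the sampled trajectories, whereas this sketch keeps a fixed set of components and updates each one only from its own samples.

```python
import math
import random


def softmin_weights(costs, temperature):
    """MPPI-style weights: exp(-(cost - min_cost) / temperature), normalized."""
    c_min = min(costs)
    raw = [math.exp(-(c - c_min) / temperature) for c in costs]
    total = sum(raw)
    return [r / total for r in raw]


def mixture_mppi_step(means, mix, cost_fn, std, n_samples, temperature, rng):
    """One update of a K-component mixture over action sequences.

    means: K action sequences (lists of floats); mix: K mixture weights.
    Each component is updated only from its own samples, so distinct
    modes are not averaged into one another.
    """
    new_means, comp_costs = [], []
    for mean in means:
        samples = [[m + rng.gauss(0.0, std) for m in mean] for _ in range(n_samples)]
        costs = [cost_fn(s) for s in samples]
        w = softmin_weights(costs, temperature)
        new_means.append(
            [sum(wi * s[t] for wi, s in zip(w, samples)) for t in range(len(mean))]
        )
        comp_costs.append(sum(wi * c for wi, c in zip(w, costs)))
    # Assumed reweighting rule: prior weight times softmin of the weighted cost.
    scores = softmin_weights(comp_costs, temperature)
    unnorm = [m * s for m, s in zip(mix, scores)]
    total = sum(unnorm)
    return new_means, [u / total for u in unnorm]


def two_goal_cost(actions):
    """Toy branching task: the final position sum(actions) should reach -1 or +1."""
    x = sum(actions)
    return min((x - 1.0) ** 2, (x + 1.0) ** 2)


rng = random.Random(0)
means = [[-0.1] * 4, [0.1] * 4]  # two hypotheses over a horizon of 4 steps
mix = [0.5, 0.5]
for _ in range(10):
    means, mix = mixture_mppi_step(means, mix, two_goal_cost, 0.3, 256, 0.1, rng)

# Assumed control extraction: execute the first action of the heaviest component.
best = max(range(len(mix)), key=lambda k: mix[k])
first_action = means[best][0]
```

In this toy setting each component settles near its own goal, one hypothesis ending near -1 and the other near +1. Picking the first action of the heaviest component is only one possible way to aggregate the modes into a single control.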

If this is right

  • Multiple coherent hypotheses are preserved across extended rollouts instead of collapsing to a single averaged mode.
  • The time-varying lambda trades off bootstrapping and look-ahead to contain compounding error from visual occlusions.
  • Training and planning objectives remain aligned because the same ensemble return serves both the actor-critic and the trajectory scorer.
  • State-of-the-art results appear on fourteen DeepMind Control Suite visual tasks relative to TD-MPC2 and DreamerV3.
  • Zero-shot transfer succeeds to a real-world sand-spraying task, improving surface-quality metrics under severe occlusions.
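
The time-varying lambda in the second bullet can be sketched as follows. This is an illustrative reading, not the paper's definition. The recursion is the standard lambda-return with a per-step lambda. The logistic gate in `gated_lambdas`, its `center` and `scale` parameters, and the choice to bootstrap from the ensemble mean are assumptions. The only property taken from the source is the direction of the gate: a higher UCB score gives a smaller lambda (more bootstrapping), and a lower score gives a larger lambda (deeper look-ahead).

```python
import math
from statistics import fmean, pstdev


def ucb_scores(ensemble_values, beta):
    """ensemble_values[k][t] is critic k's value at step t.

    Returns mean + beta * (population std) across critics for each step.
    """
    per_step = list(zip(*ensemble_values))
    return [fmean(vs) + beta * pstdev(vs) for vs in per_step]


def gated_lambdas(ucb, lam_min, lam_max, center, scale):
    """Hypothetical gate: a decreasing logistic map from UCB score to lambda."""
    out = []
    for u in ucb:
        z = max(-50.0, min(50.0, (u - center) / scale))  # clamp to avoid overflow
        out.append(lam_min + (lam_max - lam_min) / (1.0 + math.exp(z)))
    return out


def lambda_returns(rewards, values, lambdas, gamma):
    """Lambda-return targets with a per-step lambda.

    rewards[t] for t = 0..H-1; values[t] and lambdas[t] for t = 0..H.
    Recursion: G_t = r_t + gamma * ((1 - lam_{t+1}) * v_{t+1} + lam_{t+1} * G_{t+1}),
    with G_H = v_H. Returns [G_0, ..., G_{H-1}].
    """
    horizon = len(rewards)
    g = values[horizon]
    out = [0.0] * horizon
    for t in reversed(range(horizon)):
        lam = lambdas[t + 1]
        g = rewards[t] + gamma * ((1.0 - lam) * values[t + 1] + lam * g)
        out[t] = g
    return out


# Made-up values from three critics over an imagined rollout with H = 3.
ensemble = [
    [1.0, 1.2, 0.9, 2.0],
    [1.1, 1.0, 1.5, 2.4],
    [0.9, 1.1, 0.3, 1.6],
]
mean_values = [fmean(vs) for vs in zip(*ensemble)]
ucb = ucb_scores(ensemble, beta=1.0)
lams = gated_lambdas(ucb, lam_min=0.5, lam_max=0.95, center=1.5, scale=0.5)
targets = lambda_returns([0.1, 0.2, 0.3], mean_values, lams, gamma=0.99)
```

Setting every lambda to 1 recovers the full-horizon discounted return with a terminal bootstrap, and setting every lambda to 0 recovers one-step TD targets, which is a quick way to sanity-check an implementation of this kind.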

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same ensemble calibration could be applied to other latent planners that suffer from distribution shift over long horizons.
  • Explicit mixture components may provide interpretability by revealing which future modes the planner judges most valuable.
  • Uncertainty quantification in the critics may serve as a general regularizer for deep model predictive control beyond the visual case.
  • The real-robot transfer suggests the method can shrink the simulation-to-reality gap without task-specific fine-tuning.

Load-bearing premise

The ensemble upper-confidence-bound score on lambda-returns limits compounding model error during extended planning without introducing instabilities into the Gaussian-mixture sampler.
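
One way to probe this premise is to check whether critic disagreement rises and falls with actual rollout prediction error. The sketch below computes a Spearman rank correlation between two per-timestep series. The helper names and the numbers in the example are made up for illustration and are not results from the paper; which error measure to use (for example latent or reconstruction error on held-out trajectories) is left open.

```python
import math


def ranks(xs):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    out = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences with nonzero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Synthetic placeholder series, one entry per imagined timestep.
ensemble_std = [0.05, 0.08, 0.20, 0.45, 0.30]
prediction_error = [0.01, 0.03, 0.10, 0.50, 0.25]
rho = spearman(ensemble_std, prediction_error)
```

A rank correlation is used here because only the ordering matters for a gate: the premise needs high-disagreement steps to be the high-error steps, not a linear relationship between the two.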

What would settle it

A controlled experiment on a new visual task with strong branching or occlusions in which the planner produces visibly inconsistent trajectory samples or falls below the performance of TD-MPC2 and DreamerV3 would falsify the claim that the combined machinery reliably contains error accumulation.

Figures

Figures reproduced from arXiv: 2605.04709 by Pinhao Song, Renaud Detry, Yurui Du, Yutong Hu.

Figure 1: RSSM world model learning under partial observability. An encoder infers stochastic latents z_t from observations conditioned on a recurrent memory state h_t, which is updated by a deterministic transition given actions. The learned prior predicts future latents, while decoders reconstruct observations and rewards, yielding a compact belief state ŝ_t = (h_t, z_t) for latent imagination and downstream plannin…
Figure 2: GMM-based long-horizon MPPI for multimodal trajectory distributions. Over long horizons, sampled rollouts diverge and form multiple distinct high-reward modes. We fit a Gaussian Mixture Model (GMM) to the trajectory (or action-sequence) samples to capture this multimodality, then perform MPPI-style weighting and control extraction per mode before aggregating into a single action. This reduces mode collapse…
Figure 3: Imaginary TD learning with UCB-gated λ-returns. We train actor–critic priors from RSSM-imagined rollouts using an ensemble-UCB score to set a time-varying λ_t in the λ-return targets. High-UCB states induce smaller λ_t (greater bootstrapping), while low-UCB states induce larger λ_t (deeper look-ahead), yielding stable yet exploratory value and policy learning for MPPI warm-starting. b) Soft truncation with ti…
Figure 4: DMC visual control learning curves. Per-task learning curves on 14 DeepMind Control (DMC) visual control benchmarks, together with an aggregated score that reports the mean episodic return averaged across all 14 tasks at each environment step. Shaded regions denote 95% confidence intervals over 5 random seeds. ELVIS achieves the strongest overall performance, ranking first or second on every task.
Figure 5: Ablations of ELVIS. Aggregated learning curves on the same 14 DMC visual control tasks, where the score is the mean episodic return averaged across tasks at each environment step. Shaded regions denote 95% confidence intervals over 5 random seeds. Removing any of GMM, uncertainty awareness, or long-horizon planning degrades sample efficiency and final performance, indicating that all three components contr…
Figure 6: Sand-spraying testbed used for zero-shot sim-to-real
Figure 8: Zero-shot evaluation with real-world sand spray task. For each method, the first row shows grayscale scene images for visualization only, while the policy itself acts on the corresponding heightmaps shown in the second row. In the sim-to-real experiment, ELVIS is most robust to partial observability caused by dust and sensory noise (unobservable white parts of the heightmaps) and achieves the best surface …
Original abstract

A central challenge of visual control with model-based reinforcement learning (RL) is reliable long-horizon planning: long rollouts with learned latent dynamics exhibit branching futures and multi-modal action-value distributions. In addition, compounding model errors amplified by visual occlusions make deep imagination brittle. We present ELVIS, a latent model predictive controller (MPC) designed to make long-horizon planning practical. ELVIS plans in a Dreamer-style recurrent state space model (RSSM) and replaces standard unimodal model predictive path integral (MPPI) with a Gaussian-mixture MPPI that maintains multiple coherent hypotheses over long horizons, avoiding mode averaging under branching rollouts. In parallel, ELVIS stabilizes deep imagination with a shared uncertainty-aware lambda-return: an ensemble of latent critics defines an upper-confidence-bound (UCB) score that gates a time-varying lambda, adaptively trading off bootstrapping versus look-ahead to limit compounding error during planning. The same return is used both to train an actor-critic prior from imagined rollouts and to score candidate trajectories inside GMM-MPPI, aligning RL objectives with the planner's long-horizon optimization. On fourteen DeepMind Control Suite visual tasks, ELVIS establishes state-of-the-art performance compared with TD-MPC2 and DreamerV3. Finally, ELVIS transfers zero-shot to a real-world sand-spraying task with severe occlusions, improving surface-quality metrics and demonstrating robustness beyond simulation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces ELVIS, a latent MPC method for visual control that operates in a Dreamer-style RSSM. It replaces standard MPPI with a Gaussian-mixture MPPI to preserve multiple coherent hypotheses over long horizons and introduces a shared uncertainty-aware lambda-return computed from an ensemble of latent critics via UCB gating; this return is used both for actor-critic training on imagined trajectories and for scoring plans inside the GMM-MPPI. The paper reports state-of-the-art performance on fourteen DeepMind Control Suite visual tasks relative to TD-MPC2 and DreamerV3, together with zero-shot transfer to a real-world sand-spraying task under severe occlusions.

Significance. If the empirical results and the claimed mechanism hold, ELVIS would constitute a practical engineering advance for long-horizon visual planning by directly addressing branching futures and compounding model error. The alignment of the same return signal between RL training and MPC scoring is a clean design choice that could influence subsequent work on uncertainty-aware imagination.

major comments (3)
  1. §3.2 (Ensemble-calibrated lambda-return): The central claim that the UCB-gated, time-varying lambda sufficiently bounds compounding RSSM error rests on the assumption that the latent-critic ensemble variance tracks visual-occlusion-induced model error; the manuscript provides no diagnostic plots or correlation analysis between UCB scores and actual rollout prediction error, leaving the load-bearing link between ensemble calibration and long-horizon stability unverified.
  2. §3.3 (GMM-MPPI): The assertion that the Gaussian-mixture MPPI maintains coherent hypotheses over extended rollouts is load-bearing for the multi-modal planning claim, yet the paper reports no quantitative metric (e.g., mode-diversity entropy or hypothesis-separation distance) or ablation against unimodal MPPI on long-horizon tasks; without such evidence the improvement over standard MPPI remains unsubstantiated.
  3. §5 (Experiments): The SOTA and zero-shot transfer claims are presented without reported ablations on ensemble size, UCB coefficient, or lambda bounds, nor any analysis of lambda trajectories during planning; these omissions make it impossible to assess whether the proposed components are necessary or whether the performance gains could be obtained by simpler baselines.
minor comments (2)
  1. Notation for the time-varying lambda and the UCB score should be introduced with explicit equations rather than prose descriptions to improve reproducibility.
  2. The real-world sand-spraying experiment would benefit from a short description of the observation model and any domain-randomization steps used during training.
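
As an illustration of the kind of explicit notation the first minor comment asks for, one possible formalization is written out below. It is a sketch under assumptions, not the paper's equations: the ensemble size $K$, the coefficient $\beta$, the bounds $\lambda_{\min}, \lambda_{\max}$, the gate parameters $c$ and $\tau$, the logistic form of the gate, and the use of the ensemble mean as the bootstrap value are all hypothetical. Only the direction of the gate (higher UCB, smaller $\lambda_t$) and the general form of a $\lambda$-return are taken from the text.

```latex
% Ensemble statistics over K latent critics V_k evaluated at belief state s_t
\mu_t = \frac{1}{K}\sum_{k=1}^{K} V_k(s_t), \qquad
\sigma_t = \sqrt{\frac{1}{K}\sum_{k=1}^{K}\bigl(V_k(s_t)-\mu_t\bigr)^2}

% Upper-confidence-bound score
u_t = \mu_t + \beta\,\sigma_t

% Assumed gate: lambda decreases monotonically in the UCB score
\lambda_t = \lambda_{\min} + \frac{\lambda_{\max}-\lambda_{\min}}{1+\exp\bigl((u_t-c)/\tau\bigr)}

% Lambda-return with a time-varying lambda over an imagined horizon H
G^{\lambda}_{H} = \mu_H, \qquad
G^{\lambda}_{t} = r_t + \gamma\Bigl[(1-\lambda_{t+1})\,\mu_{t+1} + \lambda_{t+1}\,G^{\lambda}_{t+1}\Bigr],
\quad t = H-1,\dots,0
```

With $\lambda_t \equiv 1$ the recursion reduces to the discounted sum of imagined rewards plus a terminal bootstrap, and with $\lambda_t \equiv 0$ it reduces to one-step TD targets, so the gate interpolates between these two extremes step by step.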

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful and constructive review, which highlights both the potential of ELVIS and areas where additional evidence would strengthen the manuscript. We address each major comment below and will incorporate the suggested revisions.

Point-by-point responses
  1. Referee: §3.2 (Ensemble-calibrated lambda-return): The central claim that the UCB-gated, time-varying lambda sufficiently bounds compounding RSSM error rests on the assumption that the latent-critic ensemble variance tracks visual-occlusion-induced model error; the manuscript provides no diagnostic plots or correlation analysis between UCB scores and actual rollout prediction error, leaving the load-bearing link between ensemble calibration and long-horizon stability unverified.

    Authors: We agree that direct diagnostic evidence would strengthen the mechanistic claim. In the revised manuscript we will add (i) scatter plots and correlation coefficients between per-timestep UCB scores and RSSM prediction error on held-out trajectories, and (ii) example lambda trajectories under varying occlusion levels. These additions will explicitly verify that ensemble variance tracks model error. The zero-shot real-world transfer under severe occlusions already provides supporting empirical evidence that the calibration improves long-horizon stability. revision: yes

  2. Referee: §3.3 (GMM-MPPI): The assertion that the Gaussian-mixture MPPI maintains coherent hypotheses over extended rollouts is load-bearing for the multi-modal planning claim, yet the paper reports no quantitative metric (e.g., mode-diversity entropy or hypothesis-separation distance) or ablation against unimodal MPPI on long-horizon tasks; without such evidence the improvement over standard MPPI remains unsubstantiated.

    Authors: We acknowledge that quantitative support for hypothesis coherence is currently missing. We will add an ablation of GMM-MPPI versus standard unimodal MPPI on the longest-horizon tasks, together with two new metrics: (1) entropy of the mixture-component weights as a measure of mode diversity, and (2) average pairwise distance between sampled trajectories from different components as a measure of hypothesis separation. These results will be reported in §5 to substantiate the multi-modal planning benefit. revision: yes

  3. Referee: §5 (Experiments): The SOTA and zero-shot transfer claims are presented without reported ablations on ensemble size, UCB coefficient, or lambda bounds, nor any analysis of lambda trajectories during planning; these omissions make it impossible to assess whether the proposed components are necessary or whether the performance gains could be obtained by simpler baselines.

    Authors: We agree that more extensive ablations are needed to isolate the contribution of each component. In the revision we will report performance for ensemble sizes {3,5,10}, a sweep of the UCB coefficient, and different lambda bounds. We will also include plots of lambda trajectories during planning to illustrate the adaptive trade-off. These ablations will clarify necessity relative to simpler baselines. revision: yes
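
The two metrics proposed in the second response (entropy of the mixture-component weights and average pairwise distance between trajectories from different components) could be computed along the following lines. This is a minimal sketch, not the authors' evaluation code. The function names, the use of Euclidean distance on flattened trajectories, and the example inputs are assumptions.

```python
import math


def mixture_entropy(weights):
    """Shannon entropy (in nats) of mixture weights.

    0 means all mass sits on one component; log(K) means K equally weighted modes.
    """
    return -sum(w * math.log(w) for w in weights if w > 0.0)


def cross_component_distance(samples_by_component):
    """Mean Euclidean distance between trajectories drawn from different components.

    samples_by_component[k] is a list of trajectories, each a flat list of floats.
    """
    total, count = 0.0, 0
    n_comp = len(samples_by_component)
    for a in range(n_comp):
        for b in range(a + 1, n_comp):
            for x in samples_by_component[a]:
                for y in samples_by_component[b]:
                    total += math.dist(x, y)
                    count += 1
    return total / count if count else 0.0


# Illustrative inputs: two components with two short trajectories each.
weights = [0.5, 0.5]
samples = [
    [[0.0, 0.0], [0.0, 1.0]],
    [[3.0, 4.0], [3.0, 5.0]],
]
diversity = mixture_entropy(weights)
separation = cross_component_distance(samples)
```

Entropy alone cannot distinguish two well-separated modes from two nearly identical components with equal weights, which is why pairing it with a separation distance is useful.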

Circularity Check

0 steps flagged

No significant circularity; empirical engineering contribution with independent benchmark validation.

full rationale

The paper presents ELVIS as a practical combination of RSSM dynamics, ensemble latent critics for UCB-gated lambda-returns, and GMM-MPPI planning. No equations, derivations, or self-citations are shown that reduce the SOTA or zero-shot claims to fitted inputs, self-definitions, or prior author results by construction. Performance is established via direct comparison on 14 DeepMind Control Suite tasks and real-world transfer, with the central assumptions (ensemble calibration limiting compounding error, GMM coherence) left as empirical engineering choices rather than tautological. This is the normal case of a self-contained applied method.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, the central claim rests on the unstated assumption that the underlying Dreamer-style RSSM provides sufficiently accurate latent dynamics for the ensemble calibration to be effective; no explicit free parameters, axioms, or invented entities are enumerated.


