Valdi: Value Diffusion World Models

Christopher Lindenberg; Kashyap Chitta

arxiv: 2607.00917 · v1 · pith:FSPAKF62new · submitted 2026-07-01 · 💻 cs.LG · cs.AI

Pith reviewed 2026-07-02 15:44 UTC · model grok-4.3

keywords: world models, diffusion models, model predictive control, latent dynamics, reinforcement learning, CarRacing

The pith

Valdi trains a latent diffusion world model end-to-end so one diffusion step suffices for model-predictive control.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Valdi as a way to make diffusion-based world models usable for online model predictive control. It trains the latent diffusion dynamics model end-to-end so that a single diffusion step at both training and inference produces predictions fast enough for control loops. In CarRacing experiments this single-step version reaches the same performance as a deterministic MLP baseline. The setup also reveals a direct trade-off between how multimodal the predictions are and how well the resulting controller performs.

Core claim

Valdi combines end-to-end online training for MPC with a latent diffusion dynamics model. In preliminary experiments on the CarRacing environment, Valdi using a single diffusion step at both training and inference matches a deterministic MLP baseline, while exposing a trade-off between predictive multimodality and control performance.

What carries the argument

latent diffusion dynamics model trained end-to-end for model predictive control
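
As a rough illustration of how a one-step latent denoiser could sit inside a planning loop, the sketch below scores randomly sampled action sequences by rolled-out reward plus a terminal value, then returns the first action of the best sequence. Everything in it is an assumption made for illustration: the random linear maps stand in for trained networks, `denoise_once` and `plan` are hypothetical names, and the step-by-step rollout with a random-shooting planner is a simplification. None of it is taken from the Valdi code.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM, ACTION_DIM, HORIZON, N_CANDIDATES = 8, 2, 5, 64

# Hypothetical stand-ins for learned networks (untrained random linear maps).
W_z = rng.normal(scale=0.1, size=(LATENT_DIM, LATENT_DIM))
W_a = rng.normal(scale=0.1, size=(ACTION_DIM, LATENT_DIM))
W_n = rng.normal(scale=0.1, size=(LATENT_DIM, LATENT_DIM))
w_r = rng.normal(size=LATENT_DIM)  # reward head
w_v = rng.normal(size=LATENT_DIM)  # value head


def denoise_once(z, a, noise):
    """One denoiser call: map (current latent, action, noise) to a next latent."""
    return np.tanh(z @ W_z + a @ W_a + noise @ W_n)


def plan(z0, gamma=0.99):
    """Random-shooting MPC: score sampled action sequences, return the best first action."""
    actions = rng.uniform(-1.0, 1.0, size=(N_CANDIDATES, HORIZON, ACTION_DIM))
    z = np.repeat(z0[None, :], N_CANDIDATES, axis=0)
    score = np.zeros(N_CANDIDATES)
    for t in range(HORIZON):
        noise = rng.normal(size=z.shape)
        z = denoise_once(z, actions[:, t], noise)  # a single step per transition
        score += (gamma ** t) * (z @ w_r)          # accumulated predicted reward
    score += (gamma ** HORIZON) * (z @ w_v)        # terminal value bootstrap
    best = int(np.argmax(score))
    return actions[best, 0], score
```

Each candidate costs one network call per planning step in this sketch, which is the latency argument for the single-step regime; a multi-step sampler would multiply that cost by the step count.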

If this is right

A single diffusion step at inference is sufficient to reach deterministic-level control performance in this environment.
End-to-end training removes the need for separate pre-training of the diffusion model before it can be used in MPC.
Increasing the number of diffusion steps to capture more multimodality can degrade closed-loop control performance.
The same architecture supports both fast deterministic-like planning and uncertain dynamics modeling depending on step count.
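
To make the "step count" knob in the points above concrete, here is a minimal sketch of a deterministic sampler whose number of denoising steps is an inference-time argument. It assumes a linear interpolation schedule `z_tau = (1 - tau) * z + tau * eps` and a denoiser that predicts the clean latent; both are illustrative choices, not the schedule or parameterization reported for Valdi, and `sample_latent` is a hypothetical name.

```python
import numpy as np


def sample_latent(denoiser, cond, shape, n_steps, rng):
    """Deterministic sampler for z_tau = (1 - tau) * z + tau * eps, tau in [0, 1]."""
    z_tau = rng.normal(size=shape)             # start from pure noise at tau = 1
    taus = np.linspace(1.0, 0.0, n_steps + 1)  # n_steps >= 1
    for tau, tau_next in zip(taus[:-1], taus[1:]):
        z_hat = denoiser(z_tau, tau, cond)               # predicted clean latent
        eps_hat = (z_tau - (1.0 - tau) * z_hat) / tau    # noise implied by that prediction
        z_tau = (1.0 - tau_next) * z_hat + tau_next * eps_hat
    return z_tau
```

With `n_steps=1` the loop runs once and simply returns the denoiser's prediction from pure noise; larger values re-inject the implied noise at intermediate levels, which is where sample diversity can enter.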

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The single-step regime may extend to other continuous-control benchmarks where latency constraints dominate.
The observed multimodality-control trade-off suggests tuning the diffusion schedule specifically for value estimation rather than full trajectory distribution matching.
If the latent diffusion step count can be made adaptive at runtime, the same model could switch between fast and high-fidelity regimes without retraining.

Load-bearing premise

A single diffusion step in the latent space is expressive enough to support effective model-predictive control when the model is trained end-to-end.

What would settle it

An experiment in CarRacing where the single-step Valdi controller achieves lower cumulative reward than the deterministic MLP baseline under identical training and evaluation conditions.
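
One way such a comparison could be scored, sketched under the assumption that each method yields a list of per-seed cumulative returns, is Welch's t statistic for two samples with unequal variances. The function name `welch_t` is hypothetical, and the paper does not state which test, if any, it uses.

```python
import numpy as np


def welch_t(returns_a, returns_b):
    """Welch's t statistic for two independent samples (needs >= 2 values each, non-zero variance)."""
    a = np.asarray(returns_a, dtype=float)
    b = np.asarray(returns_b, dtype=float)
    var_a = a.var(ddof=1) / len(a)  # squared standard error of mean A
    var_b = b.var(ddof=1) / len(b)  # squared standard error of mean B
    return float((a.mean() - b.mean()) / np.sqrt(var_a + var_b))
```

A clearly negative statistic for (Valdi, baseline) over enough seeds would point toward the outcome described above; with only a handful of runs the statistic is weak evidence either way.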

Figures

Figures reproduced from arXiv: 2607.00917 by Christopher Lindenberg, Kashyap Chitta.

**Figure 1.** Valdi is a diffusion model that predicts action-sequence-conditioned rewards and values, enabling Model Predictive Control. We show its predictions for a good and bad action sequence in the CarRacing environment (Towers et al., 2024). In parallel, diffusion models (Ho et al., 2020) have emerged as a strong candidate for world modeling, accurately capturing complex distributions over long horizons (Agarwal…

**Figure 2.** Method. Valdi encodes states, conditions diffusion dynamics on an action sequence and noised latents from a target encoder, and uses the denoised latent trajectory for reward and value prediction. Here $z$, $z^{\tau}$, and $\hat{z}$ denote clean, noisy, and denoised latents, and a bar marks quantities encoded by an Exponential Moving Average (EMA) target encoder $E_{\bar{\theta}}$. We then apply a TD-MPC-style reward error $L_{\mathrm{rew}}$ and …
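
For readers unfamiliar with EMA target encoders, the update usually looks like the sketch below: after each training step, the target parameters move a small fraction of the way toward the online parameters. The dictionary layout, the name `ema_update`, and the decay value are assumptions for illustration; the decay Valdi uses is not given here.

```python
import numpy as np


def ema_update(target_params, online_params, decay=0.99):
    """Move each target parameter a small step toward its online counterpart."""
    return {
        name: decay * target_params[name] + (1.0 - decay) * online_params[name]
        for name in target_params
    }


# Toy parameters, purely illustrative.
online = {"w": np.ones((2, 2)), "b": np.zeros(2)}
target = {"w": np.zeros((2, 2)), "b": np.zeros(2)}
target = ema_update(target, online, decay=0.9)
```

Because the target changes slowly, the latents it produces give the dynamics model a more stable regression target than the online encoder would.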

**Figure 3.** Performance and multimodality. Across two runs, more inference diffusion steps do not improve control over our single-step default, but substantially increase the visual variety (LPIPS) among generated futures. Performance and Multimodality. We compare Valdi against a baseline that swaps the diffusion dynamics for a deterministic one-step MLP, similar to TD-MPC, with all other parameters identical. Each…
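
The figure reports variety among generated futures with LPIPS, which requires a pretrained perceptual network. As a lightweight stand-in (a swapped-in metric, not the one used in the paper), the sketch below computes the mean pairwise Euclidean distance between sampled futures; `mean_pairwise_distance` is a hypothetical helper.

```python
import numpy as np


def mean_pairwise_distance(samples):
    """Average Euclidean distance over all unordered pairs of samples."""
    flat = np.asarray(samples, dtype=float).reshape(len(samples), -1)
    n = len(flat)
    if n < 2:
        return 0.0
    diffs = flat[:, None, :] - flat[None, :, :]   # (n, n, d) pairwise differences
    dists = np.linalg.norm(diffs, axis=-1)        # (n, n) distance matrix
    upper = np.triu_indices(n, k=1)               # each unordered pair once
    return float(dists[upper].mean())
```

A value near zero means the sampler returns nearly the same future each time, as a deterministic model would; larger values indicate more spread among samples drawn from the same starting state.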

**Figure 4.** The returns that our system and our baseline obtain during training time (left). The…

**Figure 5.** Diverse trajectories generated from the same starting state with…

Original abstract

World models can enable Model Predictive Control (MPC), but this requires dynamics prediction that is both fast enough for online use and expressive enough to represent uncertain futures. Diffusion models offer a natural mechanism for modeling uncertain dynamics, yet their iterative inference procedure makes them difficult to use for low-latency latent planning. We bridge this gap with Value Diffusion World Models (Valdi), combining end-to-end online training for MPC with a latent diffusion dynamics model. In preliminary experiments on the CarRacing environment, we show that Valdi, using a single diffusion step at both training and inference, matches a deterministic MLP baseline. Our experiments expose a trade-off between predictive multimodality and control performance in this setup. Code is available at https://github.com/Kit115/ValueDiffusionWorldModels.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Valdi's single-step diffusion matches an MLP baseline in CarRacing, which is unsurprising and leaves the uncertainty-modeling claim untested.

The main takeaway is that Valdi trains a latent diffusion dynamics model end-to-end with value-based MPC and reports that a single diffusion step at train and test time reaches the same CarRacing performance as a plain deterministic MLP. That result is new in its specific combination of components, and the paper is honest about exposing a multimodality-versus-control trade-off.

The work does a few things cleanly. It releases code, frames the latency problem for diffusion-based world models in MPC, and shows that end-to-end training is feasible. The abstract positions the method as a practical bridge between expressive uncertain dynamics and online planning.

The soft spot is that the reported match to the MLP baseline undercuts the diffusion premise. A single diffusion step is essentially a conditional mean predictor under standard schedules, so parity with a deterministic network is expected rather than informative. The abstract gives no training curves, variance numbers, or comparison to multi-step diffusion, so we cannot tell whether the model ever uses its capacity for multimodality or whether the trade-off is quantified in a reproducible way. Without those details the central empirical claim stays preliminary.
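
The reasoning behind that expectation can be sketched as follows. It assumes a denoiser trained with squared error to predict the clean latent, noise that is independent of the clean latent given the conditioning, and a single step that starts from pure noise; these are assumptions, since the paper's exact parameterization is not stated here. The symbols $\alpha_\tau$, $\sigma_\tau$, $\epsilon$, and the conditioning $c$ are notation introduced for this sketch.

```latex
% Forward noising of the clean latent z at noise level \tau
z^{\tau} = \alpha_\tau z + \sigma_\tau \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)

% The squared-error-optimal denoiser is the posterior mean
\hat{z}^{*}(z^{\tau}, c)
  = \arg\min_{f} \; \mathbb{E}\,\lVert f(z^{\tau}, c) - z \rVert^{2}
  = \mathbb{E}[\, z \mid z^{\tau}, c \,]

% At the highest noise level the input carries no information about z
\alpha_\tau \to 0
  \;\Rightarrow\; z^{\tau} \perp z \mid c
  \;\Rightarrow\; \hat{z}^{*}(z^{\tau}, c) = \mathbb{E}[\, z \mid c \,]
```

Under these assumptions the one-step prediction targets the same conditional mean that a deterministic network trained with squared error would, which is why parity with the MLP is the expected outcome.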

This paper is for people already working on diffusion world models or low-latency MPC. A reader who wants to see whether diffusion can actually deliver usable uncertainty inside control loops will find the current evidence too thin. It is worth sending to peer review so the full methods, ablations, and statistical support can be checked; the idea is coherent enough to merit that step even if the present experiments need strengthening.

Referee Report

2 major / 1 minor

Summary. The paper introduces Value Diffusion World Models (Valdi), which combine end-to-end online training for MPC with a latent diffusion dynamics model to achieve both low-latency inference and expressive uncertain dynamics. The central claim, based on preliminary experiments in the CarRacing environment, is that Valdi using a single diffusion step at both training and inference matches a deterministic MLP baseline while exposing a trade-off between predictive multimodality and control performance.

Significance. If the result holds under more detailed validation, it would indicate that diffusion-based world models can be made compatible with real-time MPC via single-step inference. However, the reported parity with a deterministic baseline suggests the diffusion mechanism may not be contributing the intended multimodality or uncertainty modeling, which would limit the significance unless the method is shown to provide advantages in settings where uncertainty matters for control.

major comments (2)

[Abstract] The claim that Valdi 'bridges this gap' with diffusion is undercut by the single-step result: a one-step diffusion process at inference is equivalent to a standard conditional predictor (or a mean regressor under common noise schedules) and does not demonstrate the iterative denoising or multimodality that distinguishes diffusion models.
[Experiments] For the preliminary CarRacing results, no details are provided on training curves, statistical significance, exact architecture, how the multimodality-control trade-off was quantified, or whether the single-step model was compared against a multi-step diffusion variant. The support for the performance match therefore cannot be assessed, and the weakest assumption (that single-step latent diffusion suffices for effective MPC) remains untested.

minor comments (1)

[Abstract] The public code link is a strength for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive feedback on this preliminary work. We address each major comment below.

Referee: [Abstract] The claim that Valdi 'bridges this gap' with diffusion is undercut by the single-step result: a one-step diffusion process at inference is equivalent to a standard conditional predictor (or a mean regressor under common noise schedules) and does not demonstrate the iterative denoising or multimodality that distinguishes diffusion models.

Authors: We agree that single-step inference does not leverage the iterative denoising process that defines diffusion models and is functionally closer to a conditional predictor. The manuscript's contribution centers on an end-to-end MPC training framework that incorporates a latent diffusion dynamics model while meeting real-time constraints via single-step sampling. The reported result is the observed trade-off between predictive multimodality and control performance under this constraint. We will revise the abstract to remove the phrasing 'bridges this gap' and instead emphasize the single-step feasibility result together with the multimodality-control trade-off. revision: yes

Referee: [Experiments] For the preliminary CarRacing results, no details are provided on training curves, statistical significance, exact architecture, how the multimodality-control trade-off was quantified, or whether the single-step model was compared against a multi-step diffusion variant. The support for the performance match therefore cannot be assessed, and the weakest assumption (that single-step latent diffusion suffices for effective MPC) remains untested.

Authors: The current manuscript presents only high-level preliminary findings. We will expand the experiments section in revision to include training curves, exact architecture specifications, statistical significance across multiple random seeds, a precise description of how the multimodality-control trade-off was quantified (by varying the diffusion timestep and sampling procedure), and, where compute permits, a direct comparison against a multi-step diffusion variant. These additions will allow readers to assess the performance match and test the single-step assumption more rigorously. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical performance comparison with no derivation chain

The paper reports a preliminary experimental result: Valdi with one diffusion step matches an MLP baseline in CarRacing MPC. No equations, fitted parameters, uniqueness theorems, or self-citations are invoked to derive or predict this outcome; the match is presented as an observed fact from end-to-end training and evaluation. The noted trade-off between multimodality and control is likewise an experimental observation. Because the central claim is a direct empirical comparison rather than a reduction of any quantity to its inputs by construction, no circularity patterns apply.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are described in the abstract; the work appears to rely on standard diffusion and RL components.

pith-pipeline@v0.9.1-grok · 5650 in / 969 out tokens · 29270 ms · 2026-07-02T15:44:55.687705+00:00 · methodology

Reference graph

Works this paper leans on

27 extracted references · 7 canonical work pages · 6 internal anchors

[1] Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, Daniel Dworakowski, Jiaojiao Fan, Michele Fenzi, Francesco Ferroni, Sanja Fidler, Dieter Fox, Songwei Ge, Yunhao Ge, Jinwei Gu, Siddharth Gururani, Ethan He, Jiahui Huang, Jacob Samuel Huffman, Pooya Jannaty, Jingyi... Cosmos World Foundation Model Platform for Physical AI. arXiv, 2025

[2] Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos Storkey, Tim Pearce, and François Fleuret. Diffusion for world modeling: Visual details matter in Atari. NeurIPS, 2024

[3] Randall Balestriero and Yann LeCun. LeJEPA: Provable and scalable self-supervised learning without the heuristics. arXiv.org, 2025

[4] Li Chen, Penghao Wu, Kashyap Chitta, Bernhard Jaeger, Andreas Geiger, and Hongyang Li. End-to-end autonomous driving: Challenges and frontiers. PAMI, 2024

[5] Shenyuan Gao, Jiazhi Yang, Li Chen, Kashyap Chitta, Yihang Qiu, Andreas Geiger, Jun Zhang, and Hongyang Li. Vista: A generalizable driving world model with high fidelity and versatile controllability. In NeurIPS, 2024

[6] Carlos E Garcia, David M Prett, and Manfred Morari. Model predictive control: Theory and practice—a survey. Automatica, 1989

[7] David Ha and Jürgen Schmidhuber. World models. arXiv.org, 1803.10122, 2018

[8] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In ICML, 2018a

[9] Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, and Sergey Levine. Soft actor-critic algorithms and applications. arXiv.org, 1812.05905, 2018b

[10] Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. In ICML, pp. 2555–2565. PMLR, 2019

[11] Danijar Hafner, Timothy P. Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. In ICLR, 2020

[12] Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. arXiv.org, 2023

[13] Danijar Hafner, Wilson Yan, and Timothy Lillicrap. Training agents inside of scalable world models. arXiv.org, 2509.24527, 2025

[14] Nicklas Hansen, Xiaolong Wang, and Hao Su. Temporal difference learning for model predictive control. arXiv.org, 2203.04955, 2022

[15] Nicklas Hansen, Hao Su, and Xiaolong Wang. TD-MPC2: Scalable, robust world models for continuous control. arXiv.org, 2023

[16] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020

[17] Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. In NeurIPS, 2022

[18] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In ICLR, 2016

[19] Lucas Maes, Quentin Le Lidec, Damien Scieur, Yann LeCun, and Randall Balestriero. LeWorldModel: Stable end-to-end joint-embedding predictive architecture from pixels. arXiv.org, 2603.19312, 2026

[20] Reuven Y Rubinstein and Dirk P Kroese. The cross-entropy method: a unified approach to combinatorial optimization, monte-carlo simulation, and machine learning, 2004

[21] Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. arXiv.org, 2022

[22] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv.org, 2010.02502, 2020

[23] Richard S Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3(1): 9–44, 1988

[24] Mark Towers, Ariel Kwiatkowski, Jordan Terry, John U Balis, Gianluca De Cola, Tristan Deleu, Manuel Goulão, Andreas Kallinteris, Markus Krimmel, Arjun KG, et al. Gymnasium: A standard interface for reinforcement learning environments. arXiv.org, 2024

[25] Grady Williams, Andrew Aldrich, and Evangelos A Theodorou. Model predictive path integral control: From theory to parallel computation. Journal of Guidance, Control, and Dynamics, 2017

[26] Jiazhi Yang, Kashyap Chitta, Shenyuan Gao, Long Chen, Yuqian Shao, Xiaosong Jia, Hongyang Li, Andreas Geiger, Xiangyu Yue, and Li Chen. Resim: Reliable world simulation for autonomous driving. In NeurIPS, 2025

[27] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, pp. 586–595, 2018
