StressDream: Steering Video World Models for Robust Policy Evaluation and Improvement

Andrea Bajcsy; Apoorva Sharma; Edward Schmerling; Junwon Seo; Karen Leung; Marco Pavone; Ran Tian; Sushant Veer; Wenhao Ding

arXiv: 2606.00267 · v1 · pith:U22USM47new · submitted 2026-05-29 · 💻 cs.CV · cs.AI · cs.LG · cs.RO

Pith reviewed 2026-06-28 22:44 UTC · model grok-4.3

keywords: video world models, diffusion models, policy evaluation, robotics, autonomous driving, steering generation, stress testing

The pith

Optimizing initial noise in diffusion video models steers imaginations to high-impact events like task failures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces StressDream to steer video world models toward specified high-impact yet plausible futures by optimizing the starting noise in diffusion processes. This addresses the limitation that standard sampling from world models often misses critical outcomes, such as failures in driving or manipulation tasks. By combining a vision-language model for semantic guidance with a plausibility constraint, the method generates targeted imaginations at inference time without retraining. A sympathetic reader would care because this could allow more reliable testing and improvement of robot policies by revealing actions that have undesirable but realistic possible outcomes.

Core claim

StressDream optimizes the high-dimensional initial noise of diffusion-based video world models using a semantic objective from a vision-language model and a plausibility objective to produce imaginations that match text-specified high-impact events while remaining in-distribution, thereby enabling robust policy evaluation by identifying risky actions whose plausible futures include failures.

What carries the argument

The StressDream optimization procedure, which adjusts diffusion initial noise guided by VLM semantic scores and a plausibility regularizer to steer generated video sequences toward target outcomes.
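
As a rough illustration of the procedure described above, the toy sketch below runs gradient ascent on a noise vector under two terms: a "semantic" score and a penalty that keeps the noise norm near sqrt(d). Everything in it is an assumption made for illustration. The linear score stands in for VLM guidance, the norm penalty stands in for the plausibility objective, and `steer_noise`, `lam`, and `lr` are hypothetical names that do not come from the paper.

```python
import math
import random

def steer_noise(d=256, steps=200, lr=0.05, lam=1.0, seed=0):
    """Toy gradient ascent on z: linear score minus a norm-based penalty."""
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in range(d)]
    # Hypothetical "target event" direction (stand-in for a VLM gradient signal).
    w = [rng.gauss(0.0, 1.0) for _ in range(d)]
    w_norm = math.sqrt(sum(x * x for x in w))
    w = [x / w_norm for x in w]
    target_norm = math.sqrt(d)  # Gaussian noise in d dims has norm near sqrt(d)
    for _ in range(steps):
        norm = math.sqrt(sum(x * x for x in z))
        # Objective: J(z) = <w, z> - lam * (||z|| - sqrt(d))^2
        # Gradient:  w - 2 * lam * (||z|| - sqrt(d)) * z / ||z||
        coeff = 2.0 * lam * (norm - target_norm) / norm
        z = [zi + lr * (wi - coeff * zi) for zi, wi in zip(z, w)]
    score = sum(zi * wi for zi, wi in zip(z, w))
    norm = math.sqrt(sum(x * x for x in z))
    return score, norm

score_pen, norm_pen = steer_noise(lam=1.0)    # with the toy plausibility penalty
score_free, norm_free = steer_noise(lam=0.0)  # score-only ascent
print(f"penalized: score={score_pen:.2f}, norm={norm_pen:.2f}")
print(f"score-only: score={score_free:.2f}, norm={norm_free:.2f}")
```

In this toy, the score-only run lets the norm drift away from sqrt(d), while the penalized run stays close to it. Whether the real plausibility objective behaves similarly is exactly what the paper's experiments would need to show.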

If this is right

Policy evaluation can identify actions that lead to undesirable outcomes in some plausible futures without drawing many samples.
Policy improvement can prioritize actions that avoid futures containing specified failures.
Video world models can be used for targeted stress-testing of autonomous systems in driving and manipulation domains.
Steering happens at inference time without modifying the underlying world model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar steering techniques might apply to other generative models beyond video, such as for text or audio generation in safety-critical applications.
The approach could be extended to optimize for multiple conflicting objectives simultaneously in policy testing.
Real-world validation would involve deploying the evaluated policies on physical robots to check if the identified risks correlate with actual failures.

Load-bearing premise

That gradients from the vision-language model can effectively guide high-dimensional noise optimization to produce nuanced scene-specific events without causing the generated videos to become implausible or out-of-distribution.

What would settle it

Generate steered videos for a specified failure event and check if independent human or VLM raters confirm that the event occurs in the video at rates significantly above unsteered samples, while also confirming visual plausibility.
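
One way to make "significantly above unsteered samples" concrete is a one-sided two-proportion z-test on rater-confirmed event counts. The sketch below is a minimal version under stated assumptions: the counts are hypothetical, the function name `two_proportion_z` is invented here, and the paper may use a different statistical check.

```python
import math

def two_proportion_z(hits_a, n_a, hits_b, n_b):
    """One-sided z-test for rate_a > rate_b using a pooled standard error."""
    p_a = hits_a / n_a
    p_b = hits_b / n_b
    p_pool = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1.0 - p_pool) * (1.0 / n_a + 1.0 / n_b))
    if se == 0.0:
        raise ValueError("pooled rate is 0 or 1; the test is undefined")
    z = (p_a - p_b) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))  # upper-tail probability
    return z, p_value

# Hypothetical counts: raters confirm the event in 30/50 steered videos
# and in 10/50 unsteered videos.
z_stat, p_val = two_proportion_z(30, 50, 10, 50)
print(f"z={z_stat:.3f}, one-sided p={p_val:.2e}")
```

A separate plausibility rating would still be needed, since a high event rate alone does not show that the steered videos remain in-distribution.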

Figures

Figures reproduced from arXiv: 2606.00267 by Andrea Bajcsy, Apoorva Sharma, Edward Schmerling, Junwon Seo, Karen Leung, Marco Pavone, Ran Tian, Sushant Veer, Wenhao Ding.

**Figure 1.** Overview of STRESSDREAM. (Top): The initial noise of diffusion-based WMs is optimized to steer imaginations toward a target event specified by an inference-time prompt. While nominal imaginations may miss high-impact outcomes of robot actions, such as spilling, STRESSDREAM uses VLM guidance to steer imaginations toward such outcomes while preventing the high-dimensional noise from drifting into OOD, where …

**Figure 2.** Results: Naughty Dubins Car. (a): TPR and TNR for detecting possible failures in WM imaginations. (b): Overlaid nominal and steered imaginations. STRESSDREAM steers WM imaginations, detecting plausible failures that nominal imagination misses. See Appendix C & Website for more results and details. Experiment Setup. We aim to detect action sequences that have a possibility of entering the failure set u…
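
The TPR and TNR reported in this caption are standard confusion-matrix rates. The sketch below shows how they are computed from per-action-sequence labels; the label lists are hypothetical and the function name `tpr_tnr` is an assumption, not the paper's evaluation code.

```python
def tpr_tnr(predicted, actual):
    """True-positive and true-negative rates for boolean failure labels."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    tn = sum(1 for p, a in zip(predicted, actual) if not p and not a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    tnr = tn / (tn + fp) if (tn + fp) else float("nan")
    return tpr, tnr

# Hypothetical labels: True means "failure is possible for this action sequence".
predicted = [True, True, False, False, True, False]
actual = [True, False, False, False, True, True]
tpr, tnr = tpr_tnr(predicted, actual)
print(f"TPR={tpr:.3f}, TNR={tnr:.3f}")
```

Reporting both rates matters here: a detector that flags every action as risky has perfect TPR but zero TNR.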

**Figure 3.** Nominal vs. Steered Imaginations. Top texts show the inference-time prompts describing target outcomes. STRESSDREAM steers WM imaginations toward specified high-impact outcomes that nominal generation misses. Imaginations are grounded in plausible outcomes: when target outcomes are not supported by the WM distribution, e.g., spilling sticky candies or from a closed bag, STRESSDREAM does not imagine them. …

**Figure 4.** Steering Driving Video World Models. STRESSDREAM steers WM imaginations toward the target events more effectively than random sampling (Best-of-N), while C_pla helps preserve video quality. Evaluation Setup. We generate imaginations from an initial observation o0 and action sequence a, and evaluate whether the generated video matches the inference-time text specification l of the target event. For drivin…

**Figure 5.** Steering Ctrl-World: Detecting task failures in WM imaginations. STRESSDREAM robustly detects high-impact outcomes in imaginations.

**Figure 6.** Steering to collision: STRESSDREAM can induce target events only when they are plausible. Steering is grounded in plausible outcomes. We study whether steering remains grounded in plausible outcomes supported by the WM’s distribution, rather than hallucinating implausible futures. To test this, we steer generations toward collision events using both the collision-fine-tuned Vista model and the base Vista …

**Figure 7.** Policy Improvement via STRESSDREAM. Fine-tuning π0.5 with steered WM imaginations favors robust actions that remain successful under worst-case plausible outcomes, whereas nominal fine-tuning can propose actions where failures are plausible under the outcome distribution. Videos available at the website. spilling for the open sticky candy bag or closed coffee bag, but does so for the open coffee bag. Since…

**Figure 8.** π0.5 Rollout Results. STRESSDREAM improves policies by promoting robust actions.

**Figure 9.** Probability density vs. mass in a high-dimensional Gaussian distribution. Noise sampled from the typical set leads to plausible generations, whereas noise outside the typical set, despite having high probability density, produces implausible generations, e.g., humans becoming blurry or transforming into vehicles. A common way to encourage typicality is to regularize the norm of the optimized noise, since Gau…
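
The density-versus-mass distinction in this caption can be checked numerically. The sketch below samples standard Gaussian vectors and shows that their norms cluster near sqrt(d), even though the origin has the highest density. The dimension `d = 1000` and the helper names are illustrative assumptions, not values from the paper.

```python
import math
import random

def norm_stats(d, n_samples=200, seed=0):
    """Mean, min, and max Euclidean norm of standard Gaussian samples in d dims."""
    rng = random.Random(seed)
    norms = []
    for _ in range(n_samples):
        norms.append(math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d))))
    return sum(norms) / n_samples, min(norms), max(norms)

def log_density(z):
    """Log-density of the standard Gaussian N(0, I) at point z."""
    d = len(z)
    return -0.5 * sum(x * x for x in z) - 0.5 * d * math.log(2.0 * math.pi)

d = 1000
mean_norm, min_norm, max_norm = norm_stats(d)
rng = random.Random(1)
typical_sample = [rng.gauss(0.0, 1.0) for _ in range(d)]
origin = [0.0] * d

print(f"sqrt(d)={math.sqrt(d):.2f}, mean norm={mean_norm:.2f}, "
      f"range=[{min_norm:.2f}, {max_norm:.2f}]")
print(f"log-density at origin={log_density(origin):.1f}, "
      f"at a typical sample={log_density(typical_sample):.1f}")
```

The origin is far more "likely" by density, yet no sample lands anywhere near it. This is why maximizing density alone is a poor proxy for staying in-distribution.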

**Figure 10.** Examples of atypical Gaussian noise. Left: typical noise sampled from a Gaussian prior. Others: noises perturbed to violate norm concentration, isotropy, and spectral whiteness, respectively.
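
The three properties named in this caption can be probed with simple statistics. The sketch below uses crude proxies chosen for illustration, not the paper's definitions: the norm ratio `||z|| / sqrt(d)` for norm concentration, the sample mean as a rough stand-in for an isotropy violation, and the lag-1 autocorrelation for spectral whiteness. The perturbations are hypothetical examples.

```python
import math
import random

def diagnostics(z):
    """Crude typicality proxies for a 1-D noise vector."""
    d = len(z)
    norm_ratio = math.sqrt(sum(x * x for x in z)) / math.sqrt(d)
    mean = sum(z) / d
    var = sum((x - mean) ** 2 for x in z) / d
    cov1 = sum((z[i] - mean) * (z[i + 1] - mean) for i in range(d - 1)) / (d - 1)
    return {"norm_ratio": norm_ratio, "mean": mean, "lag1_autocorr": cov1 / var}

rng = random.Random(0)
d = 4096
typical = [rng.gauss(0.0, 1.0) for _ in range(d)]
shrunk = [0.5 * x for x in typical]    # violates norm concentration
shifted = [x + 0.5 for x in typical]   # constant offset: a preferred direction
# Averaging neighbours correlates adjacent entries, so the spectrum is no longer flat.
smoothed = [(typical[i] + typical[(i + 1) % d]) / math.sqrt(2.0) for i in range(d)]

for name, z in [("typical", typical), ("shrunk", shrunk),
                ("shifted", shifted), ("smoothed", smoothed)]:
    print(name, {k: round(v, 3) for k, v in diagnostics(z).items()})
```

Note that the smoothed vector keeps roughly unit variance and a typical norm, so a norm check alone would not flag it.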

**Figure 11.** VLM score for driving scenes. The three videos are generated from the same scene but exhibit different temporal relationships to the truck in the ego lane. Both the Qwen-VL-based model and X-CLIP correctly distinguish these temporal relationships. VLMs also provide semantically meaningful gradient signals: task-specific reward models trained on limited datasets may fail to cover the large space of possi…

**Figure 12.** Robotic Manipulation: VLM scores for video understanding. The plot shows the token probability of answering “Yes” to a given prompt. The VLM takes videos from three camera views simultaneously and assigns high scores when the specified events occur, correctly detecting task-relevant moments. video. We find that using all three camera views simultaneously is essential for reliable scoring: for the same scen…
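
A score defined as "the token probability of answering Yes" can be read off a model's next-token logits with a softmax. The sketch below shows that computation on a hypothetical three-token logit dictionary. The token set, the logit values, and the function name `yes_probability` are assumptions; a real VLM would supply logits over its full vocabulary.

```python
import math

def yes_probability(logits, yes_token="Yes"):
    """Softmax probability of `yes_token` given a dict of next-token logits."""
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    return exps[yes_token] / sum(exps.values())

# Hypothetical logits for the first answer token of a yes/no question.
logits = {"Yes": 2.0, "No": 0.5, "Maybe": -1.0}
p_yes = yes_probability(logits)
print(f"P(Yes)={p_yes:.4f}")
```

Because the softmax is differentiable in the logits, a score of this form can in principle pass gradients back toward the video input, which is what the semantic objective relies on.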

**Figure 13.** Average VLM scores across optimization steps in Vista. With the gradient approximation, STRESSDREAM successfully increases the criterion values, demonstrating the effectiveness of the proposed approximation. Efficacy of Gradient Approximation for Noise Optimization. The noise-gradient approximation enables efficient gradient-based steering without backpropagating through the iterative denoising proc…

**Figure 14.** Noise optimization with approximate vs. exact gradients. For Ctrl-World [12], where the full noise gradient can be computed on an H100 GPU using gradient checkpointing [119], we compare noise optimization with and without the gradient approximation.

**Figure 15.** Average minimum safety scores over imagined trajectories. Results: Pessimistic Steering.

**Figure 16.** Ablation: Optimistic Steering of Naughty Dubins Car. While Sec. 5 focuses on pessimistic steering to detect potential entries into the failure set, we also ablate the opposite direction by performing optimistic steering. Specifically, we change the sign of the steering objective so that optimization seeks imaginations with higher safety scores.

**Figure 17.** Additional Results: Nominal Generation vs. Steered Generation. Given the same observation history and action sequence, STRESSDREAM steers the video world model toward plausible task-failure events when failure is possible, whereas nominal generation often misses these failure modes. Initial Frame Nominal imagination predicts the left vehicle yielding, whereas pessimistic imagination predicts it merging wi…

**Figure 18.** Robust Policy Evaluation with Video World Models. Steering the video world model enables robust detection of safety-critical or task-failure events in imagined futures that nominal generation may miss, thereby enabling the selection of robust action sequences that remain safe even under worst-case imaginations. Robust Policy Evaluation with Pessimistic Imaginations.
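
Selecting "action sequences that remain safe even under worst-case imaginations" amounts to a max-min rule: score each candidate by its worst imagined outcome, then pick the best of those. The sketch below illustrates this with hypothetical action names and safety scores. `select_robust_action` is an invented helper and not the paper's selection procedure.

```python
def select_robust_action(safety_scores):
    """Pick the action whose worst imagined safety score is highest (max-min)."""
    worst = {action: min(scores) for action, scores in safety_scores.items()}
    best = max(worst, key=worst.get)
    return best, worst

# Hypothetical safety scores per imagined future (higher is safer).
nominal_only = {"fast_pour": [0.90, 0.95], "slow_pour": [0.80, 0.85]}
# The same candidates after appending one steered (pessimistic) imagination each.
with_steered = {"fast_pour": [0.90, 0.95, 0.10], "slow_pour": [0.80, 0.85, 0.70]}

choice_nominal, _ = select_robust_action(nominal_only)
choice_robust, worst_case = select_robust_action(with_steered)
print(choice_nominal, choice_robust, worst_case)
```

In this made-up example the nominal samples favor the fast pour, and the choice flips once a steered imagination exposes a plausible failure.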

**Figure 19.** Quantitative: Driving WM. Detailed Results: Driving Video World Model.

**Figure 20.** Video generation quality of driving world-model generations evaluated with WorldLens [127]. Specifically, we evaluate Subject Consistency, Depth Discrepancy, and Temporal Consistency. Subject Consistency uses DINO features [128] to measure whether dynamic subjects maintain consistent texture, shape, and structure over time. Depth Discrepancy measures the temporal stability of depth representations inf…

**Figure 21.** Robometer scores for imagined evaluation trajectories. Left: Robometer task-progress. Right: Robometer task-success. STRESSDREAM effectively generates more pessimistic imaginations than the baselines, yielding lower task-progress and task-success scores for the same action inputs. [Plot residue: True Positive Rate (0% to 100%) per task: Block Stack, Utensil Pick, Knife Put, Coffee Bean, Coffee Bag, Candy Ba…]

**Figure 23.** Target-alignment score on long-tailed events in driving video world models. To evaluate this, we sample 5 evaluation examples from long-tailed driving events in the evaluation set, including pedestrian crossing, traffic-light change, and collision. As a baseline, we generate 40 random samples and report the best result among them, while STRESSDREAM performs 20 optimization steps. We evaluate target alig…
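
The comparison in this caption, best of 40 random samples versus 20 optimization steps, can be mimicked on a toy problem to show why directed search helps with rare targets. In the sketch below the "alignment score" is a dot product with a hypothetical target direction, and the steered noise is re-projected onto the sphere of radius sqrt(d) after each step. Both choices are illustrative assumptions and do not reproduce the paper's objectives or results.

```python
import math
import random

def unit_vector(d, rng):
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def score(z, w):
    """Toy alignment score: projection of the noise onto the target direction."""
    return sum(a * b for a, b in zip(z, w))

def best_of_n(w, d, n, rng):
    """Baseline: draw n random noise vectors and keep the best score."""
    return max(score([rng.gauss(0.0, 1.0) for _ in range(d)], w) for _ in range(n))

def steer(w, d, steps, lr, rng):
    """Gradient ascent on the score, re-projected to norm sqrt(d) each step."""
    z = [rng.gauss(0.0, 1.0) for _ in range(d)]
    radius = math.sqrt(d)
    for _ in range(steps):
        z = [zi + lr * wi for zi, wi in zip(z, w)]  # the score gradient is w
        n = math.sqrt(sum(x * x for x in z))
        z = [x * radius / n for x in z]
    return score(z, w)

rng = random.Random(0)
d = 256
w = unit_vector(d, rng)
baseline = best_of_n(w, d, n=40, rng=rng)
steered = steer(w, d, steps=20, lr=1.0, rng=rng)
print(f"best-of-40: {baseline:.2f}, steered (20 steps): {steered:.2f}")
```

Random draws score like a standard normal along the target direction, so their maximum grows very slowly with the number of samples. The steered vector can move much further while keeping a typical norm.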

**Figure 24.** Imaginations near collision from the base model and the collision-fine-tuned world model.

**Figure 25.** Qualitative example: steering toward reduced distance to the lead vehicle. STRESSDREAM produces a plausible outcome, whereas without C_pla, it leads to implausible hallucinations. Similarly,

**Figure 26.** Task-wise success rates of fine-tuned π0.5-DROID policies. Robust fine-tuning with pessimistic video world-model imaginations improves policy success rates across tasks compared to nominal fine-tuning.

Original abstract

Video world models (WMs) have shown promise for policy evaluation and improvement by imagining realistic future observations conditioned on ego-robot actions. While WMs can model distributions over futures, policy evaluation and improvement typically rely on nominal imaginations, which can miss high-impact outcomes of robot actions unless prohibitively many samples are drawn. To enable robust policy evaluation and improvement over WM imaginations, we propose StressDream, which steers imaginations toward high-impact yet plausible outcomes specified at inference time by optimizing the initial noise of diffusion-based WMs. However, optimizing high-dimensional noise is challenging: the optimization must reason about nuanced, scene-dependent target events in generated videos while avoiding out-of-distribution (OOD) noise that yields implausible imaginations. We address this with two complementary objectives: a semantic objective with a Vision-Language Model that provides informative gradients by reasoning about the generated video, and a plausibility objective that prevents the optimized noise from drifting OOD. With state-of-the-art video world models for autonomous driving and robotic manipulation, we show that StressDream effectively steers imaginations toward high-impact yet plausible outcomes specified by text at inference time, such as task failures, enabling robust policy evaluation and improvement by identifying actions whose plausible futures include undesirable outcomes. Video results are available at https://junwon.me/StressDream/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

StressDream shows a concrete way to steer diffusion video world models toward text-specified high-impact futures via initial-noise optimization, but the OOD risk in that high-dimensional search needs tighter validation.

The main takeaway is that this paper gives a practical method for directing video world models at inference time to generate specific undesirable outcomes, like task failures, without retraining or drawing huge numbers of samples. It optimizes the starting noise of a diffusion-based model using a VLM-driven semantic loss that reasons over the output video plus a separate plausibility term meant to keep the result in-distribution.

What is new is the joint use of those two objectives for steering in the noise space. Earlier work on world models for robotics and driving mostly samples from the model or conditions on actions; this targets particular events directly through gradient-based search on the noise. The abstract reports that it works on current SOTA models for autonomous driving and manipulation, letting users identify actions whose plausible futures include bad events.

The paper does a solid job framing why nominal sampling falls short for safety evaluation and why inference-time control matters in deployed systems. The idea is straightforward and addresses a real bottleneck.

The soft spot is the one the stress-test raises. High-dimensional noise optimization can find points that satisfy the VLM objective through spurious correlations while landing far from the training manifold. The plausibility regularizer is described as a reconstruction or feature-matching penalty, but it is only a soft term with no certified bound. Without ablations on the balance between the two losses or metrics showing how often the optimized noise stays close to the data distribution, it is hard to know whether the generated futures are reliable enough for policy evaluation. The abstract claims success but supplies no numbers here.

This is aimed at robotics and AV teams already running video world models for policy testing. Readers working on inference-time control of generative models or robust evaluation pipelines would find the objectives and setup worth trying.

The thinking is clear and engages honestly with the limitations of current sampling practices. It deserves a serious referee to check the full experiments and the OOD behavior.

I would send it to peer review.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes StressDream, a technique for steering diffusion-based video world models at inference time by optimizing the initial noise vector. It combines a VLM-driven semantic objective (to target high-impact events such as task failures specified by text) with a plausibility objective (to keep the optimized noise in-distribution). Experiments on state-of-the-art WMs for autonomous driving and robotic manipulation are claimed to show that the steered imaginations enable more robust policy evaluation and improvement by surfacing undesirable yet plausible futures.

Significance. If the optimization reliably produces in-distribution samples that realize the specified high-impact events, the approach would meaningfully extend the utility of video WMs beyond nominal rollouts, allowing text-specified stress testing without exhaustive sampling. The combination of VLM semantic guidance and a plausibility regularizer is a reasonable design choice, but the absence of certified manifold proximity or strong empirical controls on OOD leakage limits the strength of the policy-evaluation guarantee.

major comments (2)

Section 3.2, Eq. (4): The joint semantic + plausibility objective is presented as sufficient to keep optimized noise inside the WM training distribution, yet the plausibility term is only a soft reconstruction/feature-matching penalty with no Lipschitz bound, distance-to-manifold certificate, or rejection sampling step. Gradient descent in the high-dimensional noise space can therefore satisfy the VLM captioning objective via spurious correlations while producing OOD z, directly undermining the claim that the resulting futures are valid WM samples for policy evaluation.
The central policy-evaluation claim (identifying actions whose plausible futures include undesirable outcomes) rests on the steered videos being both high-impact and in-distribution. No quantitative results, ablation tables, or distribution-distance metrics (e.g., FID, reconstruction error histograms, or classifier-based OOD scores) are referenced that would demonstrate the plausibility term actually prevents exploitation of VLM gradients; without such evidence the load-bearing assumption remains unverified.

minor comments (2)

The abstract states that video results are available at an external link; the main text should include at least one quantitative table summarizing success rates or failure-detection metrics across the driving and manipulation domains.
Notation for the two loss terms and the noise variable z should be introduced once in Section 3 and used consistently thereafter to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback on the optimization guarantees and empirical validation of the plausibility term. We address each major comment below, providing clarifications from the manuscript while agreeing to strengthen the presentation where evidence is indirect.

Referee: Section 3.2, Eq. (4): The joint semantic + plausibility objective is presented as sufficient to keep optimized noise inside the WM training distribution, yet the plausibility term is only a soft reconstruction/feature-matching penalty with no Lipschitz bound, distance-to-manifold certificate, or rejection sampling step. Gradient descent in the high-dimensional noise space can therefore satisfy the VLM captioning objective via spurious correlations while producing OOD z, directly undermining the claim that the resulting futures are valid WM samples for policy evaluation.

Authors: We agree that Eq. (4) relies on a soft plausibility penalty without formal certificates such as Lipschitz continuity or manifold distance bounds. The design choice was motivated by the need for a differentiable regularizer compatible with gradient-based optimization of z; stronger constraints like rejection sampling would break differentiability. Section 3.2 explicitly frames the term as a practical regularizer rather than a certified guarantee. We will revise the text to more clearly state this limitation and avoid implying formal in-distribution guarantees. revision: yes

Referee: The central policy-evaluation claim (identifying actions whose plausible futures include undesirable outcomes) rests on the steered videos being both high-impact and in-distribution. No quantitative results, ablation tables, or distribution-distance metrics (e.g., FID, reconstruction error histograms, or classifier-based OOD scores) are referenced that would demonstrate the plausibility term actually prevents exploitation of VLM gradients; without such evidence the load-bearing assumption remains unverified.

Authors: The manuscript presents qualitative comparisons and ablations in Section 4 showing that removing the plausibility term leads to visibly implausible generations while the combined objective produces coherent high-impact events. However, we acknowledge the absence of explicit quantitative distribution metrics such as FID or OOD classifier scores in the current version. We will add these metrics (computed on held-out validation sets) and corresponding ablation tables in the revised manuscript to directly quantify the effect of the plausibility term on OOD leakage. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

The paper introduces StressDream as an inference-time optimization technique over diffusion noise using a VLM semantic loss and a plausibility regularizer. No equations, derivations, or predictions are shown that reduce by construction to fitted inputs, self-definitions, or self-citation chains. The method description relies on external pretrained components (diffusion WMs, VLMs) without renaming known results or importing uniqueness theorems from the authors' prior work. The central contribution is an empirical steering procedure whose validity is assessed via downstream policy evaluation experiments rather than any self-referential mathematical reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated or derivable from the provided text.

pith-pipeline@v0.9.1-grok · 5792 in / 1088 out tokens · 17741 ms · 2026-06-28T22:44:10.405385+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

NoiseTilt: Noise-Tilted Reverse Kernels for Diffusion Reward Alignment
cs.LG 2026-06 unverdicted novelty 6.0

NTRK uses a whitening operator to tilt the noise term in diffusion reverse kernels for reward guidance, outperforming baselines with 20x fewer steps on aesthetic tasks.

Reference graph

Works this paper leans on

134 extracted references · 1 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1]

Agarwal, A

N. Agarwal, A. Ali, M. Bala, Y . Balaji, E. Barker, T. Cai, P. Chattopadhyay, Y . Chen, Y . Cui, Y . Ding, et al. Cosmos world foundation model platform for physical ai.arXiv preprint arXiv:2501.03575, 2025

Pith/arXiv arXiv 2025
[2]

T. Wan, A. Wang, B. Ai, B. Wen, C. Mao, C.-W. Xie, D. Chen, F. Yu, H. Zhao, J. Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

Pith/arXiv arXiv 2025
[3]

Z. Mei, T. Yin, O. Shorinwa, A. Badithela, Z. Zheng, J. Bruno, M. Bland, L. Zha, A. Hancock, J. F. Fisac, et al. Video generation models in robotics-applications, research challenges, future directions.arXiv preprint arXiv:2601.07823, 2026

arXiv 2026
[4]

J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models.Advances in Neural Information Processing Systems (NeurIPS), 33:6840–6851, 2020

2020
[5]

Karras, M

T. Karras, M. Aittala, T. Aila, and S. Laine. Elucidating the design space of diffusion- based generative models.Advances in Neural Information Processing Systems (NeurIPS), 35:26565–26577, 2022

2022
[6]

Lipman, R

Y . Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling. InInternational Conference on Learning Representations (ICLR), 2023. URL https://openreview.net/forum?id=PqvMRDCJT9t

2023
[7]

Alonso, A

E. Alonso, A. Jelley, V . Micheli, A. Kanervisto, A. Storkey, T. Pearce, and F. Fleuret. Dif- fusion for world modeling: Visual details matter in atari.Advances in Neural Information Processing Systems (NeurIPS), 37:58757–58791, 2024

2024
[8]

B. Chen, D. Mart ´ı Mons´o, Y . Du, M. Simchowitz, R. Tedrake, and V . Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion.Advances in Neural Informa- tion Processing Systems (NeurIPS), 37:24081–24125, 2024

2024
[9]

S. Gao, S. Zhou, Y . Du, J. Zhang, and C. Gan. Adaworld: Learning adaptable world models with latent actions. InInternational Conference on Machine Learning (ICML), pages 18744– 18771, 2025

2025
[10]

Kerssies, G

T. Kerssies, G. Berton, J. He, Q. Yu, W. Ma, D. de Geus, G. Dubbelman, and L.-C. Chen. A frame is worth one token: Efficient generative world modeling with delta tokens.arXiv preprint arXiv:2604.04913, 2026

Pith/arXiv arXiv 2026
[11]

S. Gao, J. Yang, L. Chen, K. Chitta, Y . Qiu, A. Geiger, J. Zhang, and H. Li. Vista: A generalizable driving world model with high fidelity and versatile controllability.Advances in Neural Information Processing Systems (NeurIPS), 37:91560–91596, 2024

2024
[12]

Y . Guo, L. X. Shi, J. Chen, and C. Finn. Ctrl-world: A controllable generative world model for robot manipulation. InInternational Conference on Learning Representations (ICLR),
[13]

URLhttps://openreview.net/forum?id=748bHL2BAv. 9
[14]

Y . Guo, T. Lee, L. X. Shi, J. Chen, P. Liang, and C. Finn. Vlaw: Iterative co-improvement of vision-language-action policy and world model.arXiv preprint arXiv:2602.12063, 2026

arXiv 2026
[15]

Zhang, Z

K. Zhang, Z. Tang, X. Hu, X. Pan, X. Guo, Y . Liu, J. Huang, L. Yuan, Q. Zhang, X.-X. Long, et al. Epona: Autoregressive diffusion world model for autonomous driving. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 27220–27230, 2025

2025
[16]

Quevedo, A

J. Quevedo, A. K. Sharma, Y . Sun, V . Suryavanshi, P. Liang, and S. Yang. Worldgym: World model as an environment for policy evaluation.arXiv preprint arXiv:2506.00613, 2025

arXiv 2025
[17]

G. R. Team, K. Choromanski, C. Devin, Y . Du, D. Dwibedi, R. Gao, A. Jindal, T. Kipf, S. Kirmani, I. Leal, et al. Evaluating gemini robotics policies in a veo world simulator.arXiv preprint arXiv:2512.10675, 2025

arXiv 2025
[18]

A. K. Sharma, Y . Sun, N. Lu, Y . Zhang, J. Liu, and S. Yang. World-gymnast: Training robots with reinforcement learning in a world model.arXiv preprint arXiv:2602.02454, 2026

arXiv 2026
[19]

Samuel, R

D. Samuel, R. Ben-Ari, N. Darshan, H. Maron, and G. Chechik. Norm-guided latent space exploration for text-to-image generation.Advances in Neural Information Processing Systems (NeurIPS), 36:57863–57875, 2023

2023
[20]

Z. Tang, J. Peng, J. Tang, M. Hong, F. Wang, and T.-H. Chang. Inference-time alignment of diffusion models with direct noise optimization. InInternational Conference on Machine Learning (ICML), 2025. URLhttps://openreview.net/forum?id=JpbqiD7n9r

2025
[21]

Intelligence, K

P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, et al.π 0.5: A vision-language-action model with open-world generaliza- tion.arXiv preprint arXiv:2504.16054, 2025

Pith/arXiv arXiv 2025
[22]

Y . Li, Y . Zhu, J. Wen, C. Shen, and Y . Xu. Worldeval: World model as real-world robot policies evaluator.arXiv preprint arXiv:2505.19017, 2025

arXiv 2025
[23]

M. J. Kim, Y . Gao, T.-Y . Lin, Y .-C. Lin, Y . Ge, G. Lam, P. Liang, S. Song, M.-Y . Liu, C. Finn, and J. Gu. Cosmos policy: Fine-tuning video models for visuomotor control and planning. InInternational Conference on Learning Representations (ICLR), 2026. URLhttps:// openreview.net/forum?id=wPEIStHxYH

2026
[24]

T. Yin, Z. Mei, Z. Zheng, M. Yamane, D. Wang, J. Sceats, S. M. Bateman, L. Zha, A. Ba- dithela, O. Shorinwa, et al. Playworld: Learning robot world models from autonomous play. arXiv preprint arXiv:2603.09030, 2026

Pith/arXiv arXiv 2026
[25]

J. Song, C. Meng, and S. Ermon. Denoising diffusion implicit models. InInternational Conference on Learning Representations (ICLR), 2021. URLhttps://openreview.net/ forum?id=St1giarCHLP

2021
[26]

Y . Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole. Score-based generative modeling through stochastic differential equations. InInternational Conference on Learning Representations (ICLR), 2021. URLhttps://openreview.net/forum?id= PxTIG12RRHS

2021
[27]

G. Zhou, H. Pan, Y . LeCun, and L. Pinto. DINO-WM: World models on pre-trained vi- sual features enable zero-shot planning. InInternational Conference on Machine Learning (ICML), 2025. URLhttps://openreview.net/forum?id=D5RNACOZEI

2025
[28]

Hansen, H

N. Hansen, H. Su, and X. Wang. Td-mpc2: Scalable, robust world models for continuous control. InInternational Conference on Learning Representations (ICLR), 2024. 10

2024
[29]

J. Moos, K. Hansel, H. Abdulsamad, S. Stark, D. Clever, and J. Peters. Robust reinforcement learning: A review of foundations and recent advances.Machine Learning and Knowledge Extraction, 4(1):276–315, 2022

2022
[30]

Akella, A

P. Akella, A. Dixit, M. Ahmadi, L. Lindemann, M. P. Chapman, G. J. Pappas, A. D. Ames, and J. W. Burdick. Risk-aware robotics: Tail risk measures in planning, control, and verification [focus on education].IEEE Control Systems, 45(4):46–78, 2025

2025
[31]

Karunratanakul, K

K. Karunratanakul, K. Preechakul, E. Aksan, T. Beeler, S. Suwajanakorn, and S. Tang. Opti- mizing diffusion noise can serve as universal motion priors. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1334–1345, 2024

2024
[32]

L. Eyring, S. Karthik, K. Roth, A. Dosovitskiy, and Z. Akata. Reno: Enhancing one-step text-to-image models through reward-based noise optimization. Advances in Neural Information Processing Systems (NeurIPS), 37:125487–125519, 2024

[33]

J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, and J. Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023

[34]

T. M. Cover. Elements of information theory. John Wiley & Sons, 1999

[35]

M. Betancourt. A conceptual introduction to hamiltonian monte carlo. arXiv preprint arXiv:1701.02434, 2017

[36]

E. Nalisnick, A. Matsukawa, Y. W. Teh, and B. Lakshminarayanan. Detecting out-of-distribution inputs to deep generative models using typicality. arXiv preprint arXiv:1906.02994, 2019

[37]

D. Samuel, R. Ben-Ari, S. Raviv, N. Darshan, and G. Chechik. Generating images of rare concepts using pre-trained diffusion models. In Proceedings of the AAAI Conference on Artificial Intelligence, 2024

[38]

A. Harrington, A. S. Koepke, S. Karthik, T. Darrell, and A. A. Efros. It’s never too late: Noise optimization for collapse recovery in trained diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

[39]

R. Durall, M. Keuper, and J. Keuper. Watch your up-convolution: Cnn based generative deep neural networks are failing to reproduce spectral distributions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7890–7899, 2020

[40]

P. Stoica, R. L. Moses, et al. Spectral analysis of signals, volume 452. Pearson Prentice Hall, Upper Saddle River, NJ, 2005

[41]

S. V. Vaseghi. Advanced digital signal processing and noise reduction. John Wiley & Sons, 2008

[42]

A. Blattmann, T. Dockhorn, S. Kulal, D. Mendelevitch, M. Kilian, D. Lorenz, Y. Levi, Z. English, V. Voleti, A. Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023

[43]

L. Eyring, S. Karthik, A. Dosovitskiy, N. Ruiz, and Z. Akata. Noise hypernetworks: Amortizing test-time compute in diffusion models. In Advances in Neural Information Processing Systems (NeurIPS), 2025

[44]

D. Ahn, J. Kang, S. Lee, J. Min, M. Kim, W. Jang, H. Cho, S. Paul, S. Kim, E. Cha, et al. A noise is worth diffusion guidance. In International Conference on Learning Representations (ICLR), 2026. URL https://openreview.net/forum?id=xEWooSOgaz

[45]

P. Dhariwal and A. Nichol. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems (NeurIPS), 34:8780–8794, 2021

[46]

W. Li, X. Xu, X. Xiao, J. Liu, H. Yang, G. Li, Z. Wang, Z. Feng, Q. She, Y. Lyu, et al. Upainting: Unified text-to-image diffusion generation with cross-modal guidance. arXiv preprint arXiv:2210.16031, 2022

[47]

NVIDIA Corporation. PhysicalAI-Autonomous-Vehicles. https://huggingface.co/datasets/nvidia/PhysicalAI-Autonomous-Vehicles, 2025

[48]

D. Moura, S. Zhu, and O. Zvitia. Nexar dashcam collision prediction dataset and challenge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2583–2591, 2025

[49]

S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025

[50]

B. Li, L. Zhu, R. Tian, S. Tan, Y. Chen, Y. Lu, Y. Cui, S. Veer, M. Ehrlich, J. Philion, X. Weng, F. Xue, L. Fan, Y. Zhu, J. Kautz, A. Tao, M.-Y. Liu, S. Fidler, B. Ivanovic, T. Darrell, J. Malik, S. Han, and M. Pavone. Wolf: Dense video captioning with a world summarization framework. Transactions on Machine Learning Research, 2025. ISSN 2835-8856. U...

[51]

Y. Ma, G. Xu, X. Sun, M. Yan, J. Zhang, and R. Ji. X-clip: End-to-end multi-grained contrastive learning for video-text retrieval. In Proceedings of the 30th ACM international conference on multimedia, pages 638–647, 2022

[52]

A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y. Chen, K. Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945, 2024

[53]

S. Bai, Y. Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, W. Ge, Z. Guo, Q. Huang, J. Huang, F. Huang, B. Hui, S. Jiang, Z. Li, M. Li, M. Li, K. Li, Z. Lin, J. Lin, X. Liu, J. Liu, C. Liu, Y. Liu, D. Liu, S. Liu, D. Lu, R. Luo, C. Lv, R. Men, L. Meng, X. Ren, X. Ren, S. Song, Y. Sun, J. Tang, J. Tu, J. Wan, P. Wang, P. Wang,...

[54]

D. Li, Y. Fang, Y. Chen, S. Yang, S. Cao, J. Wong, M. Luo, X. Wang, H. Yin, J. Gonzalez, et al. Worldmodelbench: Judging video generation models as world models. In Advances in Neural Information Processing Systems (NeurIPS), volume 38, 2025

[55]

Y. Hong, K.-C. Kao, H. Zhou, and C.-J. Hsieh. Understanding reward hacking in text-to-image reinforcement learning. arXiv preprint arXiv:2601.03468, 2026

[56]

H. Tan, S. Chen, Y. Xu, Z. Wang, Y. Ji, C. Chi, Y. Lyu, Z. Zhao, X. Chen, P. Co, et al. Robo-dopamine: General process reward modeling for high-precision robotic manipulation. arXiv preprint arXiv:2512.23703, 2025

[57]

A. Liang, Y. Korkmaz, J. Zhang, M. Hwang, A. Anwar, S. Kaushik, A. Shah, A. S. Huang, L. Zettlemoyer, D. Fox, Y. Xiang, A. Li, A. Bobu, A. Gupta, S. Tu, E. Biyik, and J. Zhang. Robometer: Scaling general-purpose robotic reward models via trajectory comparisons. In Robotics: Science and Systems (RSS), 2026

[58]

Y. Song, P. Dhariwal, M. Chen, and I. Sutskever. Consistency models. In International Conference on Machine Learning (ICML), pages 32211–32252, 2023

[59]

D. Kim, C.-H. Lai, W. Liao, N. Murata, Y. Takida, T. Uesaka, Y. He, Y. Mitsufuji, and S. Ermon. Consistency trajectory models: Learning probability flow ode trajectory of diffusion. In International Conference on Learning Representations (ICLR), volume 2024, pages 44493–44525, 2024

[60]

P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. Müller, H. Saini, Y. Levi, D. Lorenz, A. Sauer, F. Boesel, D. Podell, T. Dockhorn, Z. English, and R. Rombach. Scaling rectified flow transformers for high-resolution image synthesis. In International Conference on Machine Learning (ICML), 2024. URL https://openreview.net/forum?id=FPnUhsQJ5B

[61]

C. Finn and S. Levine. Deep visual foresight for planning robot motion. In IEEE International Conference on Robotics and Automation (ICRA), pages 2786–2793, 2017

[62]

F. Ebert, C. Finn, S. Dasari, A. Xie, A. Lee, and S. Levine. Visual foresight: Model-based deep reinforcement learning for vision-based robotic control. arXiv preprint arXiv:1812.00568, 2018

[63]

D. Hafner, T. Lillicrap, I. Fischer, R. Villegas, D. Ha, H. Lee, and J. Davidson. Learning latent dynamics for planning from pixels. In International Conference on Machine Learning (ICML), 2019

[64]

P. Wu, A. Escontrela, D. Hafner, P. Abbeel, and K. Goldberg. Daydreamer: World models for physical robot learning. In Conference on Robot Learning (CoRL), pages 2226–2240, 2023

[65]

V. Micheli, E. Alonso, and F. Fleuret. Transformers are sample-efficient world models. In International Conference on Learning Representations (ICLR), 2023

[66]

J. Bruce, M. D. Dennis, A. Edwards, J. Parker-Holder, Y. Shi, E. Hughes, M. Lai, A. Mavalankar, R. Steigerwald, C. Apps, et al. Genie: Generative interactive environments. In International Conference on Machine Learning (ICML), 2024

[67]

L. Maes, Q. L. Lidec, D. Scieur, Y. LeCun, and R. Balestriero. Leworldmodel: Stable end-to-end joint-embedding predictive architecture from pixels. arXiv preprint arXiv:2603.19312, 2026

[68]

D. Hafner, T. Lillicrap, J. Ba, and M. Norouzi. Dream to control: Learning behaviors by latent imagination. In International Conference on Learning Representations (ICLR), 2020

[69]

D. Hafner, T. P. Lillicrap, M. Norouzi, and J. Ba. Mastering atari with discrete world models. In International Conference on Learning Representations (ICLR), 2021

[70]

D. Hafner, J. Pasukonis, J. Ba, and T. Lillicrap. Mastering diverse control tasks through world models. Nature, 640(8059):647–653, Apr. 2025. ISSN 1476-4687

[71]

D. Hafner, W. Yan, and T. Lillicrap. Training agents inside of scalable world models. arXiv preprint arXiv:2509.24527, 2025

[72]

K. Nakamura, L. Peters, and A. Bajcsy. Generalizing safety beyond collision-avoidance via latent-space reachability analysis. In Robotics: Science and Systems (RSS), 2025

[73]

S. Agrawal, J. Seo, K. Nakamura, R. Tian, and A. Bajcsy. Anysafe: Adapting latent safety filters at runtime via safety constraint parameterization in the latent space. In IEEE International Conference on Robotics and Automation (ICRA), 2026

[74]

M. Psenka, M. Rabbat, A. Krishnapriyan, Y. LeCun, and A. Bar. Parallel stochastic gradient-based planning for world models. arXiv preprint arXiv:2602.00475, 2026

[75]

M. Hassan, S. Stapf, A. Rahimi, P. Rezende, Y. Haghighi, D. Brüggemann, I. Katircioglu, L. Zhang, X. Chen, S. Saha, et al. Gem: A generalizable ego-vision multimodal world model for fine-grained ego-motion, object dynamics, and scene composition control. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2...

[76]

J. Walker, C. Doersch, A. Gupta, and M. Hebert. An uncertain future: Forecasting from static images using variational autoencoders. In European Conference on Computer Vision (ECCV), pages 835–851. Springer, 2016

[77]

W. Zhang, G. Wang, J. Sun, Y. Yuan, and G. Huang. Storm: Efficient stochastic transformer based world models for reinforcement learning. Advances in Neural Information Processing Systems (NeurIPS), 36:27147–27166, 2023

[78]

L. Russell, A. Hu, L. Bertoni, G. Fedoseev, J. Shotton, E. Arani, and G. Corrado. Gaia-2: A controllable multi-view generative world model for autonomous driving. arXiv preprint arXiv:2503.20523, 2025

[79]

F. Bartoccioni, E. Ramzi, V. Besnier, S. Venkataramanan, T.-H. Vu, Y. Xu, L. Chambon, S. Gidaris, S. Odabas, D. Hurych, et al. Vavim and vavam: Autonomous driving through video generative modeling. arXiv preprint arXiv:2502.15672, 2025

[80]

J. Seo, K. Nakamura, and A. Bajcsy. Uncertainty-aware latent safety filters for avoiding out-of-distribution failures. In Conference on Robot Learning (CoRL), 2025. URL https://openreview.net/forum?id=CQKxhmLobo

Showing first 80 references.
