
arXiv: 2606.00267 · v1 · pith:U22USM47new · submitted 2026-05-29 · 💻 cs.CV · cs.AI · cs.LG · cs.RO

StressDream: Steering Video World Models for Robust Policy Evaluation and Improvement

Pith reviewed 2026-06-28 22:44 UTC · model grok-4.3

classification: cs.CV, cs.AI, cs.LG, cs.RO
keywords: video world models, diffusion models, policy evaluation, robotics, autonomous driving, steering generation, stress testing

The pith

Optimizing initial noise in diffusion video models steers imaginations to high-impact events like task failures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces StressDream to steer video world models toward specified high-impact yet plausible futures by optimizing the starting noise in diffusion processes. This addresses the limitation that standard sampling from world models often misses critical outcomes, such as failures in driving or manipulation tasks. By combining a vision-language model for semantic guidance with a plausibility constraint, the method generates targeted imaginations at inference time without retraining. A sympathetic reader would care because this could allow more reliable testing and improvement of robot policies by revealing actions that have undesirable but realistic possible outcomes.

Core claim

StressDream optimizes the high-dimensional initial noise of diffusion-based video world models under two objectives: a semantic objective from a vision-language model and a plausibility objective. The resulting imaginations match text-specified high-impact events while remaining in-distribution, which enables robust policy evaluation by identifying risky actions whose plausible futures include failures.

What carries the argument

The StressDream optimization procedure, which adjusts diffusion initial noise guided by VLM semantic scores and a plausibility regularizer to steer generated video sequences toward target outcomes.
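The structure of such a procedure can be illustrated with a toy sketch. Everything here is an assumption for illustration: `semantic_score` is a linear stand-in for a VLM score, `plausibility_penalty` is a simple norm penalty rather than the paper's plausibility objective, and no diffusion model is involved. It only shows how a score-seeking gradient and a regularizer that keeps the noise near the typical Gaussian norm can be combined in one update.

```python
import math
import random


def norm(z):
    return math.sqrt(sum(x * x for x in z))


def semantic_score(z, u):
    """Toy stand-in for a VLM score: projection of z onto a unit vector u."""
    return sum(x * y for x, y in zip(z, u))


def plausibility_penalty(z):
    """Squared deviation of ||z|| from sqrt(d), the norm typical of N(0, I)."""
    return (norm(z) - math.sqrt(len(z))) ** 2


def steer_noise(z0, u, steps=20, lr=0.5, weight=1.0):
    """Gradient ascent on semantic_score(z, u) - weight * plausibility_penalty(z)."""
    z = list(z0)
    target = math.sqrt(len(z))
    for _ in range(steps):
        r = norm(z)
        # Gradient of the penalty is 2 * (r - target) * z / r; gradient of the score is u.
        pull = 2.0 * (r - target) / r
        z = [x + lr * (y - weight * pull * x) for x, y in zip(z, u)]
    return z


random.seed(0)
d = 64
z0 = [random.gauss(0.0, 1.0) for _ in range(d)]
u = [1.0 / math.sqrt(d)] * d  # hypothetical "target event" direction
z = steer_noise(z0, u)
print(semantic_score(z0, u), semantic_score(z, u), norm(z))
```

With these particular step sizes the penalty rescales the noise back to norm sqrt(d) at each step before the score gradient is added, so the score rises while the norm stays within 0.5 of sqrt(d). A real system would replace both toy functions with model-based objectives.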

If this is right

  • Policy evaluation can identify actions that lead to undesirable outcomes in some plausible futures without drawing many samples.
  • Policy improvement can prioritize actions that avoid futures containing specified failures.
  • Video world models can be used for targeted stress-testing of autonomous systems in driving and manipulation domains.
  • Steering happens at inference time without modifying the underlying world model.
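The first two implications amount to scoring an action by its worst plausible imagined future instead of its nominal one. A minimal sketch of that selection rule follows; the action names and safety scores are hypothetical, and the paper's actual evaluation pipeline is not reproduced here.

```python
def nominal_value(scores):
    """Score an action by its nominal imagination only (index 0 by convention here)."""
    return scores[0]


def worst_case_value(scores):
    """Score an action by its least safe imagined future."""
    return min(scores)


def select_action(candidates, evaluate):
    """Return the action name whose imagined futures score highest under `evaluate`."""
    return max(candidates, key=lambda name: evaluate(candidates[name]))


# Hypothetical safety scores in [0, 1], higher is safer.
# Index 0 is a nominal imagination; the rest are steered toward failure.
candidates = {
    "fast_pour": [0.9, 0.2, 0.3],
    "slow_pour": [0.8, 0.7, 0.75],
}

print(select_action(candidates, nominal_value))     # fast_pour
print(select_action(candidates, worst_case_value))  # slow_pour
```

Nominal evaluation prefers the action that looks best in a single typical rollout, while the worst-case rule prefers the action that stays acceptable even in its steered failure imaginations.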

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar steering techniques might apply to other generative models beyond video, such as for text or audio generation in safety-critical applications.
  • The approach could be extended to optimize for multiple conflicting objectives simultaneously in policy testing.
  • Real-world validation would involve deploying the evaluated policies on physical robots to check if the identified risks correlate with actual failures.

Load-bearing premise

That gradients from the vision-language model can effectively guide high-dimensional noise optimization to produce nuanced scene-specific events without causing the generated videos to become implausible or out-of-distribution.
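One reason this premise matters can be seen in a small sketch. The Figure 12 caption describes the VLM score as the token probability of answering "Yes". Assuming, purely for illustration, that this probability is a softmax over two answer logits (a real VLM normalizes over its full vocabulary), the gradient of the score with respect to the "Yes" logit shrinks toward zero once the model is confident either way.

```python
import math


def yes_probability(logit_yes, logit_no):
    """Softmax over two answer tokens, returning P("Yes")."""
    m = max(logit_yes, logit_no)  # subtract the max for numerical stability
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)


def yes_probability_grad(logit_yes, logit_no):
    """Derivative of P("Yes") with respect to logit_yes, which equals p * (1 - p)."""
    p = yes_probability(logit_yes, logit_no)
    return p * (1.0 - p)


print(yes_probability(0.0, 0.0), yes_probability_grad(0.0, 0.0))
print(yes_probability(10.0, -10.0), yes_probability_grad(10.0, -10.0))
```

The gradient is largest when the model is undecided and nearly vanishes when it is saturated, so whether a given scoring rule yields informative gradients depends on where the generated video sits on this curve.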

What would settle it

Generate steered videos for a specified failure event and check if independent human or VLM raters confirm that the event occurs in the video at rates significantly above unsteered samples, while also confirming visual plausibility.
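The rate comparison in this check could be run as a one-sided two-proportion z-test. The sketch below is one conventional choice, not something the paper specifies, and the counts in the usage line are hypothetical.

```python
import math


def two_proportion_z(hits_a, n_a, hits_b, n_b):
    """One-sided z-test that rate A exceeds rate B, using a pooled standard error."""
    p_a, p_b = hits_a / n_a, hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if se == 0.0:
        raise ValueError("rates are identical at 0 or 1; the test is undefined")
    z = (p_a - p_b) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))  # P(Z >= z) for standard normal Z
    return z, p_value


# Hypothetical rater counts: event confirmed in 30 of 40 steered videos
# and in 10 of 40 unsteered videos.
z_stat, p_value = two_proportion_z(30, 40, 10, 40)
print(z_stat, p_value)
```

The normal approximation is only reasonable for moderately large counts; with a handful of videos an exact test would be a safer choice.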

Figures

Figures reproduced from arXiv: 2606.00267 by Andrea Bajcsy, Apoorva Sharma, Edward Schmerling, Junwon Seo, Karen Leung, Marco Pavone, Ran Tian, Sushant Veer, Wenhao Ding.

Figure 1: Overview of STRESSDREAM. (Top): The initial noise of diffusion-based WMs is optimized to steer imaginations toward a target event specified by an inference-time prompt. While nominal imaginations may miss high-impact outcomes of robot actions, such as spilling, STRESSDREAM uses VLM guidance to steer imaginations toward such outcomes while preventing the high-dimensional noise from drifting into OOD, where …
Figure 2: Results: Naughty Dubins Car. (a): TPR and TNR for detecting possible failures in WM imaginations. (b): Overlaid nominal and steered imaginations. STRESSDREAM steers WM imaginations, detecting plausible failures that nominal imagination misses. See Appendix C & Website for more results and details. Experiment Setup. We aim to detect action sequences that have a possibility of entering the failure set u…
Figure 3: Nominal vs. Steered Imaginations. Top texts show the inference-time prompts describing target outcomes. STRESSDREAM steers WM imaginations toward specified high-impact outcomes that nominal generation misses. Imaginations are grounded in plausible outcomes: when target outcomes are not supported by the WM distribution, e.g., spilling sticky candies or from a closed bag, STRESSDREAM does not imagine them. …
Figure 4: Steering Driving Video World Models. STRESSDREAM steers WM imaginations toward the target events more effectively than random sampling (Best-of-N), while C_pla helps preserve video quality. Evaluation Setup. We generate imaginations from an initial observation o_0 and action sequence a, and evaluate whether the generated video matches the inference-time text specification l of the target event. For drivin…
Figure 5: Steering Ctrl-World: Detecting task failures in WM imaginations. STRESSDREAM robustly detects high-impact outcomes in imaginations
Figure 6: Steering to collision: STRESSDREAM can induce target events only when they are plausible. Steering is grounded in plausible outcomes. We study whether steering remains grounded in plausible outcomes supported by the WM’s distribution, rather than hallucinating implausible futures. To test this, we steer generations toward collision events using both the collision-finetuned Vista model and the base Vista …
Figure 7: Policy Improvement via STRESSDREAM. Fine-tuning π0.5 with steered WM imaginations favors robust actions that remain successful under worst-case plausible outcomes, whereas nominal fine-tuning can propose actions where failures are plausible under the outcome distribution. Videos available at the website. spilling for the open sticky candy bag or closed coffee bag, but does so for the open coffee bag. Since…
Figure 8: π0.5 Rollout Results. STRESSDREAM improves policies by promoting robust actions
Figure 9: Probability density vs. mass in a high-dimensional Gaussian distribution. Noise sampled from the typical set leads to plausible generations, whereas noise outside the typical set, despite having high probability density, produces implausible generations, e.g., humans becoming blurry or transforming into vehicles. A common way to encourage typicality is to regularize the norm of the optimized noise, since Gau…
Figure 10: Examples of atypical Gaussian noise. Left: typical noise sampled from a Gaussian prior. Others: noises perturbed to violate norm concentration, isotropy, and spectral whiteness, respectively
Figure 11: VLM score for driving scenes. The three videos are generated from the same scene but exhibit different temporal relationships to the truck in the ego lane. Both the Qwen-VL-based model and X-CLIP correctly distinguish these temporal relationships. VLMs also provide semantically meaningful gradient signals: task-specific reward models trained on limited datasets may fail to cover the large space of possi…
Figure 12: Robotic Manipulation: VLM scores for video understanding. The plot shows the token probability of answering “Yes” to a given prompt. The VLM takes videos from three camera views simultaneously and assigns high scores when the specified events occur, correctly detecting task-relevant moments. video. We find that using all three camera views simultaneously is essential for reliable scoring: for the same scen…
Figure 13: Average VLM scores across optimization steps in Vista. With the gradient approximation, STRESSDREAM successfully increases the criterion values, demonstrating the effectiveness of the proposed approximation. Efficacy of Gradient Approximation for Noise Optimization. The noise-gradient approximation enables efficient gradient-based steering without backpropagating through the iterative denoising proc…
Figure 14: Noise optimization with approximate vs. exact gradients. For Ctrl-World [12], where the full noise gradient can be computed on an H100 GPU using gradient checkpointing [119], we compare noise optimization with and without the gradient approximation
Figure 15: Average minimum safety scores over imagined trajectories. Results: Pessimistic Steering
Figure 16: Ablation: Optimistic Steering of Naughty Dubins Car. While Sec. 5 focuses on pessimistic steering to detect potential entries into the failure set, we also ablate the opposite direction by performing optimistic steering. Specifically, we change the sign of the steering objective so that optimization seeks imaginations with higher safety scores
Figure 17: Additional Results: Nominal Generation vs. Steered Generation. Given the same observation history and action sequence, STRESSDREAM steers the video world model toward plausible task-failure events when failure is possible, whereas nominal generation often misses these failure modes. Initial Frame Nominal imagination predicts the left vehicle yielding, whereas pessimistic imagination predicts it merging wi…
Figure 18: Robust Policy Evaluation with Video World Models. Steering the video world model enables robust detection of safety-critical or task-failure events in imagined futures that nominal generation may miss, thereby enabling the selection of robust action sequences that remain safe even under worst-case imaginations. Robust Policy Evaluation with Pessimistic Imaginations
Figure 19: Quantitative: Driving WM. Detailed Results: Driving Video World Model
Figure 20: Video generation quality of driving world-model generations evaluated with WorldLens [127]. Specifically, we evaluate Subject Consistency, Depth Discrepancy, and Temporal Consistency. Subject Consistency uses DINO features [128] to measure whether dynamic subjects maintain consistent texture, shape, and structure over time. Depth Discrepancy measures the temporal stability of depth representations infe…
Figure 21: Robometer scores for imagined evaluation trajectories. Left: Robometer task-progress. Right: Robometer task-success. STRESSDREAM effectively generates more pessimistic imaginations than the baselines, yielding lower task-progress and task-success scores for the same action inputs.
[Plot residue: True Positive Rate (0% to 100%) by task: Block Stack, Utensil Pick, Knife Put, Coffee Bean, Coffee Bag, Candy Ba…]
Figure 23: Target-alignment score on long-tailed events in driving video world models. To evaluate this, we sample 5 evaluation examples from long-tailed driving events in the evaluation set, including pedestrian crossing, traffic-light change, and collision. As a baseline, we generate 40 random samples and report the best result among them, while STRESSDREAM performs 20 optimization steps. We evaluate target alig…
Figure 24: Imaginations near collision from the base model and the collision-fine-tuned world model
Figure 25: Qualitative example: steering toward reduced distance to the lead vehicle. STRESSDREAM produces a plausible outcome, whereas without C_pla, it leads to implausible hallucinations.
Figure 26: Task-wise success rates of fine-tuned π0.5-DROID policies. Robust fine-tuning with pessimistic video world-model imaginations improves policy success rates across tasks compared to nominal fine-tuning
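
The density-versus-mass point in the Figure 9 caption can be checked numerically for a standard Gaussian. The sketch below is a generic illustration rather than code from the paper; the dimension and sample count are arbitrary choices. It shows that sampled noise has norm close to sqrt(d), while the origin, which has the highest density of any point, lies far from where samples actually fall.

```python
import math
import random


def log_density(z):
    """Log density of the standard Gaussian N(0, I) at z."""
    d = len(z)
    return -0.5 * sum(x * x for x in z) - 0.5 * d * math.log(2.0 * math.pi)


random.seed(0)
d = 4096  # arbitrary illustrative dimension
samples = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(50)]
norms = [math.sqrt(sum(x * x for x in z)) for z in samples]
origin = [0.0] * d

print("sqrt(d):", math.sqrt(d))
print("sample norm range:", min(norms), max(norms))
print("log density at origin:", log_density(origin))
print("max log density among samples:", max(log_density(z) for z in samples))
```

Every sample has far lower density than the origin, yet no sample comes anywhere near it, which is why a norm-based regularizer targets sqrt(d) and not zero.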
Original abstract

Video world models (WMs) have shown promise for policy evaluation and improvement by imagining realistic future observations conditioned on ego-robot actions. While WMs can model distributions over futures, policy evaluation and improvement typically rely on nominal imaginations, which can miss high-impact outcomes of robot actions unless prohibitively many samples are drawn. To enable robust policy evaluation and improvement over WM imaginations, we propose StressDream, which steers imaginations toward high-impact yet plausible outcomes specified at inference time by optimizing the initial noise of diffusion-based WMs. However, optimizing high-dimensional noise is challenging: the optimization must reason about nuanced, scene-dependent target events in generated videos while avoiding out-of-distribution (OOD) noise that yields implausible imaginations. We address this with two complementary objectives: a semantic objective with a Vision-Language Model that provides informative gradients by reasoning about the generated video, and a plausibility objective that prevents the optimized noise from drifting OOD. With state-of-the-art video world models for autonomous driving and robotic manipulation, we show that StressDream effectively steers imaginations toward high-impact yet plausible outcomes specified by text at inference time, such as task failures, enabling robust policy evaluation and improvement by identifying actions whose plausible futures include undesirable outcomes. Video results are available at https://junwon.me/StressDream/.
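
The remark about "prohibitively many samples" can be made concrete with a short calculation. Assuming independent samples and a fixed per-sample probability of the event (both simplifying assumptions, with a hypothetical probability in the usage lines), the chance that N samples all miss the event is (1 - p)^N.

```python
import math


def miss_probability(p_event, n_samples):
    """Probability that n independent samples all miss an event of probability p_event."""
    return (1.0 - p_event) ** n_samples


def samples_needed(p_event, confidence):
    """Smallest N such that at least one of N samples shows the event with the given confidence."""
    if not (0.0 < p_event < 1.0 and 0.0 < confidence < 1.0):
        raise ValueError("p_event and confidence must lie strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_event))


# Hypothetical: a failure that appears in 1% of nominal imaginations.
print(samples_needed(0.01, 0.95))
print(miss_probability(0.01, 40))
```

For an event this rare, a few hundred video generations per action would be needed to see it with 95% confidence, which is the cost that targeted steering aims to avoid.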

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes StressDream, a technique for steering diffusion-based video world models at inference time by optimizing the initial noise vector. It combines a VLM-driven semantic objective (to target high-impact events such as task failures specified by text) with a plausibility objective (to keep the optimized noise in-distribution). Experiments on state-of-the-art WMs for autonomous driving and robotic manipulation are claimed to show that the steered imaginations enable more robust policy evaluation and improvement by surfacing undesirable yet plausible futures.

Significance. If the optimization reliably produces in-distribution samples that realize the specified high-impact events, the approach would meaningfully extend the utility of video WMs beyond nominal rollouts, allowing text-specified stress testing without exhaustive sampling. The combination of VLM semantic guidance and a plausibility regularizer is a reasonable design choice, but the absence of certified manifold proximity or strong empirical controls on OOD leakage limits the strength of the policy-evaluation guarantee.

major comments (2)
  1. [Section 3.2, Eq. (4)] The joint semantic + plausibility objective is presented as sufficient to keep optimized noise inside the WM training distribution, yet the plausibility term is only a soft reconstruction/feature-matching penalty with no Lipschitz bound, distance-to-manifold certificate, or rejection sampling step. Gradient descent in the high-dimensional noise space can therefore satisfy the VLM captioning objective via spurious correlations while producing OOD z, directly undermining the claim that the resulting futures are valid WM samples for policy evaluation.
  2. The central policy-evaluation claim (identifying actions whose plausible futures include undesirable outcomes) rests on the steered videos being both high-impact and in-distribution. No quantitative results, ablation tables, or distribution-distance metrics (e.g., FID, reconstruction error histograms, or classifier-based OOD scores) are referenced that would demonstrate the plausibility term actually prevents exploitation of VLM gradients; without such evidence the load-bearing assumption remains unverified.
minor comments (2)
  1. The abstract states that video results are available at an external link; the main text should include at least one quantitative table summarizing success rates or failure-detection metrics across the driving and manipulation domains.
  2. Notation for the two loss terms and the noise variable z should be introduced once in Section 3 and used consistently thereafter to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback on the optimization guarantees and empirical validation of the plausibility term. We address each major comment below, providing clarifications from the manuscript while agreeing to strengthen the presentation where evidence is indirect.

Point-by-point responses
  1. Referee: [Section 3.2, Eq. (4)] The joint semantic + plausibility objective is presented as sufficient to keep optimized noise inside the WM training distribution, yet the plausibility term is only a soft reconstruction/feature-matching penalty with no Lipschitz bound, distance-to-manifold certificate, or rejection sampling step. Gradient descent in the high-dimensional noise space can therefore satisfy the VLM captioning objective via spurious correlations while producing OOD z, directly undermining the claim that the resulting futures are valid WM samples for policy evaluation.

    Authors: We agree that Eq. (4) relies on a soft plausibility penalty without formal certificates such as Lipschitz continuity or manifold distance bounds. The design choice was motivated by the need for a differentiable regularizer compatible with gradient-based optimization of z; stronger constraints like rejection sampling would break differentiability. Section 3.2 explicitly frames the term as a practical regularizer rather than a certified guarantee. We will revise the text to more clearly state this limitation and avoid implying formal in-distribution guarantees. revision: yes

  2. Referee: The central policy-evaluation claim (identifying actions whose plausible futures include undesirable outcomes) rests on the steered videos being both high-impact and in-distribution. No quantitative results, ablation tables, or distribution-distance metrics (e.g., FID, reconstruction error histograms, or classifier-based OOD scores) are referenced that would demonstrate the plausibility term actually prevents exploitation of VLM gradients; without such evidence the load-bearing assumption remains unverified.

    Authors: The manuscript presents qualitative comparisons and ablations in Section 4 showing that removing the plausibility term leads to visibly implausible generations while the combined objective produces coherent high-impact events. However, we acknowledge the absence of explicit quantitative distribution metrics such as FID or OOD classifier scores in the current version. We will add these metrics (computed on held-out validation sets) and corresponding ablation tables in the revised manuscript to directly quantify the effect of the plausibility term on OOD leakage. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper introduces StressDream as an inference-time optimization technique over diffusion noise using a VLM semantic loss and a plausibility regularizer. No equations, derivations, or predictions are shown that reduce by construction to fitted inputs, self-definitions, or self-citation chains. The method description relies on external pretrained components (diffusion WMs, VLMs) without renaming known results or importing uniqueness theorems from the authors' prior work. The central contribution is an empirical steering procedure whose validity is assessed via downstream policy evaluation experiments rather than any self-referential mathematical reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated or derivable from the provided text.

pith-pipeline@v0.9.1-grok · 5792 in / 1088 out tokens · 17741 ms · 2026-06-28T22:44:10.405385+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. NoiseTilt: Noise-Tilted Reverse Kernels for Diffusion Reward Alignment

    cs.LG · 2026-06 · unverdicted · novelty 6.0

    NTRK uses a whitening operator to tilt the noise term in diffusion reverse kernels for reward guidance, outperforming baselines with 20x fewer steps on aesthetic tasks.

Reference graph

Works this paper leans on

134 extracted references · 1 canonical work pages · cited by 1 Pith paper · 1 internal anchor

  [1] N. Agarwal, A. Ali, M. Bala, Y. Balaji, E. Barker, T. Cai, P. Chattopadhyay, Y. Chen, Y. Cui, Y. Ding, et al. Cosmos world foundation model platform for physical ai. arXiv preprint arXiv:2501.03575, 2025
  [2] T. Wan, A. Wang, B. Ai, B. Wen, C. Mao, C.-W. Xie, D. Chen, F. Yu, H. Zhao, J. Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025
  [3] Z. Mei, T. Yin, O. Shorinwa, A. Badithela, Z. Zheng, J. Bruno, M. Bland, L. Zha, A. Hancock, J. F. Fisac, et al. Video generation models in robotics-applications, research challenges, future directions. arXiv preprint arXiv:2601.07823, 2026
  [4] J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems (NeurIPS), 33:6840–6851, 2020
  [5] T. Karras, M. Aittala, T. Aila, and S. Laine. Elucidating the design space of diffusion-based generative models. Advances in Neural Information Processing Systems (NeurIPS), 35:26565–26577, 2022
  [6] Y. Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling. In International Conference on Learning Representations (ICLR), 2023. URL https://openreview.net/forum?id=PqvMRDCJT9t
  [7] E. Alonso, A. Jelley, V. Micheli, A. Kanervisto, A. Storkey, T. Pearce, and F. Fleuret. Diffusion for world modeling: Visual details matter in atari. Advances in Neural Information Processing Systems (NeurIPS), 37:58757–58791, 2024
  [8] B. Chen, D. Martí Monsó, Y. Du, M. Simchowitz, R. Tedrake, and V. Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion. Advances in Neural Information Processing Systems (NeurIPS), 37:24081–24125, 2024
  [9] S. Gao, S. Zhou, Y. Du, J. Zhang, and C. Gan. Adaworld: Learning adaptable world models with latent actions. In International Conference on Machine Learning (ICML), pages 18744–18771, 2025
  [10] T. Kerssies, G. Berton, J. He, Q. Yu, W. Ma, D. de Geus, G. Dubbelman, and L.-C. Chen. A frame is worth one token: Efficient generative world modeling with delta tokens. arXiv preprint arXiv:2604.04913, 2026
  [11] S. Gao, J. Yang, L. Chen, K. Chitta, Y. Qiu, A. Geiger, J. Zhang, and H. Li. Vista: A generalizable driving world model with high fidelity and versatile controllability. Advances in Neural Information Processing Systems (NeurIPS), 37:91560–91596, 2024
  [12] Y. Guo, L. X. Shi, J. Chen, and C. Finn. Ctrl-world: A controllable generative world model for robot manipulation. In International Conference on Learning Representations (ICLR). URL https://openreview.net/forum?id=748bHL2BAv

  [14] Y. Guo, T. Lee, L. X. Shi, J. Chen, P. Liang, and C. Finn. Vlaw: Iterative co-improvement of vision-language-action policy and world model. arXiv preprint arXiv:2602.12063, 2026
  [15] K. Zhang, Z. Tang, X. Hu, X. Pan, X. Guo, Y. Liu, J. Huang, L. Yuan, Q. Zhang, X.-X. Long, et al. Epona: Autoregressive diffusion world model for autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 27220–27230, 2025
  [16] J. Quevedo, A. K. Sharma, Y. Sun, V. Suryavanshi, P. Liang, and S. Yang. Worldgym: World model as an environment for policy evaluation. arXiv preprint arXiv:2506.00613, 2025
  [17] G. R. Team, K. Choromanski, C. Devin, Y. Du, D. Dwibedi, R. Gao, A. Jindal, T. Kipf, S. Kirmani, I. Leal, et al. Evaluating gemini robotics policies in a veo world simulator. arXiv preprint arXiv:2512.10675, 2025
  [18] A. K. Sharma, Y. Sun, N. Lu, Y. Zhang, J. Liu, and S. Yang. World-gymnast: Training robots with reinforcement learning in a world model. arXiv preprint arXiv:2602.02454, 2026
  [19] D. Samuel, R. Ben-Ari, N. Darshan, H. Maron, and G. Chechik. Norm-guided latent space exploration for text-to-image generation. Advances in Neural Information Processing Systems (NeurIPS), 36:57863–57875, 2023
  [20] Z. Tang, J. Peng, J. Tang, M. Hong, F. Wang, and T.-H. Chang. Inference-time alignment of diffusion models with direct noise optimization. In International Conference on Machine Learning (ICML), 2025. URL https://openreview.net/forum?id=JpbqiD7n9r
  [21] P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, et al. π0.5: A vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025
  [22] Y. Li, Y. Zhu, J. Wen, C. Shen, and Y. Xu. Worldeval: World model as real-world robot policies evaluator. arXiv preprint arXiv:2505.19017, 2025
  [23] M. J. Kim, Y. Gao, T.-Y. Lin, Y.-C. Lin, Y. Ge, G. Lam, P. Liang, S. Song, M.-Y. Liu, C. Finn, and J. Gu. Cosmos policy: Fine-tuning video models for visuomotor control and planning. In International Conference on Learning Representations (ICLR), 2026. URL https://openreview.net/forum?id=wPEIStHxYH
  [24] T. Yin, Z. Mei, Z. Zheng, M. Yamane, D. Wang, J. Sceats, S. M. Bateman, L. Zha, A. Badithela, O. Shorinwa, et al. Playworld: Learning robot world models from autonomous play. arXiv preprint arXiv:2603.09030, 2026
  [25] J. Song, C. Meng, and S. Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations (ICLR), 2021. URL https://openreview.net/forum?id=St1giarCHLP
  [26] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations (ICLR), 2021. URL https://openreview.net/forum?id=PxTIG12RRHS
  [27] G. Zhou, H. Pan, Y. LeCun, and L. Pinto. DINO-WM: World models on pre-trained visual features enable zero-shot planning. In International Conference on Machine Learning (ICML), 2025. URL https://openreview.net/forum?id=D5RNACOZEI
  [28] N. Hansen, H. Su, and X. Wang. Td-mpc2: Scalable, robust world models for continuous control. In International Conference on Learning Representations (ICLR), 2024

  [29] J. Moos, K. Hansel, H. Abdulsamad, S. Stark, D. Clever, and J. Peters. Robust reinforcement learning: A review of foundations and recent advances. Machine Learning and Knowledge Extraction, 4(1):276–315, 2022
  [30] P. Akella, A. Dixit, M. Ahmadi, L. Lindemann, M. P. Chapman, G. J. Pappas, A. D. Ames, and J. W. Burdick. Risk-aware robotics: Tail risk measures in planning, control, and verification [focus on education]. IEEE Control Systems, 45(4):46–78, 2025
  [31] K. Karunratanakul, K. Preechakul, E. Aksan, T. Beeler, S. Suwajanakorn, and S. Tang. Optimizing diffusion noise can serve as universal motion priors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1334–1345, 2024
  [32] L. Eyring, S. Karthik, K. Roth, A. Dosovitskiy, and Z. Akata. Reno: Enhancing one-step text-to-image models through reward-based noise optimization. Advances in Neural Information Processing Systems (NeurIPS), 37:125487–125519, 2024
  [33] J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, and J. Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023
  [34] T. M. Cover. Elements of information theory. John Wiley & Sons, 1999
  [35] M. Betancourt. A conceptual introduction to hamiltonian monte carlo. arXiv preprint arXiv:1701.02434, 2017
  [36] E. Nalisnick, A. Matsukawa, Y. W. Teh, and B. Lakshminarayanan. Detecting out-of-distribution inputs to deep generative models using typicality. arXiv preprint arXiv:1906.02994, 2019
  [37] D. Samuel, R. Ben-Ari, S. Raviv, N. Darshan, and G. Chechik. Generating images of rare concepts using pre-trained diffusion models. In Proceedings of the AAAI Conference on Artificial Intelligence, 2024
  [38] A. Harrington, A. S. Koepke, S. Karthik, T. Darrell, and A. A. Efros. It’s never too late: Noise optimization for collapse recovery in trained diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026
  [39] R. Durall, M. Keuper, and J. Keuper. Watch your up-convolution: Cnn based generative deep neural networks are failing to reproduce spectral distributions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7890–7899, 2020
  [40] P. Stoica, R. L. Moses, et al. Spectral analysis of signals, volume 452. Pearson Prentice Hall Upper Saddle River, NJ, 2005
  [41] S. V. Vaseghi. Advanced digital signal processing and noise reduction. John Wiley & Sons, 2008
  [42] A. Blattmann, T. Dockhorn, S. Kulal, D. Mendelevitch, M. Kilian, D. Lorenz, Y. Levi, Z. English, V. Voleti, A. Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023
  [43] L. Eyring, S. Karthik, A. Dosovitskiy, N. Ruiz, and Z. Akata. Noise hypernetworks: Amortizing test-time compute in diffusion models. In Advances in Neural Information Processing Systems (NeurIPS), 2025
  [44] D. Ahn, J. Kang, S. Lee, J. Min, M. Kim, W. Jang, H. Cho, S. Paul, S. Kim, E. Cha, et al. A noise is worth diffusion guidance. In International Conference on Learning Representations (ICLR), 2026. URL https://openreview.net/forum?id=xEWooSOgaz

  45. [45]

    P. Dhariwal and A. Nichol. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems (NeurIPS), 34:8780–8794, 2021

  46. [46]

    W. Li, X. Xu, X. Xiao, J. Liu, H. Yang, G. Li, Z. Wang, Z. Feng, Q. She, Y. Lyu, et al. Upainting: Unified text-to-image diffusion generation with cross-modal guidance. arXiv preprint arXiv:2210.16031, 2022

  47. [47]

    NVIDIA Corporation. PhysicalAI-Autonomous-Vehicles. https://huggingface.co/datasets/nvidia/PhysicalAI-Autonomous-Vehicles, 2025

  48. [48]

    D. Moura, S. Zhu, and O. Zvitia. Nexar dashcam collision prediction dataset and challenge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2583–2591, 2025

  49. [49]

    S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025

  50. [50]

    B. Li, L. Zhu, R. Tian, S. Tan, Y. Chen, Y. Lu, Y. Cui, S. Veer, M. Ehrlich, J. Philion, X. Weng, F. Xue, L. Fan, Y. Zhu, J. Kautz, A. Tao, M.-Y. Liu, S. Fidler, B. Ivanovic, T. Darrell, J. Malik, S. Han, and M. Pavone. Wolf: Dense video captioning with a world summarization framework. Transactions on Machine Learning Research, 2025. ISSN 2835-8856. U...

  51. [51]

    Y. Ma, G. Xu, X. Sun, M. Yan, J. Zhang, and R. Ji. X-clip: End-to-end multi-grained contrastive learning for video-text retrieval. In Proceedings of the 30th ACM international conference on multimedia, pages 638–647, 2022

  52. [52]

    A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y. Chen, K. Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945, 2024

  53. [53]

    S. Bai, Y. Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, W. Ge, Z. Guo, Q. Huang, J. Huang, F. Huang, B. Hui, S. Jiang, Z. Li, M. Li, M. Li, K. Li, Z. Lin, J. Lin, X. Liu, J. Liu, C. Liu, Y. Liu, D. Liu, S. Liu, D. Lu, R. Luo, C. Lv, R. Men, L. Meng, X. Ren, X. Ren, S. Song, Y. Sun, J. Tang, J. Tu, J. Wan, P. Wang, P. Wang,...

  54. [54]

    D. Li, Y. Fang, Y. Chen, S. Yang, S. Cao, J. Wong, M. Luo, X. Wang, H. Yin, J. Gonzalez, et al. Worldmodelbench: Judging video generation models as world models. In Advances in Neural Information Processing Systems (NeurIPS), volume 38, 2025

  55. [55]

    Y. Hong, K.-C. Kao, H. Zhou, and C.-J. Hsieh. Understanding reward hacking in text-to-image reinforcement learning. arXiv preprint arXiv:2601.03468, 2026

  56. [56]

    H. Tan, S. Chen, Y. Xu, Z. Wang, Y. Ji, C. Chi, Y. Lyu, Z. Zhao, X. Chen, P. Co, et al. Robo-dopamine: General process reward modeling for high-precision robotic manipulation. arXiv preprint arXiv:2512.23703, 2025

  57. [57]

    A. Liang, Y. Korkmaz, J. Zhang, M. Hwang, A. Anwar, S. Kaushik, A. Shah, A. S. Huang, L. Zettlemoyer, D. Fox, Y. Xiang, A. Li, A. Bobu, A. Gupta, S. Tu, E. Biyik, and J. Zhang. Robometer: Scaling general-purpose robotic reward models via trajectory comparisons. In Robotics: Science and Systems (RSS), 2026

  58. [58]

    Y. Song, P. Dhariwal, M. Chen, and I. Sutskever. Consistency models. In International Conference on Machine Learning (ICML), pages 32211–32252, 2023

  59. [59]

    D. Kim, C.-H. Lai, W. Liao, N. Murata, Y. Takida, T. Uesaka, Y. He, Y. Mitsufuji, and S. Ermon. Consistency trajectory models: Learning probability flow ode trajectory of diffusion. In International Conference on Learning Representations (ICLR), volume 2024, pages 44493–44525, 2024

  60. [60]

    P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. Müller, H. Saini, Y. Levi, D. Lorenz, A. Sauer, F. Boesel, D. Podell, T. Dockhorn, Z. English, and R. Rombach. Scaling rectified flow transformers for high-resolution image synthesis. In International Conference on Machine Learning (ICML), 2024. URL https://openreview.net/forum?id=FPnUhsQJ5B

  61. [61]

    C. Finn and S. Levine. Deep visual foresight for planning robot motion. In IEEE International Conference on Robotics and Automation (ICRA), pages 2786–2793, 2017

  62. [62]

    F. Ebert, C. Finn, S. Dasari, A. Xie, A. Lee, and S. Levine. Visual foresight: Model-based deep reinforcement learning for vision-based robotic control. arXiv preprint arXiv:1812.00568, 2018

  63. [63]

    D. Hafner, T. Lillicrap, I. Fischer, R. Villegas, D. Ha, H. Lee, and J. Davidson. Learning latent dynamics for planning from pixels. In International Conference on Machine Learning (ICML), 2019

  64. [64]

    P. Wu, A. Escontrela, D. Hafner, P. Abbeel, and K. Goldberg. Daydreamer: World models for physical robot learning. In Conference on Robot Learning (CoRL), pages 2226–2240, 2023

  65. [65]

    V. Micheli, E. Alonso, and F. Fleuret. Transformers are sample-efficient world models. In International Conference on Learning Representations (ICLR), 2023

  66. [66]

    J. Bruce, M. D. Dennis, A. Edwards, J. Parker-Holder, Y. Shi, E. Hughes, M. Lai, A. Mavalankar, R. Steigerwald, C. Apps, et al. Genie: Generative interactive environments. In International Conference on Machine Learning (ICML), 2024

  67. [67]

    L. Maes, Q. L. Lidec, D. Scieur, Y. LeCun, and R. Balestriero. Leworldmodel: Stable end-to-end joint-embedding predictive architecture from pixels. arXiv preprint arXiv:2603.19312, 2026

  68. [68]

    D. Hafner, T. Lillicrap, J. Ba, and M. Norouzi. Dream to control: Learning behaviors by latent imagination. In International Conference on Learning Representations (ICLR), 2020

  69. [69]

    D. Hafner, T. P. Lillicrap, M. Norouzi, and J. Ba. Mastering atari with discrete world models. In International Conference on Learning Representations (ICLR), 2021

  70. [70]

    D. Hafner, J. Pasukonis, J. Ba, and T. Lillicrap. Mastering diverse control tasks through world models. Nature, 640(8059):647–653, Apr. 2025. ISSN 1476-4687

  71. [71]

    D. Hafner, W. Yan, and T. Lillicrap. Training agents inside of scalable world models. arXiv preprint arXiv:2509.24527, 2025

  72. [72]

    K. Nakamura, L. Peters, and A. Bajcsy. Generalizing safety beyond collision-avoidance via latent-space reachability analysis. In Robotics: Science and Systems (RSS), 2025

  73. [73]

    S. Agrawal, J. Seo, K. Nakamura, R. Tian, and A. Bajcsy. Anysafe: Adapting latent safety filters at runtime via safety constraint parameterization in the latent space. In IEEE International Conference on Robotics and Automation (ICRA), 2026

  74. [74]

    M. Psenka, M. Rabbat, A. Krishnapriyan, Y. LeCun, and A. Bar. Parallel stochastic gradient-based planning for world models. arXiv preprint arXiv:2602.00475, 2026

  75. [75]

    M. Hassan, S. Stapf, A. Rahimi, P. Rezende, Y. Haghighi, D. Brüggemann, I. Katircioglu, L. Zhang, X. Chen, S. Saha, et al. Gem: A generalizable ego-vision multimodal world model for fine-grained ego-motion, object dynamics, and scene composition control. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2...

  76. [76]

    J. Walker, C. Doersch, A. Gupta, and M. Hebert. An uncertain future: Forecasting from static images using variational autoencoders. In European Conference on Computer Vision (ECCV), pages 835–851. Springer, 2016

  77. [77]

    W. Zhang, G. Wang, J. Sun, Y. Yuan, and G. Huang. Storm: Efficient stochastic transformer based world models for reinforcement learning. Advances in Neural Information Processing Systems (NeurIPS), 36:27147–27166, 2023

  78. [78]

    L. Russell, A. Hu, L. Bertoni, G. Fedoseev, J. Shotton, E. Arani, and G. Corrado. Gaia-2: A controllable multi-view generative world model for autonomous driving. arXiv preprint arXiv:2503.20523, 2025

  79. [79]

    F. Bartoccioni, E. Ramzi, V. Besnier, S. Venkataramanan, T.-H. Vu, Y. Xu, L. Chambon, S. Gidaris, S. Odabas, D. Hurych, et al. Vavim and vavam: Autonomous driving through video generative modeling. arXiv preprint arXiv:2502.15672, 2025

  80. [80]

    J. Seo, K. Nakamura, and A. Bajcsy. Uncertainty-aware latent safety filters for avoiding out-of-distribution failures. In Conference on Robot Learning (CoRL), 2025. URL https://openreview.net/forum?id=CQKxhmLobo

Showing first 80 references.