pith. sign in

arxiv: 2605.28230 · v1 · pith:EA6RTEWOnew · submitted 2026-05-27 · 💻 cs.CV

Proprio: Latent Self-Scoring and Inference-Time Refinement for Physically Plausible Video Generation

Pith reviewed 2026-06-29 12:52 UTC · model grok-4.3

classification 💻 cs.CV
keywords video generationphysical plausibilityflow residualslatent perturbationsself-scoringinference-time refinementbest-of-N selectiongradient refinement
0
0 comments X

The pith

Frozen video generators contain internal flow-residual signals that can score and refine physical plausibility at inference time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Proprio, a training-free method that lets a frozen video model judge its own outputs for physical realism. It measures the model's flow residual under small controlled changes to the input latents; videos that fit the model's learned dynamics produce smaller and steadier residuals. These residuals are aggregated over time steps and perturbations, masked to moving regions, and then used either to pick the best sample from several candidates or to guide a gradient-based refinement step. The approach improves standard physical-plausibility benchmarks on both text-to-video and image-to-video tasks and beats scoring methods that rely on external vision-language models or separate world models. Human raters also prefer the Proprio-chosen or refined videos for physical correctness in roughly two-thirds of direct comparisons.

Core claim

Proprio treats the model's flow residual under controlled latent perturbations as a self-scoring signal. Samples that are better explained by the generator's learned dynamics induce smaller and more stable residuals. Aggregating this signal across timesteps and perturbations, focusing it on motion-relevant regions with a dynamic spatiotemporal mask, and using it for best-of-N search, gradient-based self-refinement, or both yields consistent gains in physical plausibility on text-to-video and image-to-video benchmarks.

What carries the argument

Flow residual under controlled latent perturbations, aggregated and masked as an internal self-scoring signal for physical alignment.

If this is right

  • Raises Physics-IQ from 32.2 to 37.5 and VideoPhy2-hard physical commonsense from 45.6 to 55.0 on the tested generators.
  • Outperforms both VLM-based scoring and external world-model baselines in several benchmark settings.
  • Produces videos that human raters judge more physically plausible in about two-thirds of pairwise comparisons.
  • Works on both text-to-video and image-to-video tasks without any additional training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same residual signal could be checked on generators trained on synthetic data lacking real physics to test whether the method still selects plausible outputs.
  • If the signal proves architecture-independent, it might serve as a lightweight internal consistency check for other generative tasks such as 3D scene synthesis.
  • Applying the perturbation schedule at different noise levels could reveal whether the physical signal is strongest at particular stages of the denoising process.
  • The method leaves open whether the residual signal can be used to steer sampling during generation rather than only after full samples are produced.

Load-bearing premise

That smaller and more stable flow residuals under latent perturbations indicate better alignment with real physical dynamics rather than the generator's own training biases or artifacts.

What would settle it

A test in which videos containing clear, measurable physical violations (such as object interpenetration or gravity reversal) consistently produce smaller residuals than physically correct videos.

Figures

Figures reproduced from arXiv: 2605.28230 by Alexandre Alahi, Kaouther Messaoud, Mariam Hassan, Wuyang Li.

Figure 1
Figure 1. Figure 1: Proprio enables a video generator to evaluate and improve its own generation. Left: qualitative refinement comparisons for TurboWan2.2. Right: Proprio computes a latent self-score from the model’s denoising residual and uses it for either sample selection or inference-time refinement. Abstract Modern video generative models produce visually impressive results, yet frequently violate basic physical principl… view at source ↗
Figure 2
Figure 2. Figure 2: Proprio method overview. For a generated latent video, Proprio computes a per-video self-scoring signal by perturbing the latent at multiple timesteps, measuring denoising residuals with the frozen generator, weighting timesteps by inverse variance, and emphasizing informative motion regions with a dynamic spatiotemporal mask. The resulting score can either be used for ranking and selection, self-refining … view at source ↗
Figure 3
Figure 3. Figure 3: Human preference assessing visual quality and physical plausibility. To complement benchmark-based evaluation, we con￾duct a pairwise human evaluation study to test whether Proprio’s self-score agrees with human judgments. We sample 64 video pairs for search and refinement settings for both T2V and I2V, and collect 330 responses. In the search setting, each pair compares a low-score and high￾score Proprio … view at source ↗
Figure 4
Figure 4. Figure 4: Effect of noise levels at different timestep ranges on Physics-IQ for Wan2.2 5b. VideoPhy2 Range Method SA PC Joint All noise (0.2–0.8) Sglobal 71.66 60.56 51.66 Smotion 69.44 61.66 51.11 Mid noise (0.3–0.6) Sglobal 67.04 64.25 51.40 Smotion 67.03 66.48 56.42 Sglobal + var 67.59 64.80 52.51 Smotion + var 66.48 68.16 55.31 [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Human preference results split by gener [PITH_FULL_IMAGE:figures/full_fig_p015_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: A screenshot from the human evaluation study showing full instructions and layout used. [PITH_FULL_IMAGE:figures/full_fig_p016_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Overall qualitative results comparing our Proprio method against baseline TurboWan2.2. [PITH_FULL_IMAGE:figures/full_fig_p021_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Overall qualitative results comparing our Proprio method against baseline TurboWan2.2. [PITH_FULL_IMAGE:figures/full_fig_p021_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Overall qualitative results comparing our Proprio method against baseline TurboWan2.2. [PITH_FULL_IMAGE:figures/full_fig_p022_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Overall qualitative results comparing our Proprio method against baseline TurboWan2.2. [PITH_FULL_IMAGE:figures/full_fig_p022_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Correctly preferred diagnostic examples. Each pair shows a ground-truth Physics-IQ video and a generated Turbo Wan2.2 sample with a visible physical failure. These examples are drawn from the subset where the dynamic Proprio self-score correctly prefers the ground-truth video, illustrating that the generator-native residual can favor natural video dynamics over physically implausible generations. 23 [PIT… view at source ↗
Figure 12
Figure 12. Figure 12: Correctly preferred diagnostic examples. Each pair shows a ground-truth Physics-IQ video and a generated Turbo Wan2.2 sample with a visible physical failure. These examples are drawn from the subset where the Proprio self-score correctly prefers the ground-truth video, illustrating that the generator-native residual can favor natural video dynamics over its own physically implausible generations. 24 [PIT… view at source ↗
read the original abstract

Modern video generative models produce visually impressive results, yet frequently violate basic physical principles. We propose Proprio, a training-free framework that enables a frozen video generator to assess and improve the physical plausibility of its own outputs. Inspired by proprioception, the biological sense of one's own movement, Proprio treats the model's flow residual under controlled latent perturbations as a self-scoring signal. Samples that are better explained by the generator's learned dynamics induce smaller and more stable residuals. We aggregate this signal across timesteps and perturbations, focus it on motion-relevant regions with a dynamic spatiotemporal mask, and use it for best-of-N search, gradient-based self-refinement, or both. Across text-to-video and image-to-video benchmarks, Proprio consistently improves physical plausibility, outperforming VLM-based scoring, and external world-model baselines in several settings. With TurboWan2.2, Proprio improves Physics-IQ from 32.2 to 37.5 (+16.5%) and VideoPhy2-hard physical commonsense from 45.6 to 55.0 (+20.6%). Human evaluation further shows that raters prefer Proprio-selected or refined videos for physical plausibility in roughly two-thirds of comparisons. These results suggest that frozen video generators contain actionable internal signals for evaluating and improving the physical plausibility of their own outputs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes Proprio, a training-free framework for frozen video generators that uses the model's own flow residuals under controlled latent perturbations as a self-scoring signal for physical plausibility. Samples better aligned with the generator's learned dynamics produce smaller, more stable residuals; these are aggregated across timesteps and perturbations, masked to motion regions, and applied via best-of-N selection or gradient refinement. Experiments on text-to-video and image-to-video benchmarks report gains over VLM-based scoring and external world-model baselines, including Physics-IQ rising from 32.2 to 37.5 and VideoPhy2-hard from 45.6 to 55.0, plus human preference for physical plausibility in roughly two-thirds of pairwise comparisons.

Significance. If the residual signal genuinely tracks physical dynamics rather than generator-internal consistency, the work offers a practical inference-time route to more plausible video synthesis without retraining or external supervisors. The training-free design and use of internal dynamics are strengths, as are the reported benchmark lifts and human-study results. However, the absence of direct validation that residuals correlate with real physical violations (as opposed to training-distribution artifacts) limits the strength of the central claim.

major comments (3)
  1. [Abstract] Abstract: The central assertion that smaller and more stable flow residuals indicate better physical plausibility rests on the premise that the signal distinguishes real-world physics violations from in-distribution artifacts, yet no controlled test (e.g., synthetic videos with explicit gravity or collision errors) is reported to establish this correlation independent of the generator's biases.
  2. [Abstract] Quantitative results (abstract): Improvements such as Physics-IQ +16.5% and VideoPhy2-hard +20.6% are stated without error bars, statistical significance, or ablation tables isolating the contribution of the dynamic mask, perturbation schedule, or aggregation method; this weakens the claim of consistent outperformance.
  3. [§3] Method description (abstract and §3): The scoring signal is derived entirely from the frozen generator's internal flow dynamics under self-induced perturbations, creating a self-referential loop; the manuscript provides no external physics oracle or cross-generator transfer experiment to confirm the residual measures physical fidelity rather than model-specific consistency.
minor comments (1)
  1. [Abstract] The abstract mentions 'dynamic spatiotemporal mask' without specifying its exact formulation or sensitivity analysis; a brief equation or pseudocode would clarify how motion-relevant regions are isolated.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the training-free design and reported benchmark gains. We address each major comment below with honest assessment of the manuscript's current evidence and planned revisions. The responses focus on substance and aim to strengthen the central claims where possible.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central assertion that smaller and more stable flow residuals indicate better physical plausibility rests on the premise that the signal distinguishes real-world physics violations from in-distribution artifacts, yet no controlled test (e.g., synthetic videos with explicit gravity or collision errors) is reported to establish this correlation independent of the generator's biases.

    Authors: We agree that a controlled experiment using synthetic videos with explicit, isolated physics violations (e.g., gravity or collision errors) would provide the strongest direct evidence that residuals track physical fidelity rather than generator-specific artifacts. Our validation currently rests on Physics-IQ and VideoPhy2 benchmarks, which are constructed around physical commonsense violations, together with human preference results (approximately two-thirds favoring Proprio). We will add an explicit limitations paragraph in the revision discussing this distinction and the self-referential nature of the signal. A full synthetic-video experiment is not feasible within the current experimental budget but could be noted as future work. revision: partial

  2. Referee: [Abstract] Quantitative results (abstract): Improvements such as Physics-IQ +16.5% and VideoPhy2-hard +20.6% are stated without error bars, statistical significance, or ablation tables isolating the contribution of the dynamic mask, perturbation schedule, or aggregation method; this weakens the claim of consistent outperformance.

    Authors: The referee is correct that the abstract numbers are presented without accompanying error bars or significance tests. The full manuscript contains component ablations, but these are not summarized in the abstract and lack statistical reporting. In the revision we will (1) add error bars and significance statements to the abstract claims, (2) expand the ablation table to isolate the dynamic mask, perturbation schedule, and aggregation choices, and (3) ensure all quantitative statements in the abstract are directly supported by the main-text results. revision: yes

  3. Referee: [§3] Method description (abstract and §3): The scoring signal is derived entirely from the frozen generator's internal flow dynamics under self-induced perturbations, creating a self-referential loop; the manuscript provides no external physics oracle or cross-generator transfer experiment to confirm the residual measures physical fidelity rather than model-specific consistency.

    Authors: The self-referential design is deliberate: the method is intended to operate without any external physics oracle or additional trained models, which is the core practical advantage. Validation occurs via external benchmarks (Physics-IQ, VideoPhy2) and human raters who judge physical plausibility independently of the generator. A cross-generator transfer study would be informative but lies outside the stated scope of demonstrating utility on a single frozen model. We will revise §3 to more explicitly state this design choice and its relation to the training-free constraint. revision: no

Circularity Check

0 steps flagged

No significant circularity: internal signal validated on external benchmarks

full rationale

The derivation defines a scoring signal from the frozen model's own flow residuals under latent perturbations and uses it for selection or refinement. This is not circular because the central claim of improved physical plausibility is tested against independent external benchmarks (Physics-IQ, VideoPhy2-hard, human raters) rather than reducing to a self-definition or fitted input by construction. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the derivation chain. The assumption that the signal tracks real physics is an empirical hypothesis, not a tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on one key domain assumption and no free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption Flow residuals under latent perturbations serve as a proxy for physical plausibility because samples better explained by the generator's learned dynamics produce smaller residuals.
    This premise is invoked to justify both the scoring signal and its use for best-of-N search and gradient refinement.

pith-pipeline@v0.9.1-grok · 5779 in / 1257 out tokens · 43982 ms · 2026-06-29T12:52:26.843900+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

44 extracted references · 22 canonical work pages · 12 internal anchors

  1. [1]

    World Simulation with Video Foundation Models for Physical AI

    Arslan Ali, Junjie Bai, Maciej Bala, Yogesh Balaji, Aaron Blakeman, Tiffany Cai, Jiaxin Cao, Tianshi Cao, Elizabeth Cha, Yu-Wei Chao, et al. World simulation with video foundation models for physical ai.arXiv preprint arXiv:2511.00062, 2025

  2. [2]

    Improving geo-diversity of generated images with contextualized vendi score guidance

    Reyhane Askari Hemmat, Melissa Hall, Alicia Sun, Candace Ross, Michal Drozdzal, and Adriana Romero- Soriano. Improving geo-diversity of generated images with contextualized vendi score guidance. In Proceedings of the European Conference on Computer Vision (ECCV), 2024

  3. [3]

    V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-jepa 2: Self-supervised video models enable understanding, prediction and planning.arXiv preprint arXiv:2506.09985, 2025

  4. [4]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

  5. [5]

    Videophy-2: A challenging action-centric physical commonsense evaluation in video generation.arXiv preprint arXiv:2503.06800, 2025

    Hritik Bansal, Clark Peng, Yonatan Bitton, Roman Goldenberg, Aditya Grover, and Kai-Wei Chang. Videophy-2: A challenging action-centric physical commonsense evaluation in video generation.arXiv preprint arXiv:2503.06800, 2025

  6. [6]

    Lumiere: A space-time diffusion model for video generation.arXiv preprint arXiv:2401.12945, 2024

    Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Guanghui Liu, Amit Raj, Yuanzhen Li, Michael Rubinstein, Tomer Michaeli, Oliver Wang, Deqing Sun, Tali Dekel, and Inbar Mosseri. Lumiere: A space-time diffusion model for video generation.arXiv preprint arXiv:2401.12945, 2024

  7. [7]

    Revisiting Feature Prediction for Learning Visual Representations from Video

    Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual representations from video. arXiv preprint arXiv:2404.08471, 2024

  8. [8]

    Video generation models as world simulators

    Tim Brooks, William Peebles, Aditya Ramesh, et al. Video generation models as world simulators. OpenAI technical report, 2024. Online; accessed from OpenAI

  9. [9]

    PhyGDPO: Physics-Aware Groupwise Direct Preference Optimization for Physically Consistent Text-to-Video Generation

    Yuanhao Cai, Kunpeng Li, Menglin Jia, Jialiang Wang, Junzhe Sun, Feng Liang, Weifeng Chen, Felix Juefei-Xu, Chu Wang, Ali Thabet, et al. Phygdpo: Physics-aware groupwise direct preference optimization for physically consistent text-to-video generation.arXiv preprint arXiv:2512.24551, 2025

  10. [10]

    Increasing the utility of synthetic images through chamfer guidance

    Nicola Dall’Asen, Xiaofeng Zhang, Reyhane Askari Hemmat, Melissa Hall, Jakob Verbeek, Adriana Romero-Soriano, and Michal Drozdzal. Increasing the utility of synthetic images through chamfer guidance. arXiv preprint arXiv:2508.10631, 2025

  11. [11]

    Initno: Boosting text-to- image diffusion models via initial noise optimization

    Xiefan Guo, Jinlin Liu, Miaomiao Cui, Jiankai Li, Hongyu Yang, and Di Huang. Initno: Boosting text-to- image diffusion models via initial noise optimization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  12. [12]

    LTX-Video: Realtime Video Latent Diffusion

    Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, Poriya Panet, Sapir Weissbuch, Victor Kulikov, Yaki Bitterman, Zeev Melumian, and Ofir Bibi. Ltx-video: Realtime video latent diffusion.arXiv preprint arXiv:2501.00103, 2025

  13. [13]

    Factorized video generation: Decoupling scene construction and temporal synthesis in text-to-video diffusion models.arXiv preprint arXiv:2512.16371, 2025

    Mariam Hassan, Bastien Van Delft, Wuyang Li, and Alexandre Alahi. Factorized video generation: Decoupling scene construction and temporal synthesis in text-to-video diffusion models.arXiv preprint arXiv:2512.16371, 2025

  14. [14]

    Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

  15. [15]

    VBench: Comprehensive benchmark suite for video generative models

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench: Comprehensive benchmark suite for video generative models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogniti...

  16. [16]

    Diffusion tree sampling: Scalable inference-time alignment of diffusion models

    Vineet Jain, Kusha Sareen, Mohammad Pedramfar, and Siamak Ravanbakhsh. Diffusion tree sampling: Scalable inference-time alignment of diffusion models. InSecond Workshop on Test-Time Adaptation: Putting Updates to the Test! at ICML 2025

  17. [17]

    Self-Refining Video Sampling

    Sangwon Jang, Taekyung Ki, Jaehyeong Jo, Saining Xie, Jaehong Yoon, and Sung Ju Hwang. Self-refining video sampling.arXiv preprint arXiv:2601.18577, 2026

  18. [18]

    How far is video generation from world model: A physical law perspective

    Bingyi Kang, Yang Yue, Rui Lu, Zhijie Lin, Yang Zhao, Kaixin Wang, Gao Huang, and Jiashi Feng. How far is video generation from world model: A physical law perspective. InInternational Conference on Machine Learning, pages 28991–29017. PMLR, 2025

  19. [19]

    Videopoet: A large language model for zero-shot video generation

    Dan Kondratyuk, Lijun Yu, Xin Zhao, Mingxing Meng, Chenlin Yan, Alexandre Drouin, Maor Yi, Lijian Luo, Liangyu Gui, Zhenyao Wang, Wilson Sean, Cordelia Schmid, David Ross, Kihyuk Sohn, and Lu Jiang. Videopoet: A large language model for zero-shot video generation. InProceedings of the 41st International Conference on Machine Learning, pages 25105–25124, 2024

  20. [20]

    Your diffusion model is secretly a zero-shot classifier

    Alexander C Li, Mihir Prabhudesai, Shivam Duggal, Ellis Brown, and Deepak Pathak. Your diffusion model is secretly a zero-shot classifier. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2206–2217, 2023

  21. [21]

    Derivative-free guidance in continuous and discrete diffusion models with soft value-based decoding.arXiv preprint arXiv:2408.08252, 2024

    Xiner Li, Yulai Zhao, Chenyu Wang, Gabriele Scalia, Gokcen Eraslan, Surag Nair, Tommaso Biancalani, Shuiwang Ji, Aviv Regev, Sergey Levine, and Masatoshi Uehara. Derivative-free guidance in continuous and discrete diffusion models with soft value-based decoding.arXiv preprint arXiv:2408.08252, 2024

  22. [22]

    Dynamic search for inference-time alignment in diffusion models.arXiv preprint arXiv:2503.02039, 2025

    Xiner Li, Masatoshi Uehara, Xingyu Su, Gabriele Scalia, Tommaso Biancalani, Aviv Regev, Sergey Levine, and Shuiwang Ji. Dynamic search for inference-time alignment in diffusion models.arXiv preprint arXiv:2503.02039, 2025

  23. [23]

    Beyond VLM-Based Rewards: Diffusion-Native Latent Reward Modeling

    Gongye Liu, Bo Yang, Yida Zhi, Zhizhou Zhong, Lei Ke, Didan Deng, Han Gao, Yongxiang Huang, Kaihao Zhang, Hongbo Fu, et al. Beyond vlm-based rewards: Diffusion-native latent reward modeling. arXiv preprint arXiv:2602.11146, 2026

  24. [24]

    Physgen: Rigid-body physics- grounded image-to-video generation

    Shaowei Liu, Zhongzheng Ren, Saurabh Gupta, and Shenlong Wang. Physgen: Rigid-body physics- grounded image-to-video generation. InEuropean Conference on Computer Vision, pages 360–378. Springer, 2024

  25. [25]

    Flow straight and fast: Learning to generate and transfer data with rectified flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. InThe Eleventh International Conference on Learning Representations (ICLR), 2023

  26. [26]

    Gpt4motion: Scripting physical motions in text-to-video generation via blender-oriented gpt planning

    Jiaxi Lv, Yi Huang, Mingfu Yan, Jiancheng Huang, Jianzhuang Liu, Yifan Liu, Yafei Wen, Xiaoxin Chen, and Shifeng Chen. Gpt4motion: Scripting physical motions in text-to-video generation via blender-oriented gpt planning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024

  27. [27]

    Scaling inference time compute for diffusion models

    Nanye Ma, Shangyuan Tong, Haolin Jia, Hexiang Hu, Yu-Chuan Su, Mingda Zhang, Xuan Yang, Yandong Li, Tommi Jaakkola, Xuhui Jia, and Saining Xie. Scaling inference time compute for diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2523–2534, 2025

  28. [28]

    Video generation models are good latent reward models.arXiv preprint arXiv:2511.21541, 2025

    Xiaoyue Mi, Wenqing Yu, Jiesong Lian, Shibo Jie, Ruizhe Zhong, Zijun Liu, Guozhen Zhang, Zixiang Zhou, Zhiyong Xu, Yuan Zhou, et al. Video generation models are good latent reward models.arXiv preprint arXiv:2511.21541, 2025

  29. [29]

    Do generative video mod- els understand physical principles? InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 948–958, 2026

    Saman Motamed, Laura Culp, Kevin Swersky, Priyank Jaini, and Robert Geirhos. Do generative video mod- els understand physical principles? InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 948–958, 2026

  30. [30]

    Extracting reward functions from diffusion models

    Felipe Nuti, Tim Franzmeyer, and João F Henriques. Extracting reward functions from diffusion models. Advances in Neural Information Processing Systems, 36:50196–50220, 2023

  31. [31]

    Movie Gen: A Cast of Media Foundation Models

    Adam Polyak, Yossi Adi, Leshem Choshen, Tavi Halperin, Tovi Ventura, Xi Zhang, Boliang Wang, Peng Yin, Dong Xu, Shir Gur, et al. Movie gen: A cast of media foundation models.arXiv preprint arXiv:2410.13720, 2024

  32. [32]

    Video motion transfer with diffusion transformers

    Alexander Pondaven, Aliaksandr Siarohin, Sergey Tulyakov, Philip Torr, and Fabio Pizzati. Video motion transfer with diffusion transformers. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 22911–22921, 2025. 11

  33. [33]

    Inference- time alignment of diffusion models with direct noise optimization

    Zhiwei Tang, Jiangweizhi Peng, Jiasheng Tang, Mingyi Hong, Fan Wang, and Tsung-Hui Chang. Inference- time alignment of diffusion models with direct noise optimization. InForty-second International Confer- ence on Machine Learning

  34. [34]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

  35. [35]

    Wisa: World simulator assistant for physics-aware text-to-video generation

    Jing Wang, Ao Ma, Ke Cao, Jun Zheng, Jiasong Feng, Zhanjie Zhang, Wanyuan Pang, and Xiaodan Liang. Wisa: World simulator assistant for physics-aware text-to-video generation. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems

  36. [36]

    HunyuanVideo 1.5 Technical Report

    Bing Wu, Chang Zou, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, Jack Peng, Jianbing Wu, Jiangfeng Xiong, Jie Jiang, et al. Hunyuanvideo 1.5 technical report.arXiv preprint arXiv:2511.18870, 2025

  37. [37]

    Vlipp: Towards physically plausible video generation with vision and language informed physical prior

    Xindi Yang, Baolu Li, Yiming Zhang, Zhenfei Yin, Lei Bai, Liqian Ma, Zhiyong Wang, Jianfei Cai, Tien-Tsin Wong, Huchuan Lu, et al. Vlipp: Towards physically plausible video generation with vision and language informed physical prior. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 12360–12370, 2025

  38. [38]

    CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer.arXiv preprint arXiv:2408.06072, 2024

  39. [39]

    Tfg: Unified training-free guidance for diffusion models

    Haotian Ye, Haowei Lin, Jiaqi Han, Minkai Xu, Sheng Liu, Yitao Liang, Jianzhu Ma, James Zou, and Stefano Ermon. Tfg: Unified training-free guidance for diffusion models. InAdvances in Neural Information Processing Systems, 2024

  40. [40]

    Likephys: Evaluating intuitive physics understanding in video diffusion models via likelihood preference.arXiv preprint arXiv:2510.11512, 2025

    Jianhao Yuan, Fabio Pizzati, Francesco Pinto, Lars Kunze, Ivan Laptev, Paul Newman, Philip Torr, and Daniele De Martini. Likephys: Evaluating intuitive physics understanding in video diffusion models via likelihood preference.arXiv preprint arXiv:2510.11512, 2025

  41. [41]

    Inference-time physics alignment of video generative models with latent world models.arXiv preprint arXiv:2601.10553, 2026

    Jianhao Yuan, Xiaofeng Zhang, Felix Friedrich, Nicolas Beltran-Velez, Melissa Hall, Reyhane Askari- Hemmat, Xiaochuang Han, Nicolas Ballas, Michal Drozdzal, and Adriana Romero-Soriano. Inference-time physics alignment of video generative models with latent world models.arXiv preprint arXiv:2601.10553, 2026

  42. [42]

    Turbodiffusion: Accelerating video diffusion models by 100-200 times.arXiv preprint arXiv:2512.16093, 2025

    Jintao Zhang, Kaiwen Zheng, Kai Jiang, Haoxu Wang, Ion Stoica, Joseph E Gonzalez, Jianfei Chen, and Jun Zhu. Turbodiffusion: Accelerating video diffusion models by 100-200 times.arXiv preprint arXiv:2512.16093, 2025

  43. [43]

    Videorepa: Learning physics for video generation through relational alignment with foundation models

    Xiangdong Zhang, Jiaqi Liao, Shaofeng Zhang, Fanqing Meng, Xiangpeng Wan, Junchi Yan, and Yu Cheng. Videorepa: Learning physics for video generation through relational alignment with foundation models. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems,

  44. [44]

    Inference-time scaling of diffusion models through classical search

    XiangCheng Zhang, Haowei Lin, Haotian Ye, James Zou, Jianzhu Ma, Yitao Liang, and Yilun Du. Inference-time scaling of diffusion models through classical search. InNeurIPS 2025 Workshop on Structured Probabilistic Inference{\&}Generative Modeling, . 12 A Additional experiments and ablations A.1 Diagnostic Study: Preference for natural video dynamics To tes...