arxiv: 2604.08503 · v3 · pith:QVJRKQ34new · submitted 2026-04-09 · 💻 cs.CV

Phantom: Physics-Infused Video Generation via Joint Modeling of Visual and Latent Physical Dynamics

Pith reviewed 2026-05-21 09:14 UTC · model grok-4.3

classification 💻 cs.CV
keywords video generation · physical consistency · latent dynamics · physics-infused modeling · generative models · computer vision · video prediction

The pith

Phantom produces videos that are both visually realistic and physically consistent by jointly modeling visual frames and latent physical dynamics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Phantom to address the physical inconsistencies that arise when video generation models are trained only on visual data. It does so by learning a shared physics-aware video representation that embeds latent physical states and then using that representation to predict both future frames and their dynamics at the same time. The approach requires no hand-specified physics equations or object properties. A sympathetic reader would care because many practical uses of generated video, from simulation to planning, break down when motion violates basic physical rules. The central bet is that a single learned embedding can carry enough information about hidden dynamics to enforce consistency during generation.

Core claim

Phantom jointly models visual content and latent physical dynamics by conditioning on observed frames and inferred physical states, then simultaneously predicts future latent dynamics and generates the corresponding video frames. It relies on a physics-aware video representation that functions as an abstract embedding of the underlying physics, enabling this joint prediction without any explicit specification of physical laws or properties.
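
The page does not state the training objective. As a hedged sketch only, the joint prediction described above can be written as a two-branch conditional flow-matching loss, using the standard linear-interpolant form and the "two parallel latent flow-matching branches" wording of the Figure 2 caption. The symbols are assumptions introduced here: $x_1$ and $z_1$ are future video and physics latents, $x_0$ and $z_0$ are noise samples, $c$ is the conditioning (observed frames, inferred physical states, text), and $\lambda$ is a hypothetical weighting term.

```latex
% Assumed linear interpolation paths for the two branches
x_t = (1 - t)\,x_0 + t\,x_1, \qquad z_t = (1 - t)\,z_0 + t\,z_1, \qquad t \sim \mathcal{U}[0, 1]

% Assumed joint objective: each branch regresses its own velocity,
% but both see the noisy state of the other branch
\mathcal{L}(\theta) = \mathbb{E}_{t,\,x_0,\,z_0,\,x_1,\,z_1}\Big[
    \big\| v^{\mathrm{vid}}_\theta(x_t, z_t, t, c) - (x_1 - x_0) \big\|^2
  + \lambda\, \big\| v^{\mathrm{phys}}_\theta(x_t, z_t, t, c) - (z_1 - z_0) \big\|^2
\Big]
```

Under this reading, "joint" means that both velocity fields take the pair $(x_t, z_t)$ as input, so an error in the physics latents is visible to the video branch and vice versa. Whether the paper uses this exact form, a different weighting, or additional terms is not recoverable from this page.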

What carries the argument

A physics-aware video representation that serves as an abstract yet informative embedding of underlying physics and supports joint prediction of latent dynamics and video frames.
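
The Figure 2 caption says the video and physics branches are coupled by dual cross-attention layers. The following is a minimal sketch of that coupling pattern, not the authors' implementation: it assumes single-head dot-product attention with no learned projections and a plain residual update, none of which the page specifies. The names `cross_attend` and `dual_cross_attention` are hypothetical, and the short lists stand in for real latent tokens.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Single-head dot-product cross-attention (assumption: no learned projections).

    Each query token returns a weighted average of the key/value tokens.
    """
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys_values]
        weights = softmax(scores)
        attended.append(
            [sum(w * kv[d] for w, kv in zip(weights, keys_values)) for d in range(dim)]
        )
    return attended


def dual_cross_attention(video_tokens, physics_tokens):
    """Let each branch read from the other, then add the result residually."""
    video_ctx = cross_attend(video_tokens, physics_tokens)
    physics_ctx = cross_attend(physics_tokens, video_tokens)
    new_video = [[v + c for v, c in zip(tok, ctx)] for tok, ctx in zip(video_tokens, video_ctx)]
    new_physics = [[p + c for p, c in zip(tok, ctx)] for tok, ctx in zip(physics_tokens, physics_ctx)]
    return new_video, new_physics


# Toy usage: two video tokens, one physics token, feature dimension 2.
# With a single physics token, every video token receives that token unchanged.
video = [[0.0, 0.0], [2.0, 2.0]]
physics = [[1.0, 0.0]]
updated_video, updated_physics = dual_cross_attention(video, physics)
```

The point of the sketch is the symmetry: information flows in both directions within one layer, so neither branch is a passive conditioning signal for the other.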

If this is right

  • Video sequences produced by the model will adhere more closely to physical laws than those from standard generative baselines.
  • The same representation used for generation can be queried to recover latent physical quantities such as velocity or forces.
  • Performance gains appear on both perceptual quality metrics and physics-specific benchmarks without trading one for the other.
  • The joint modeling loop allows the system to correct its own future predictions using the evolving latent physical state.
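
The second bullet, that the representation can be queried for quantities such as velocity, is usually tested with a linear probe. A minimal sketch under loud assumptions: each latent is reduced to one scalar coordinate, ground-truth velocities are available from some external source, and `fit_linear_probe` is a hypothetical helper, not something the paper or this page defines.

```python
def fit_linear_probe(features, targets):
    """Ordinary least squares for one scalar feature: target ~ w * feature + b.

    Assumes the features are not all identical (nonzero variance).
    """
    n = len(features)
    mean_x = sum(features) / n
    mean_y = sum(targets) / n
    var_x = sum((x - mean_x) ** 2 for x in features)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(features, targets))
    w = cov_xy / var_x
    b = mean_y - w * mean_x
    return w, b


# Toy usage: hypothetical latent coordinates and matching velocities.
latent_coordinate = [0.0, 1.0, 2.0, 3.0]
true_velocity = [1.0, 3.0, 5.0, 7.0]
w, b = fit_linear_probe(latent_coordinate, true_velocity)
predicted = [w * x + b for x in latent_coordinate]
```

A probe that fits well on held-out clips would support the bullet; a probe that fits only on training clips would not.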

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same joint-modeling pattern could be applied to other modalities where consistency matters, such as 3D scene synthesis or audio generation.
  • Long-horizon video prediction might improve because the latent dynamics provide a memory that prevents drift from physical rules.
  • The learned representation could serve as a drop-in physics prior for downstream tasks like robotic planning or video editing.

Load-bearing premise

A single learned physics-aware video representation can capture enough information about hidden physical dynamics to support accurate joint prediction of future states and frames without any explicit physical model.

What would settle it

Generated video sequences in which objects visibly violate conservation laws or collision rules even though the model reports consistent latent dynamics.
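
One way to make such a check concrete for the bouncing-ball case in Figure 1 is to test whether a passively bouncing object ever rebounds higher than its previous apex, which would mean it gained energy. This is an illustrative sketch only: it assumes per-frame heights from some external tracker, no external forces, and hypothetical helper names (`apex_heights`, `gains_energy`). It is not a benchmark used by the paper.

```python
def apex_heights(heights):
    """Return interior local maxima of a height trajectory."""
    return [
        heights[i]
        for i in range(1, len(heights) - 1)
        if heights[i - 1] < heights[i] > heights[i + 1]
    ]


def gains_energy(heights, tolerance=0.0):
    """True if any rebound apex exceeds the previous one by more than `tolerance`.

    Assumption: the object bounces passively, so apex height may only decrease.
    """
    apexes = apex_heights(heights)
    return any(later > earlier + tolerance for earlier, later in zip(apexes, apexes[1:]))


# Toy trajectories (arbitrary units, one value per frame).
plausible = [5, 3, 0, 2, 3, 2, 0, 1, 2, 1, 0]    # rebound apexes 3 then 2
implausible = [5, 3, 0, 2, 3, 2, 0, 3, 4, 3, 0]  # rebound apexes 3 then 4
```

Here `gains_energy(plausible)` is `False` and `gains_energy(implausible)` is `True`. The falsifying case described above is a video flagged by a check of this kind while the model's own latent dynamics look consistent.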

Figures

Figures reproduced from arXiv: 2604.08503 by Ismini Lourentzou, Jerry Xiong, Tianjiao Yu, Ying Shen.

Figure 1: Comparison between the base model Wan2.2-TI2V [38] and our Phantom model in both (a) text-to-video and (b) text/image-to-video generation. Red boxes mark the conditioning frame. In (a), Wan2.2-TI2V fails to model the correct bouncing dynamics of the falling ball: after hitting the ground, the ball unnaturally loses all momentum and stops abruptly. In contrast, Phantom produces a physically plausible sequen…

Figure 2: Phantom Overview. Phantom consists of two parallel latent flow-matching branches: the video branch and physics branch. These branches jointly model future visual and physical dynamics, i.e., the video branch (white) predicts future visual trajectories, while the physics branch (teal) predicts the evolution of latent physical states. Dual cross-attention layers tightly couple these branches, allowing physic…

Figure 3: Qualitative comparison between Wan2.2-TI2V [38] and our Phantom across diverse text-to-video and text/image-to-video scenarios. Red boxes indicate the conditioning frames. For prompts involving diverse physical processes, such as deformation, pouring, buoyancy, and viscous flow, Phantom produces motion that matches the requested behavior, while Wan2.2-TI2V often fails to follow the prompt or violates basic…

Figure 4: Qualitative Comparison on Text-/Image-to-Video Generation. The conditional frame is marked in red box

Figure 5: Qualitative Comparison on Text-to-Video Generation

Figure 6: Examples of Force-conditioned Video Generation using Phantom. The conditional frame is marked in red box
read the original abstract

Recent advances in generative video modeling, driven by large-scale datasets and powerful architectures, have yielded remarkable visual realism. However, emerging evidence suggests that simply scaling data and model size does not endow these systems with an understanding of the underlying physical laws that govern real-world dynamics. Existing approaches often fail to capture or enforce such physical consistency, resulting in unrealistic motion and dynamics. In his work, we investigate whether integrating the inference of latent physical properties directly into the video generation process can equip models with the ability to produce physically plausible videos. To this end, we propose Phantom, a Physics-Infused Video Generation model that jointly models the visual content and latent physical dynamics. Conditioned on observed video frames and inferred physical states, Phantom jointly predicts latent physical dynamics and generates future video frames. Phantom leverages a physics-aware video representation that serves as an abstract yet informaive embedding of the underlying physics, facilitating the joint prediction of physical dynamics alongside video content without requiring an explicit specification of a complex set of physical dynamics and properties. By integrating the inference of physical-aware video representation directly into the video generation process, Phantom produces video sequences that are both visually realistic and physically consistent. Quantitative and qualitative results on both standard video generation and physics-aware benchmarks demonstrate that Phantom not only outperforms existing methods in terms of adherence to physical dynamics but also delivers competitive perceptual fidelity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes Phantom, a Physics-Infused Video Generation model that jointly models visual content and latent physical dynamics. Conditioned on observed video frames and inferred physical states, it uses a physics-aware video representation as an abstract embedding of underlying physics to jointly predict latent physical dynamics and generate future frames, claiming to produce videos that are both visually realistic and physically consistent while outperforming prior methods on standard and physics-aware benchmarks.

Significance. If the central claims hold with proper validation, the work could meaningfully advance generative video models by embedding implicit physical consistency without explicit equations or supervision, offering a scalable alternative to physics-engine hybrids for applications in robotics simulation and realistic animation.

major comments (2)
  1. [Abstract] Abstract: The claim of quantitative and qualitative superiority on physics-aware benchmarks is asserted without any reported metrics, tables, error bars, dataset splits, or baseline comparisons, rendering the central empirical claim unevaluable from the manuscript.
  2. [Method] Method description: The physics-aware video representation is positioned as the key mechanism that enables joint prediction of dynamics and frames while enforcing physical consistency, yet no architecture details, loss functions, training objectives, or inductive biases are specified that would separate genuine physical encoding (e.g., conservation or causality) from richer visual feature learning; this is load-bearing for the claim that the model goes beyond standard next-frame prediction.
minor comments (1)
  1. [Abstract] Abstract: Typo 'In his work' should read 'In this work'; 'informaive' should read 'informative'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments. We address each major comment below and describe the revisions we will make to strengthen the manuscript.

  1. Referee: [Abstract] Abstract: The claim of quantitative and qualitative superiority on physics-aware benchmarks is asserted without any reported metrics, tables, error bars, dataset splits, or baseline comparisons, rendering the central empirical claim unevaluable from the manuscript.

    Authors: We agree that the abstract's claim would be more readily evaluable with supporting details. The full manuscript reports quantitative results in the Experiments section, including tables with metrics on physics-aware benchmarks, baseline comparisons, dataset splits, and error bars from repeated runs. To address the concern directly, we will revise the abstract to include a brief summary of the key quantitative findings supporting the superiority claim. revision: yes

  2. Referee: [Method] Method description: The physics-aware video representation is positioned as the key mechanism that enables joint prediction of dynamics and frames while enforcing physical consistency, yet no architecture details, loss functions, training objectives, or inductive biases are specified that would separate genuine physical encoding (e.g., conservation or causality) from richer visual feature learning; this is load-bearing for the claim that the model goes beyond standard next-frame prediction.

    Authors: We thank the referee for this important observation. The manuscript's Method section introduces the physics-aware video representation and its role in joint modeling. To more explicitly separate the physical encoding from standard visual feature learning, we will expand this section with additional details on the architecture of the latent physical state inference module, the specific loss functions (visual reconstruction, dynamics prediction, and physical consistency terms), the overall training objective, and the inductive biases such as implicit causality and conservation constraints enforced through the joint latent-visual prediction. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper proposes an architectural model (Phantom) that learns a physics-aware video representation to jointly predict latent dynamics and future frames. The abstract and description contain no equations, fitted parameters, or self-citations that reduce the central claim to its inputs by construction. The representation is learned end-to-end from data with the joint objective stated explicitly as the modeling goal; this is a standard inductive modeling step rather than a tautological redefinition or renamed prediction. No load-bearing uniqueness theorem or ansatz is imported from prior self-work in the provided text. The derivation chain is therefore self-contained as a generative modeling contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Review is abstract-only; the ledger is therefore minimal and reflects only what is explicitly invoked in the abstract.

axioms (1)
  • domain assumption: A latent physics-aware video representation can encode underlying physical dynamics sufficiently to enable joint prediction of future states and frames without explicit physical equations.
    This premise is stated in the abstract as the key enabler of the method.
invented entities (1)
  • physics-aware video representation (no independent evidence)
    purpose: Abstract embedding that captures latent physical properties for joint visual and dynamic prediction
    Introduced in the abstract as the central mechanism; no independent evidence or falsifiable prediction is given.

pith-pipeline@v0.9.0 · 5777 in / 1337 out tokens · 49839 ms · 2026-05-21T09:14:01.171781+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. NEWTON: Agentic Planning for Physically Grounded Video Generation

    cs.CV · 2026-05 · unverdicted · novelty 6.0

    NEWTON improves physical accuracy in video generation by deploying a trainable planner that coordinates physics-aware tools and a verifier, raising joint accuracy on VideoPhy-2 without altering the base generators.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · cited by 1 Pith paper · 6 internal anchors

  1. [1]

    Cosmos World Foundation Model Platform for Physical AI

    Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical ai. arXiv preprint arXiv:2501.03575,

  2. [2]

    Meta movie gen: Ai-powered movie generation,

    Meta AI. Meta movie gen: Ai-powered movie generation,

  3. [3]

    Accessed: 2024-11-24.

  4. [4]

    Building normalizing flows with stochastic interpolants

    Michael Samuel Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants. In International Conference on Learning Representations (ICLR), 2023.

  5. [5]

    V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-jepa 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985, 2025.

  6. [6]

    Videophy: Evaluating physical commonsense for video generation

    Hritik Bansal, Zongyu Lin, Tianyi Xie, Zeshun Zong, Michal Yarom, Yonatan Bitton, Chenfanfu Jiang, Yizhou Sun, Kai-Wei Chang, and Aditya Grover. Videophy: Evaluating physical commonsense for video generation. In International Conference on Learning Representations (ICLR), 2025.

  7. [7]

    Videophy-2: A challenging action-centric physical commonsense evaluation in video generation

    Hritik Bansal, Clark Peng, Yonatan Bitton, Roman Goldenberg, Aditya Grover, and Kai-Wei Chang. Videophy-2: A challenging action-centric physical commonsense evaluation in video generation. In International Conference on Learning Representations (ICLR), 2026.

  8. [8]

    Lumiere: A space-time diffusion model for video generation

    Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Guanghui Liu, Amit Raj, et al. Lumiere: A space-time diffusion model for video generation. In SIGGRAPH Asia 2024 Conference Papers, pages 1–11, 2024.

  9. [9]

    Videojam: Joint appearance-motion representations for enhanced motion generation in video models

    Hila Chefer, Uriel Singer, Amit Zohar, Yuval Kirstain, Adam Polyak, Yaniv Taigman, Lior Wolf, and Shelly Sheynin. Videojam: Joint appearance-motion representations for enhanced motion generation in video models. In International Conference on Machine Learning (ICML). PMLR, 2025.

  10. [10]

    Videocrafter2: Overcoming data limitations for high-quality video diffusion models

    Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter2: Overcoming data limitations for high-quality video diffusion models. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7310–7320, 2024.

  11. [11]

    Flow matching on general geometries

    Ricky TQ Chen and Yaron Lipman. Flow matching on general geometries. In International Conference on Learning Representations (ICLR), 2024.

  12. [12]

    Veo2: Our state-of-the-art video generation model,

    DeepMind. Veo2: Our state-of-the-art video generation model,

  13. [13]

    Accessed: 2025-01-09.

  14. [14]

    Intuitive physics understanding emerges from self-supervised pretraining on natural videos. arXiv preprint arXiv:2502.11831, 2025

    Quentin Garrido, Nicolas Ballas, Mahmoud Assran, Adrien Bardes, Laurent Najman, Michael Rabbat, Emmanuel Dupoux, and Yann LeCun. Intuitive physics understanding emerges from self-supervised pretraining on natural videos. arXiv preprint arXiv:2502.11831, 2025.

  15. [15]

    Recurrent world models facilitate policy evolution. Advances in Neural Information Processing Systems (NeurIPS), 31, 2018

    David Ha and Jürgen Schmidhuber. Recurrent world models facilitate policy evolution. Advances in Neural Information Processing Systems (NeurIPS), 31, 2018.

  16. [16]

    Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems (NeurIPS), 33:6840–6851, 2020

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems (NeurIPS), 33:6840–6851, 2020.

  17. [17]

    Video diffusion models. Advances in Neural Information Processing Systems (NeurIPS), 35:8633–8646, 2022

    Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. Advances in Neural Information Processing Systems (NeurIPS), 35:8633–8646, 2022.

  18. [18]

    Vchain: Chain-of-visual-thought for reasoning in video generation. arXiv preprint arXiv:2510.05094,

    Ziqi Huang, Ning Yu, Gordon Chen, Haonan Qiu, Paul Debevec, and Ziwei Liu. Vchain: Chain-of-visual-thought for reasoning in video generation. arXiv preprint arXiv:2510.05094,

  19. [19]

    How far is video generation from world model: A physical law perspective

    Bingyi Kang, Yang Yue, Rui Lu, Zhijie Lin, Yang Zhao, Kaixin Wang, Gao Huang, and Jiashi Feng. How far is video generation from world model: A physical law perspective. In International Conference on Machine Learning (ICML),

  20. [20]

    Videopoet: A large language model for zero-shot video generation

    Dan Kondratyuk, Lijun Yu, Xiuye Gu, Jose Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birodkar, Jimmy Yan, Ming-Chang Chiu, et al. Videopoet: A large language model for zero-shot video generation. In International Conference on Machine Learning (ICML), pages 25105–25124. PMLR, 2024.

  21. [21]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024.

  22. [22]

    Boosting generative image modeling via joint image-feature synthesis

    Theodoros Kouzelis, Efstathios Karypidis, Ioannis Kakogeorgiou, Spyros Gidaris, and Nikos Komodakis. Boosting generative image modeling via joint image-feature synthesis. In Advances in Neural Information Processing Systems (NeurIPS), 2025.

  23. [23]

    A path towards autonomous machine intelligence version 0.9.2

    Yann LeCun. A path towards autonomous machine intelligence version 0.9.2, 2022-06-27. Open Review, 62(1):1–62, 2022.

  24. [24]

    Flow matching for generative modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In International Conference on Learning Representations (ICLR), 2023.

  25. [25]

    Physgen: Rigid-body physics-grounded image-to-video generation

    Shaowei Liu, Zhongzheng Ren, Saurabh Gupta, and Shenlong Wang. Physgen: Rigid-body physics-grounded image-to-video generation. In European Conference on Computer Vision (ECCV), pages 360–378. Springer, 2024.

  26. [26]

    Flow straight and fast: Learning to generate and transfer data with rectified flow

    Xingchao Liu, Chengyue Gong, et al. Flow straight and fast: Learning to generate and transfer data with rectified flow. In International Conference on Learning Representations (ICLR), 2023.

  27. [27]

    Towards world simulator: Crafting physical commonsense-based benchmark for video generation

    Fanqing Meng, Jiaqi Liao, Xinyu Tan, Quanfeng Lu, Wenqi Shao, Kaipeng Zhang, Yu Cheng, Dianqi Li, and Ping Luo. Towards world simulator: Crafting physical commonsense-based benchmark for video generation. In International Conference on Machine Learning (ICML), pages 43781–43806. PMLR,

  28. [28]

    Motioncraft: Physics-based zero-shot video generation. Advances in Neural Information Processing Systems (NeurIPS), 37:123155–123181, 2024

    Antonio Montanaro, Luca Savant Aira, Emanuele Aiello, Diego Valsesia, and Enrico Magli. Motioncraft: Physics-based zero-shot video generation. Advances in Neural Information Processing Systems (NeurIPS), 37:123155–123181, 2024.

  29. [29]

    Do generative video models understand physical principles?

    Saman Motamed, Laura Culp, Kevin Swersky, Priyank Jaini, and Robert Geirhos. Do generative video models learn physical principles from watching videos? arXiv preprint arXiv:2501.09038, 2025.

  30. [30]

    Openvid-1m: A large-scale high-quality dataset for text-to-video generation

    Kepan Nan, Rui Xie, Penghao Zhou, Tiehan Fan, Zhenheng Yang, Zhijie Chen, Xiang Li, Jian Yang, and Ying Tai. Openvid-1m: A large-scale high-quality dataset for text-to-video generation. In International Conference on Learning Representations (ICLR), 2025.

  31. [31]

    Sora: Openai’s multimodal agent, 2024

    OpenAI. Sora: Openai’s multimodal agent, 2024. Accessed: 2024-11-24.

  32. [32]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In International Conference on Computer Vision (ICCV), pages 4195–4205, 2023.

  33. [33]

    Rdpo: Real data preference optimization for physics consistency video generation. arXiv preprint arXiv:2506.18655, 2025

    Wenxu Qian, Chaoyue Wang, Hou Peng, Zhiyu Tan, Hao Li, and Anxiang Zeng. Rdpo: Real data preference optimization for physics consistency video generation. arXiv preprint arXiv:2506.18655, 2025.

  34. [34]

    Runway: Platform for AI-powered video editing and generative media creation

    Runway Team. Runway: Platform for AI-powered video editing and generative media creation. https://runwayml.com, 2024. Accessed: 2025-05-12.

  35. [35]

    Egoforge: Goal-directed egocentric world simulator. arXiv preprint arXiv:2603.20169, 2026

    Yifan Shen, Jiateng Liu, Xinzhuo Li, Yuanzhe Liu, Bingxuan Li, Houze Yang, Wenqi Jia, Yijiang Li, Tianjiao Yu, James Matthew Rehg, et al. Egoforge: Goal-directed egocentric world simulator. arXiv preprint arXiv:2603.20169, 2026.

  36. [36]

    Denoising diffusion implicit models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations (ICLR), 2021.

  37. [37]

    Score-based generative modeling through stochastic differential equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations (ICLR), 2021.

  38. [38]

    Pyratok: Language-aligned pyramidal tokenizer for video understanding and generation

    Onkar Susladkar, Tushar Prakash, Adheesh Juvekar, Kiet A Nguyen, Dong-Hwan Jang, Inderjit S Dhillon, and Ismini Lourentzou. Pyratok: Language-aligned pyramidal tokenizer for video understanding and generation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2026.

  39. [39]

    Attention is all you need. Advances in Neural Information Processing Systems (NeurIPS), 30, 2017

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems (NeurIPS), 30, 2017.

  40. [40]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang,...

  41. [41]

    WISA: World simulator assistant for physics-aware text-to-video generation

    Jing Wang, Ao Ma, Ke Cao, Jun Zheng, Jiasong Feng, Zhanjie Zhang, Wanyuan Pang, and Xiaodan Liang. WISA: World simulator assistant for physics-aware text-to-video generation. In Advances in Neural Information Processing Systems (NeurIPS), 2025.

  42. [42]

    Lavie: High-quality video generation with cascaded latent diffusion models. International Journal on Computer Vision (IJCV), 133(5):3059–3078, 2025

    Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, et al. Lavie: High-quality video generation with cascaded latent diffusion models. International Journal on Computer Vision (IJCV), 133(5):3059–3078, 2025.

  43. [43]

    Physanimator: Physics-guided generative cartoon animation

    Tianyi Xie, Yiwei Zhao, Ying Jiang, and Chenfanfu Jiang. Physanimator: Physics-guided generative cartoon animation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 10793–10804, 2025.

  44. [44]

    Phyt2v: Llm-guided iterative self-refinement for physics-grounded text-to-video generation

    Qiyao Xue, Xiangyu Yin, Boyuan Yang, and Wei Gao. Phyt2v: Llm-guided iterative self-refinement for physics-grounded text-to-video generation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 18826–18836,

  45. [45]

    Cogvideox: Text-to-video diffusion models with an expert transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. In International Conference on Learning Representations (ICLR), 2025.

  46. [46]

    Representation alignment for generation: Training diffusion transformers is easier than you think

    Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. In International Conference on Learning Representations (ICLR), 2025.

  47. [47]

    Think before you diffuse: Llms-guided physics-aware video generation. arXiv preprint arXiv:2505.21653, 2025

    Ke Zhang, Cihan Xiao, Yiqun Mei, Jiacong Xu, and Vishal M Patel. Think before you diffuse: Llms-guided physics-aware video generation. arXiv preprint arXiv:2505.21653, 2025.

  48. [48]

    VideoREPA: Learning physics for video generation through relational alignment with foundation models

    Xiangdong Zhang, Jiaqi Liao, Shaofeng Zhang, Fanqing Meng, Xiangpeng Wan, Junchi Yan, and Yu Cheng. VideoREPA: Learning physics for video generation through relational alignment with foundation models. In Advances in Neural Information Processing Systems (NeurIPS), 2025.

  49. [49]

    VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness

    Dian Zheng, Ziqi Huang, Hongbo Liu, Kai Zou, Yinan He, Fan Zhang, Lulu Gu, Yuanhan Zhang, Jingwen He, Wei-Shi Zheng, et al. Vbench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness. arXiv preprint arXiv:2503.21755, 2025.

  50. [50]

    Since most physics-focused baselines operate solely in the text-to-video setting, Figure 5 compares Phantom only with general-purpose T2V models.

    D. Physics-based Video Control

    To further evaluate the ability of Phantom to model and respond to explicit physical control signals, we apply our framework to the Force-Prompting dataset 1. Force-Prompting provi...