pith. machine review for the scientific record.

arxiv: 2512.05564 · v2 · submitted 2025-12-05 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

ProPhy: Progressive Physical Alignment for Dynamic World Simulation

Pith reviewed 2026-05-17 01:01 UTC · model grok-4.3

classification 💻 cs.CV
keywords physics-aware video generation · mixture of experts · physical alignment · world simulation · vision-language model transfer · dynamic video generation

The pith

ProPhy produces more physically coherent videos by aligning generation progressively with semantic and token-level physical priors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing video generators often create physically inconsistent outputs for complex dynamics because they apply uniform responses to prompts and overlook localized physical signals. ProPhy tackles this limitation through a Progressive Physical Alignment Framework that supports explicit physics-aware conditioning. The core is a two-stage Mixture-of-Physics-Experts where Semantic Experts derive physical principles from text and Refinement Experts manage detailed dynamics at the token level. A strategy transfers physical reasoning from vision-language models to these refinement experts. Experiments confirm improved realism and coherence on relevant benchmarks.

Core claim

The central claim is that ProPhy, via its two-stage Mixture-of-Physics-Experts mechanism for discriminative physical prior extraction and a physical alignment strategy that transfers capabilities from vision-language models, enables anisotropic generation that better reflects physical laws and outperforms prior methods in producing realistic dynamic videos.

What carries the argument

Two-stage Mixture-of-Physics-Experts mechanism for semantic physical principles and token-level dynamics, with VLM-based physical alignment strategy.
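
To make the two routing granularities concrete, here is a minimal sketch, not the authors' implementation. It assumes a plain softmax router at both stages: one routing distribution per prompt for the Semantic Experts, and an independent top-k choice per video token for the Refinement Experts. All names, dimensions, and random weights (`semantic_route`, `refinement_route`, `w_sem`, `w_ref`) are hypothetical.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(weights, x):
    """Multiply a (rows x cols) weight matrix by a length-cols vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def semantic_route(text_embedding, router_weights):
    """Stage 1 (assumed): one routing distribution per prompt."""
    return softmax(matvec(router_weights, text_embedding))

def refinement_route(token_features, router_weights, top_k=1):
    """Stage 2 (assumed): an independent top-k expert choice per token."""
    assignments = []
    for token in token_features:
        probs = softmax(matvec(router_weights, token))
        ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        assignments.append(ranked[:top_k])
    return assignments

# Hypothetical sizes and random inputs, for illustration only.
rng = random.Random(0)
dim, n_semantic, n_refine, n_tokens = 8, 4, 3, 6
text_emb = [rng.gauss(0, 1) for _ in range(dim)]
tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
w_sem = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_semantic)]
w_ref = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_refine)]

semantic_weights = semantic_route(text_emb, w_sem)  # one distribution for the whole video
token_experts = refinement_route(tokens, w_ref)     # one expert list per token
```

The point of the sketch is the difference in granularity: `semantic_weights` is shared by every token of the video, while `token_experts` can differ from token to token, which is what the review calls anisotropic generation.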

If this is right

  • More realistic handling of large-scale and complex dynamics in generated videos.
  • Fine-grained alignment to localized physical cues rather than isotropic responses.
  • Video representations that more accurately reflect underlying physical laws.
  • More accurate depiction of dynamic physical phenomena through transferred reasoning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Such a model could improve the fidelity of simulated environments used in training autonomous systems.
  • It might open paths to combining this with other sensory inputs for multimodal world models.
  • Testing on out-of-distribution physical events could reveal the true depth of the learned priors.

Load-bearing premise

The two-stage experts and VLM transfer successfully isolate and apply genuine physical knowledge instead of relying on data correlations that mimic physics.

What would settle it

Generate videos from prompts that describe novel physical setups, such as objects interacting under gravity or friction conditions unlikely to appear in the training data, then check whether the motion trajectories match those computed from the governing physical equations.
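
One way such a check might be scored, as a sketch under stated assumptions: object heights have already been tracked per frame and converted to metres (the tracking step is not shown), and the prompt specifies a known gravitational acceleration. The function names and the synthetic "tracked" data are hypothetical.

```python
import math

def ballistic_height(t, y0, v0, g):
    """Height at time t for constant downward acceleration g."""
    return y0 + v0 * t - 0.5 * g * t * t

def trajectory_rmse(times, observed, y0, v0, g):
    """Root-mean-square error between tracked heights and the analytic curve."""
    sq = [(y - ballistic_height(t, y0, v0, g)) ** 2 for t, y in zip(times, observed)]
    return math.sqrt(sum(sq) / len(sq))

fps = 24
times = [i / fps for i in range(12)]
lunar_g = 1.62  # m/s^2, approximate lunar surface gravity

# Synthetic stand-in for heights tracked from a generated video.
observed = [ballistic_height(t, 2.0, 0.0, lunar_g) for t in times]

rmse_lunar = trajectory_rmse(times, observed, 2.0, 0.0, lunar_g)
rmse_earth = trajectory_rmse(times, observed, 2.0, 0.0, 9.81)
```

A video that follows the prompted gravity should score a lower error against the prompted curve than against the Earth-gravity curve; a model that has only memorised Earth-like falls would show the opposite pattern.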

Figures

Figures reproduced from arXiv: 2512.05564 by Hanhui Li, Jing Wang, Long Chen, Panwen Hu, Terry Jingchen Zhang, Xiaodan Liang, Yiqiang Yan, Yuhao Cheng, Zijun Wang, Zutao Jiang.

Figure 1: Top-left: Prior work typically relies on implicit alignment without explicit physical priors or uses video-level module routing as the source of physical awareness in video generation models. Top-right: Overview of our proposed ProPhy, a progressive alignment framework, which injects and aligns learnable physical priors and performs fine-grained token-level routing, enabling different experts to internaliz…
Figure 2: Overview of our proposed ProPhy framework. ProPhy uses a progressive physical alignment design, consisting of the …
Figure 4: Study of the attention localization capabilities of VDM …
Figure 3: Pipeline for annotating token-level physical attributes …
Figure 5: Qualitative comparison among ProPhy, CogVideoX, Wan2.1, and existing physics-aware methods.
Figure 7: Refinement router expert maps. High-activation re…
Figure 8: Physical attribute transfer via expert inversion. Flipping …
Figure 6: Analysis of the semantic router. r represents the Pearson correlation coefficient calculated between different distributions.
Figure 10: Details of the two types of user inputs used to obtain …
Figure 9: Principal component analysis of the activation distri…
Figure 12: Qualitative ablation analysis on the functional roles of …
Figure 13: Comparison between ProPhy with different backbones and previous methods, including the baseline. More generated examples …
Figure 14: Examples of videos generated by ProPhy in response to text prompts involving complex physical phenomena.
Original abstract

Recent advances in video generation have shown remarkable potential for constructing world simulators. However, current models still struggle to produce physically consistent results, particularly when handling large-scale or complex dynamics. This limitation arises primarily because existing approaches respond isotropically to physical prompts and neglect the fine-grained alignment between generated content and localized physical cues. To address these challenges, we propose ProPhy, a Progressive Physical Alignment Framework that enables explicit physics-aware conditioning and anisotropic generation. ProPhy employs a two-stage Mixture-of-Physics-Experts mechanism for discriminative physical prior extraction, where Semantic Experts infer semantic-level physical principles from textual descriptions, and Refinement Experts capture token-level physical dynamics. This mechanism allows the model to learn fine-grained, physics-aware video representations that better reflect underlying physical laws. Furthermore, we introduce a physical alignment strategy that transfers the physical reasoning capabilities of vision-language models into the Refinement Experts, facilitating a more accurate representation of dynamic physical phenomena. Extensive experiments on physics-aware video generation benchmarks demonstrate that ProPhy produces more realistic, dynamic, and physically coherent results than existing state-of-the-art methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces ProPhy, a Progressive Physical Alignment Framework for physics-aware video generation and dynamic world simulation. It uses a two-stage Mixture-of-Physics-Experts (MoPE) architecture with Semantic Experts inferring physical principles from textual prompts and Refinement Experts modeling token-level dynamics, plus a VLM transfer strategy to inject physical reasoning. The central claim is that this produces more realistic, dynamic, and physically coherent outputs than existing SOTA methods on physics-aware video generation benchmarks.

Significance. If substantiated, the explicit separation of semantic-level and token-level physical modeling could advance video-based world simulators beyond purely statistical generation. The approach targets a recognized limitation in current models and, if the priors are genuinely physical rather than correlational, would be a useful contribution to controllable simulation.

major comments (2)
  1. [Experiments] The central claim that ProPhy yields physically coherent results depends on the MoPE + VLM transfer extracting and enforcing genuine physical laws. However, the experiments rely on standard video metrics (FVD, CLIP similarity, human preference) that can be satisfied by visually plausible but physically invalid outputs; no law-specific metrics (momentum conservation, energy balance, or gravity consistency across frames) are reported to isolate the mechanism from data correlations.
  2. [Method] The two-stage MoPE description (Semantic Experts for textual principles, Refinement Experts for token dynamics) is presented at a high level without derivation or ablation showing that the experts capture physical priors rather than learned statistical regularities. This is load-bearing for the claim of 'better reflect underlying physical laws.'
minor comments (2)
  1. [Method] Notation for the experts and alignment loss should be defined more explicitly with equations to allow reproduction.
  2. [Experiments] Figure captions and benchmark descriptions could clarify which physical properties are being tested.
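
As an illustration of the kind of explicit notation the first minor comment asks for, the block below writes out a standard softmax-gated mixture of experts together with one possible alignment term. This is hypothetical notation, not taken from the paper: the symbols $W_r$, $E_i$, $q_{t,i}$ and the cross-entropy form of the alignment loss are assumptions made only for illustration.

```latex
% Hypothetical notation, not the paper's.
% x_t: feature of video token t;  E_i: i-th Refinement Expert;
% W_r: router weights;  N: number of experts;  T: number of tokens;
% q_{t,i}: assumed VLM-derived target weight of expert i for token t.
\begin{align}
  g(x_t) &= \operatorname{softmax}(W_r x_t), \\
  y_t &= \sum_{i=1}^{N} g_i(x_t)\, E_i(x_t), \\
  \mathcal{L}_{\text{align}} &= -\frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{N} q_{t,i} \log g_i(x_t).
\end{align}
```

The first two lines are the usual gated-mixture form; the third is one common way to pull a router distribution toward external labels and stands in for whatever alignment loss the paper actually uses.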

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address the major comments point by point below, making revisions to enhance the clarity and substantiation of our physical alignment claims.

Point-by-point responses
  1. Referee: [Experiments] The central claim that ProPhy yields physically coherent results depends on the MoPE + VLM transfer extracting and enforcing genuine physical laws. However, the experiments rely on standard video metrics (FVD, CLIP similarity, human preference) that can be satisfied by visually plausible but physically invalid outputs; no law-specific metrics (momentum conservation, energy balance, or gravity consistency across frames) are reported to isolate the mechanism from data correlations.

    Authors: We concur that relying solely on standard metrics leaves room for ambiguity regarding whether the improvements stem from genuine physical modeling or statistical correlations. To strengthen this aspect, we have incorporated law-specific evaluations in the revised manuscript. Specifically, we report metrics for gravity consistency by measuring vertical acceleration in falling objects and momentum conservation in collision scenarios across generated frames. These additions demonstrate that ProPhy better maintains physical invariants compared to baselines. revision: yes

  2. Referee: [Method] The two-stage MoPE description (Semantic Experts for textual principles, Refinement Experts for token dynamics) is presented at a high level without derivation or ablation showing that the experts capture physical priors rather than learned statistical regularities. This is load-bearing for the claim of 'better reflect underlying physical laws.'

    Authors: The two-stage design is derived from the observation that physical understanding operates at multiple scales, with semantic experts handling prompt-based rule inference and refinement experts focusing on per-token adjustments for dynamic consistency. We have expanded the method section with a more detailed derivation of the expert specialization losses and included comprehensive ablations in the experiments. These ablations compare full MoPE against ablated versions (e.g., semantic-only or refinement-only), showing superior performance in physical coherence tasks, which supports that the experts capture distinct physical priors beyond correlations. revision: yes
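
The law-specific checks named in the first response (vertical acceleration of falling objects, momentum across collisions) could be computed roughly as follows. This is a sketch, not the authors' evaluation code: it assumes per-frame heights and pre- and post-collision velocities have already been extracted from the generated frames in physical units, and the function names are hypothetical.

```python
def vertical_accelerations(heights, dt):
    """Second-order central differences of heights (m) sampled every dt seconds."""
    return [(heights[i + 1] - 2 * heights[i] + heights[i - 1]) / (dt * dt)
            for i in range(1, len(heights) - 1)]

def gravity_error(heights, dt, g=9.81):
    """Mean absolute deviation of the estimated acceleration from -g (m/s^2)."""
    acc = vertical_accelerations(heights, dt)
    return sum(abs(a + g) for a in acc) / len(acc)

def momentum_drift(masses, v_before, v_after):
    """Relative change in total 1-D momentum across a collision."""
    p0 = sum(m * v for m, v in zip(masses, v_before))
    p1 = sum(m * v for m, v in zip(masses, v_after))
    return abs(p1 - p0) / max(abs(p0), 1e-12)

# Synthetic free fall sampled at 24 frames per second.
dt = 1 / 24
heights = [5.0 - 0.5 * 9.81 * (i * dt) ** 2 for i in range(10)]
g_err = gravity_error(heights, dt)

# Equal masses in an elastic head-on collision exchange velocities.
drift_ok = momentum_drift([1.0, 1.0], [2.0, 0.0], [0.0, 2.0])
# A physically invalid outcome: the second ball gains speed from nowhere.
drift_bad = momentum_drift([1.0, 1.0], [2.0, 0.0], [2.0, 2.0])
```

Both scores are zero for physically valid motion and grow with the violation, so they can be averaged over a benchmark and compared across models. In practice the tracking and unit-calibration steps would dominate the measurement error.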

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical benchmarks rather than self-referential derivations

Full rationale

The paper proposes ProPhy as an architectural framework consisting of a two-stage Mixture-of-Physics-Experts (Semantic Experts for textual principles and Refinement Experts for token dynamics) plus VLM transfer for physics-aware conditioning. No equations, derivations, or first-principles results are presented that reduce any claimed output to the inputs by construction, such as fitting a parameter and then relabeling a related quantity as a prediction. The central claims of more realistic and physically coherent video generation are justified by reference to external benchmarks and standard training procedures, which constitute independent empirical evaluation rather than circular reduction. This is the normal case for an applied ML architecture paper whose value is assessed against held-out data and baselines.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the unproven assumption that semantic experts can infer physical principles from text and that refinement experts can translate VLM reasoning into token-level dynamics; no explicit free parameters, axioms, or invented entities are detailed in the abstract.

pith-pipeline@v0.9.0 · 5513 in / 1112 out tokens · 78885 ms · 2026-05-17T01:01:05.546295+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ACWM-Phys: Investigating Generalized Physical Interaction in Action-Conditioned Video World Models

    cs.CV 2026-05 unverdicted novelty 6.0

    ACWM-Phys benchmark shows action-conditioned world models generalize on simple geometric interactions but drop sharply on deformable contacts, high-dimensional control, and complex articulated motion, indicating relia...
