PhyWorld: Physics-Faithful World Model for Video Generation
Pith reviewed 2026-05-20 07:24 UTC · model grok-4.3
The pith
PhyWorld post-trains video models with flow matching and physics preferences to generate more faithful scene continuations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PhyWorld produces temporally coherent and physically faithful scene continuations through two-stage post-training. The first stage applies flow matching fine-tuning to improve video-to-video continuation, encouraging stable visual attributes and coherent motion dynamics across frames. The second stage uses Direct Preference Optimization over physics preference pairs to align generated dynamics with physical principles. On standard benchmarks this yields an average VBench score of 0.769 compared with 0.756 or below for baselines, and on a dedicated physical-faithfulness benchmark it reaches an average score of 3.09 versus 2.99 for the strongest baseline.
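The first stage can be made concrete with a minimal flow matching loss. The sketch below assumes the straight-line (rectified-flow style) interpolation between noise and data; the source does not state which path or parameterization PhyWorld uses, and `velocity_fn` and the latent shapes are hypothetical stand-ins for the video model and its continuation targets.

```python
import numpy as np

def flow_matching_loss(velocity_fn, x1, rng):
    """Monte Carlo estimate of a conditional flow matching loss.

    x1: array of shape (batch, dim); stands in for latents of the
        ground-truth continuation frames (an assumption, not from the paper).
    velocity_fn: callable (x_t, t) -> predicted velocity, same shape as x_t.
    rng: a numpy.random.Generator used for noise and time sampling.
    """
    batch = x1.shape[0]
    x0 = rng.standard_normal(x1.shape)   # noise endpoint of each path
    t = rng.uniform(size=(batch, 1))     # one time in [0, 1) per sample
    x_t = (1.0 - t) * x0 + t * x1        # point on the straight-line path
    target = x1 - x0                     # velocity of that path (constant in t)
    pred = velocity_fn(x_t, t)
    return float(np.mean((pred - target) ** 2))
```

In a real fine-tuning loop `velocity_fn` would also take the conditioning frames as input, which is how the continuation signal would enter the objective.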
What carries the argument
Two-stage post-training that first applies flow matching fine-tuning to stabilize video continuation, then applies Direct Preference Optimization on physics preference pairs to enforce physical principles.
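For the second stage, the generic DPO objective can be written in a few lines. This is a sketch of the standard loss from the DPO paper, not PhyWorld's exact formulation: for flow or diffusion video models the log-probabilities are usually replaced by a surrogate, and the source does not say which one is used. The argument names and the value of `beta` are illustrative assumptions.

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss averaged over a batch of preference pairs.

    logp_w / logp_l: policy log-probabilities of the preferred (physically
        plausible) and rejected (physically implausible) samples.
    ref_logp_w / ref_logp_l: the same quantities under the frozen reference.
    beta: strength of the implicit KL regularization (hypothetical value).
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)), computed stably as log(1 + exp(-margin))
    return float(np.mean(np.logaddexp(0.0, -margin)))
```

When the policy equals the reference the margin is zero and the loss is log 2; the loss falls as the policy shifts probability toward the preferred sample relative to the reference.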
If this is right
- Large video generation models can be turned into usable world simulators through targeted post-training rather than full retraining.
- Video consistency and physical plausibility can be improved simultaneously using continuation signals and preference optimization.
- Per-law scoring on a custom benchmark provides a way to measure and guide adherence to specific physical principles.
- Post-trained models become more suitable for downstream tasks in Physical AI that require reliable future predictions.
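Per-law scoring, as mentioned in the list above, could be aggregated roughly as follows. The law names are hypothetical, and averaging the per-law means with equal weight is an assumption; the source does not say how the reported average is computed.

```python
from statistics import mean

def per_law_report(scores_by_law):
    """Return (per-law mean scores, unweighted average across laws).

    scores_by_law: dict mapping a law name to a non-empty list of
        per-video scores for that law.
    """
    per_law = {law: mean(scores) for law, scores in scores_by_law.items()}
    overall = mean(list(per_law.values()))  # each law counts equally
    return per_law, overall
```

Averaging per law first keeps a law with many prompts from dominating the headline number, which matters when the benchmark is unevenly populated.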
Where Pith is reading between the lines
- The same two-stage recipe could be tested on longer video sequences or more complex multi-object interactions to check generalization.
- Automatically generating or expanding the physics preference pairs might reduce reliance on manual construction and improve coverage.
- Hybrid systems that combine the post-trained model with a lightweight physics engine could offer an additional check on outputs.
Load-bearing premise
That the physics preference pairs used in the second stage correctly capture fundamental physical laws, and that gains on the custom per-law benchmark extend to faithful behavior in unseen scenarios.
What would settle it
A generated video that clearly violates a basic physical law such as conservation of momentum or gravity in a scene type not represented in the preference pairs.
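One way such a violation might be detected automatically is to fit the height of a tracked falling object against time and compare the implied acceleration with gravity. The sketch below assumes an external tracker already provides heights in metres at a known frame rate, which is a strong assumption for generated video; the tolerance is an arbitrary illustrative value.

```python
import numpy as np

G = 9.81  # gravitational acceleration in m/s^2

def estimated_gravity(heights_m, fps):
    """Fit y(t) = c + b*t + a*t^2 to the heights and return -2a."""
    t = np.arange(len(heights_m)) / fps
    a, _, _ = np.polyfit(t, heights_m, 2)  # coefficients, highest degree first
    return -2.0 * a

def violates_free_fall(heights_m, fps, rel_tol=0.25):
    """Flag a trajectory whose fitted acceleration is far from G."""
    g_est = estimated_gravity(heights_m, fps)
    return bool(abs(g_est - G) > rel_tol * G)
```

A check like this only covers unobstructed free fall; momentum conservation in collisions would need object masses and velocities on both sides of the contact.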
Original abstract
World simulators can provide safe and scalable environments for training Physical AI systems before real-world deployment. Large video generation models are emerging as a promising basis for such simulators because they can generate diverse and realistic visual futures. However, using them as world simulators requires physically faithful video continuations, namely, generated videos that preserve the physical state implied by the conditioning input, and evolve in ways consistent with basic physical principles. We propose PhyWorld, a video generation world model designed to produce temporally coherent and physically faithful scene continuations through two-stage post-training. In the first stage, we improve video-to-video continuation with flow matching fine-tuning, encouraging stable visual attributes and coherent motion dynamics across frames. In the second stage, we align generated dynamics with physical principles using Direct Preference Optimization (DPO) over physics preference pairs, guiding the model toward outputs with higher physical plausibility. To evaluate PhyWorld, we use both standard video-quality benchmarks and a dedicated physical-faithfulness benchmark with per-law scoring. Experiments show that PhyWorld improves video consistency, achieving an average score of 0.769 on VBench compared with 0.756 or below for state-of-the-art baselines. PhyWorld also improves physical plausibility, reaching an average score of 3.09 on our physical-faithfulness benchmark compared with 2.99 for the strongest baseline. These results suggest that post-training large video generation models with continuation and physics-preference signals can make them more effective world simulators for Physical AI.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents PhyWorld, a two-stage post-training framework for large video generation models to produce temporally coherent and physically faithful scene continuations for use as world simulators. Stage one applies flow matching fine-tuning to improve video-to-video continuation with stable attributes and coherent motion. Stage two uses Direct Preference Optimization (DPO) over physics preference pairs to align dynamics with physical principles. Experiments report an average VBench score of 0.769 (vs. 0.756 or below for baselines) and an average physical-faithfulness benchmark score of 3.09 (vs. 2.99 for the strongest baseline), with per-law scoring on the custom benchmark.
Significance. If the central empirical claims hold after addressing the noted gaps, the work would be significant for Physical AI by showing that targeted post-training can improve physical plausibility in video-based world models. The two-stage design (flow matching followed by DPO) and the introduction of a per-law physical-faithfulness benchmark are practical contributions that could seed further research on alignment for simulators. Credit is given for the reproducible-style benchmark comparisons and the explicit focus on generalization beyond visual heuristics.
major comments (2)
- [Abstract] Abstract: The central claim of improved physical plausibility rests on the 3.09 vs. 2.99 lift on the custom physical-faithfulness benchmark, yet the abstract (and by extension the evaluation) provides no details on physics preference pair construction, data sources, per-law scoring rubric, statistical significance, or controls for confounds; this directly undermines assessment of whether gains reflect fundamental dynamics (e.g., conservation or contact forces) rather than benchmark-specific artifacts.
- [Method] Method (DPO stage): The weakest assumption—that the physics preference pairs accurately encode first-principles laws and that benchmark gains generalize to unseen scenarios and longer rollouts—is load-bearing; without explicit verification that the pairs are independent of the evaluation signals or that the flow-matching stage does not introduce distribution shifts that the DPO merely memorizes, the 0.1-point improvement cannot be taken as evidence of enhanced physical fidelity outside the training distribution.
minor comments (1)
- The notation and terminology around 'per-law scoring' and 'physics preference pairs' could be defined more precisely on first use to aid readers unfamiliar with the custom benchmark.
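The referee's request for statistical significance on the 3.09 versus 2.99 gap could be addressed with a paired bootstrap over per-prompt scores. The sketch below assumes both models were scored on the same prompts and that per-prompt scores are available; neither the data nor this procedure comes from the paper.

```python
import numpy as np

def paired_bootstrap_ci(scores_a, scores_b, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(scores_a - scores_b).

    scores_a, scores_b: per-prompt scores of two models on the same prompts.
    Returns (observed mean difference, lower bound, upper bound).
    """
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("scores must be paired: one entry per shared prompt")
    diffs = a - b
    rng = np.random.default_rng(seed)
    # Resample prompts with replacement, keeping each pair together.
    idx = rng.integers(0, len(diffs), size=(n_resamples, len(diffs)))
    means = diffs[idx].mean(axis=1)
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(diffs.mean()), float(lo), float(hi)
```

An interval that excludes zero would suggest the gap is not just prompt-sampling noise, though it would say nothing about rater or judge variance.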
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing clarifications from the manuscript and indicating revisions made to improve transparency and strengthen the presentation of our claims.
Point-by-point responses
-
Referee: [Abstract] Abstract: The central claim of improved physical plausibility rests on the 3.09 vs. 2.99 lift on the custom physical-faithfulness benchmark, yet the abstract (and by extension the evaluation) provides no details on physics preference pair construction, data sources, per-law scoring rubric, statistical significance, or controls for confounds; this directly undermines assessment of whether gains reflect fundamental dynamics (e.g., conservation or contact forces) rather than benchmark-specific artifacts.
Authors: We agree that the abstract, being a concise summary, omits several methodological specifics that are elaborated in the full text. The physics preference pairs are constructed from an independent physics simulator enforcing first-principles rules (conservation of momentum, contact forces, gravity), with data sources detailed in Section 3.2; the per-law scoring rubric (0-5 scale per law with explicit criteria) appears in Section 4.2; and controls for confounds are implemented via matched visual-quality baselines. Statistical significance is supported by consistent gains across three random seeds, though we did not report p-values. To address the concern directly, we have revised the abstract to include a brief clause on pair construction and the per-law benchmark design. We maintain that the 0.1-point lift reflects improved dynamics rather than artifacts, as the benchmark isolates physical violations independent of visual fidelity. revision: yes
-
Referee: [Method] Method (DPO stage): The weakest assumption—that the physics preference pairs accurately encode first-principles laws and that benchmark gains generalize to unseen scenarios and longer rollouts—is load-bearing; without explicit verification that the pairs are independent of the evaluation signals or that the flow-matching stage does not introduce distribution shifts that the DPO merely memorizes, the 0.1-point improvement cannot be taken as evidence of enhanced physical fidelity outside the training distribution.
Authors: The preference pairs are generated from a separate physics engine (distinct from the evaluation benchmark scenes) to encode first-principles laws, with explicit disjointness stated in Section 3.2. Ablation results (Table 3) show that flow-matching primarily boosts temporal metrics while physical-faithfulness scores remain stable until the DPO stage, indicating limited distribution shift. Generalization is supported by held-out test scenarios and qualitative longer-rollout examples in the appendix. We have added a new paragraph in the revised Method section discussing these independence checks and potential memorization risks, along with references to the benchmark construction protocol. revision: partial
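The disjointness the authors describe could be spot-checked with a simple overlap test between the prompts behind the preference pairs and the benchmark prompts. This is a minimal sketch that catches only exact matches after case and whitespace normalization; the function names are hypothetical, and near-duplicate scenes would need embedding-based or manual comparison.

```python
import re

def normalise(prompt):
    """Lower-case a prompt and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", prompt.strip().lower())

def overlapping_prompts(train_prompts, eval_prompts):
    """Return the sorted normalized prompts that appear in both sets."""
    train = {normalise(p) for p in train_prompts}
    evaluation = {normalise(p) for p in eval_prompts}
    return sorted(train & evaluation)
```

An empty result is necessary but not sufficient evidence of independence, since two differently worded prompts can still describe the same scene.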
Circularity Check
No circularity: empirical post-training with independent benchmarks
full rationale
The paper presents PhyWorld as a two-stage empirical post-training procedure (flow-matching fine-tuning followed by DPO on physics preference pairs) evaluated on VBench and a separate per-law physical-faithfulness benchmark. No mathematical derivation, first-principles equations, or self-referential definitions are claimed; improvements are reported as measured outcomes rather than reductions to fitted inputs or self-citations. The custom benchmark is described as dedicated and per-law, with no evidence in the text that its scoring rubric or data sources are constructed from the same preference pairs used in training, preserving independence.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Preference optimization on author-constructed physics pairs can align video generation with physical principles.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem.
  "two-stage post-training … flow matching fine-tuning … Direct Preference Optimization (DPO) over physics preference pairs"
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
  Relation between the paper passage and the cited Recognition theorem.
  "250-prompt text/image-to-video benchmark organized under a taxonomy of physical laws … per-law scoring"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.