arxiv: 2512.01843 · v3 · pith:7DQBDYEJnew · submitted 2025-12-01 · 💻 cs.CV

PhyDetEx: Detecting and Explaining the Physical Plausibility of T2V Models

Pith reviewed 2026-05-21 17:52 UTC · model grok-4.3

classification 💻 cs.CV
keywords text-to-video · physical plausibility · vision-language models · PID dataset · fine-tuning · model benchmarking · physics violations

The pith

Fine-tuned VLMs detect and explain physical implausibilities in text-to-video generations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds the PID (Physical Implausibility Detection) dataset, which pairs real videos with implausible counterparts that T2V models generate from rewritten versions of the real videos' captions, each rewrite intended to induce a specific physical violation. It applies a lightweight fine-tuning method to vision-language models (VLMs) so they can both identify these implausible events and describe the violated physical principles. The resulting system, PhyDetEx, is then used to benchmark leading text-to-video (T2V) models. The evaluation shows progress in visual quality but persistent failures to follow physical laws, particularly among open-source models. A sympathetic reader would care because practical video generation requires models that respect real-world physics, not ones that merely produce visually convincing footage.

Core claim

We construct the PID dataset, containing a test split of 500 manually annotated videos and a train split of 2,588 paired videos, where each implausible video is generated by a T2V model from a rewritten version of the caption of its corresponding real-world video. With this data we introduce a lightweight fine-tuning approach that turns VLMs into detectors and explainers of physical implausibility. The resulting PhyDetEx model is applied to benchmark state-of-the-art T2V systems, revealing that recent models have advanced toward physically plausible output yet still struggle to understand and adhere to physical laws, especially open-source variants.

What carries the argument

The PID dataset of caption-rewritten real-implausible video pairs used to fine-tune VLMs for joint detection of implausible events and generation of textual explanations for violated physical principles.
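
As a rough illustration of what one such pair might look like in code, the sketch below models a single training pair and expands it into two supervised examples. The `PidPair` class, every field name, the label strings, and the explanation template are hypothetical; the dataset released in the authors' repository may be organized quite differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PidPair:
    """Hypothetical record for one real/implausible pair (not the released schema)."""
    real_video: str          # path to the real-world clip
    real_caption: str        # original caption of the real clip
    rewritten_caption: str   # caption rewritten to induce a violation
    generated_video: str     # T2V output produced from the rewritten caption
    violated_principle: str  # e.g. "gravity"; the label vocabulary is assumed


def to_training_examples(pair: PidPair) -> list[dict]:
    """Expand a pair into one plausible and one implausible supervised example."""
    return [
        {"video": pair.real_video, "label": "plausible", "explanation": ""},
        {
            "video": pair.generated_video,
            "label": "implausible",
            # Assumed template; real explanations would be free-form text.
            "explanation": f"Violates {pair.violated_principle}.",
        },
    ]
```

Keeping both halves of a pair in one record makes it easy to check that every implausible example has a real counterpart drawn from the same scene.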

If this is right

  • State-of-the-art T2V models can be systematically ranked for physical adherence using an automated detector and explainer.
  • Explanations of violated principles supply concrete signals that can guide targeted improvements in T2V training data or loss functions.
  • Open-source T2V models exhibit larger gaps in physical understanding than closed-source counterparts and therefore require focused remediation.
  • The lightweight fine-tuning recipe demonstrates that general VLMs can be specialized for physics evaluation without prohibitive compute.
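
The ranking idea in the first bullet can be sketched as a simple aggregation over detector verdicts. This is a minimal illustration, assuming the detector emits one boolean verdict per generated video; the function name and input format are made up for the example and are not taken from the paper.

```python
from collections import defaultdict


def rank_by_plausibility(verdicts):
    """Rank models by the share of their videos judged plausible.

    verdicts: iterable of (model_name, is_implausible) pairs, one per video.
    Returns a list of (model_name, plausible_rate), best model first.
    """
    total = defaultdict(int)
    flagged = defaultdict(int)
    for model, is_implausible in verdicts:
        total[model] += 1
        flagged[model] += int(is_implausible)
    rates = {m: 1 - flagged[m] / total[m] for m in total}
    # Sort by descending rate; break ties alphabetically for determinism.
    return sorted(rates.items(), key=lambda kv: (-kv[1], kv[0]))
```

A ranking of this kind is only as trustworthy as the detector behind it, which is the concern the referee report raises.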

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same caption-rewriting technique could be applied to create evaluation sets for text-to-image or text-to-3D generators.
  • Persistent physical failures suggest that current T2V training corpora under-represent explicit physical constraints.
  • PhyDetEx outputs could be fed back into T2V training loops to curate or re-weight data that emphasizes physical correctness.

Load-bearing premise

Generating videos from rewritten captions of real videos reliably yields violations of specific physical principles, and those violations are both detectable and representative of real T2V failure modes.

What would settle it

The method's reliability would be falsified by a held-out set of T2V outputs with independently verified physical violations (such as broken object permanence or gravity) on which the fine-tuned VLM either misses the violations or explains them incorrectly.
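
One way to score such a held-out check is sketched below. It assumes each free-form explanation has already been mapped to a single principle label, which is a simplification introduced here; the dictionary keys and the function name are hypothetical. What failure rate would count as falsifying is a judgment the source does not specify, so no threshold is encoded.

```python
def failure_rate(cases):
    """Fraction of verified violations the detector misses or mis-explains.

    Each case is a dict with (assumed) keys:
      true_principle: independently verified violated principle
      detected: True if the detector flagged the video as implausible
      predicted_principle: principle named in the explanation, or None
    """
    if not cases:
        raise ValueError("need at least one verified case")
    failures = sum(
        1
        for c in cases
        if not c["detected"] or c["predicted_principle"] != c["true_principle"]
    )
    return failures / len(cases)
```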

Figures

Figures reproduced from arXiv: 2512.01843 by Keze Wang, Lei Zhang, Zeqing Wang.

Figure 1: Illustration of the physical implausibility detection task.
Figure 2: Results of preliminary experiments in the Impossible ...
Figure 3: Overview of the construction pipeline of the PID dataset and the training process of PhyDetEx. (a) The PID training split. The ...
Figure 4: Qualitative comparison between our PhyDetEx and recent VLMs on detecting physical implausibility. We illustrate two rep...

Original abstract

Driven by the growing capacity and training scale, Text-to-Video (T2V) generation models have recently achieved substantial progress in video quality, length, and instruction-following capability. However, whether these models can understand physics and generate physically plausible videos remains a question. While Vision-Language Models (VLMs) have been widely used as general-purpose evaluators in various applications, they struggle to identify the physically impossible content from generated videos. To investigate this issue, we construct a PID (Physical Implausibility Detection) dataset, which consists of a test split of 500 manually annotated videos and a train split of 2,588 paired videos, where each implausible video is generated by carefully rewriting the caption of its corresponding real-world video to induce T2V models producing physically implausible content. With the constructed dataset, we introduce a lightweight fine-tuning approach, enabling VLMs to not only detect physically implausible events but also generate textual explanations on the violated physical principles. Taking the fine-tuned VLM as a physical plausibility detector and explainer, namely PhyDetEx, we benchmark a series of state-of-the-art T2V models to assess their adherence to physical laws. Our findings show that although recent T2V models have made notable progress toward generating physically plausible content, understanding and adhering to physical laws remains a challenging issue, especially for open-source models. Our dataset, training code, and checkpoints are available at https://github.com/Zeqing-Wang/PhyDetEx.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces PhyDetEx, a fine-tuned VLM-based detector and explainer for physical implausibilities in T2V video generations. It constructs the PID dataset (500 manually annotated test videos plus 2,588 paired train videos created by rewriting real-video captions to induce targeted physical violations such as breaches of gravity or momentum), fine-tunes VLMs on this data to both detect implausible events and generate textual explanations of violated principles, and benchmarks a range of state-of-the-art T2V models, concluding that physical-law adherence remains challenging, especially for open-source models.

Significance. If the detector proves reliable, the work supplies a practical evaluation tool and benchmark for physical plausibility in T2V systems, an area where current models still fall short. The public release of the PID dataset, training code, and checkpoints is a clear strength that supports reproducibility and follow-on research.

major comments (2)
  1. [PID dataset construction] PID dataset construction (train split): rewriting captions of real videos to induce specific violations (gravity, momentum, object permanence, etc.) is assumed to produce a distribution of failures representative of those that arise under natural-language prompting. No comparison of violation statistics between rewritten-prompt generations and organic generations, nor human validation of representativeness, is described; without this, the fine-tuned detector's scores on the benchmarked T2V outputs may not generalize, weakening the claim that open-source models adhere less well to physical laws.
  2. [Evaluation and benchmarking] Evaluation section: the central claim that T2V models (particularly open-source ones) still fail to adhere to physical laws rests on PhyDetEx being a trustworthy detector. The manuscript reports dataset sizes and a benchmarking outcome but provides no quantitative detector metrics (accuracy, precision/recall on the 500-video test split), error analysis, or ablation on the fine-tuning procedure; these are required to establish that the detector itself is not the source of the observed differences.
minor comments (2)
  1. [Abstract] Abstract: dataset sizes are stated but no key quantitative benchmarking results or detector performance numbers appear, making it harder for readers to gauge the strength of the findings at a glance.
  2. [Figures] Figure and table captions: ensure all figures showing example implausible videos or detector outputs are accompanied by explicit captions that list the violated physical principle and the model that produced the video.
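
The detector metrics requested in major comment 2 (accuracy, precision, and recall on the test split) can be computed as in the following minimal sketch. It treats "implausible" as the positive class and assumes binary ground-truth and predicted labels are available as parallel lists; the function name and return format are illustrative, and no actual results from the paper are implied.

```python
def detection_metrics(y_true, y_pred):
    """Accuracy, precision, and recall with 'implausible' (True) as positive.

    y_true, y_pred: equal-length sequences of booleans.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    tn = sum(1 for t, p in pairs if not t and not p)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Fall back to 0.0 when a denominator is empty.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }
```

Reporting precision and recall separately matters here: a detector that over-flags videos would inflate the apparent failure rate of every benchmarked T2V model.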

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. We address each major comment point by point below, indicating planned revisions to strengthen the manuscript while maintaining the integrity of our contributions.

  1. Referee: [PID dataset construction] PID dataset construction (train split): rewriting captions of real videos to induce specific violations (gravity, momentum, object permanence, etc.) is assumed to produce a distribution of failures representative of those that arise under natural-language prompting. No comparison of violation statistics between rewritten-prompt generations and organic generations, nor human validation of representativeness, is described; without this, the fine-tuned detector's scores on the benchmarked T2V outputs may not generalize, weakening the claim that open-source models adhere less well to physical laws.

    Authors: We appreciate the referee highlighting this aspect of our dataset design. The train split of 2,588 paired videos was intentionally constructed by rewriting real-video captions to induce targeted physical violations, enabling controlled and diverse training examples for fine-tuning the VLM on specific principles such as gravity and momentum. This paired structure supports the explainer component as well. The independent test split of 500 manually annotated videos provides ground-truth evaluation. We acknowledge that an explicit statistical comparison to organic generations and additional human validation of representativeness would further support generalization claims. In the revised manuscript, we will add a dedicated discussion of the construction rationale and include results from a small-scale human validation study on violation representativeness. revision: yes

  2. Referee: [Evaluation and benchmarking] Evaluation section: the central claim that T2V models (particularly open-source ones) still fail to adhere to physical laws rests on PhyDetEx being a trustworthy detector. The manuscript reports dataset sizes and a benchmarking outcome but provides no quantitative detector metrics (accuracy, precision/recall on the 500-video test split), error analysis, or ablation on the fine-tuning procedure; these are required to establish that the detector itself is not the source of the observed differences.

    Authors: We agree that quantitative metrics are necessary to substantiate the trustworthiness of PhyDetEx and the validity of the benchmarking conclusions. The current manuscript prioritizes the overall findings on T2V model performance but does not include the requested detector-level evaluations. In the revised version, we will expand the evaluation section to report accuracy, precision, and recall on the 500-video test split, provide an error analysis, and include ablations on the fine-tuning procedure. These additions will directly address concerns about the detector as a potential source of observed differences. revision: yes
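
The comparison of violation statistics raised in major comment 1 could take many forms. One simple option, sketched here as an assumption rather than anything the authors propose, is the total variation distance between the violation-category distributions of rewritten-prompt generations and organic generations. The category labels and function name are hypothetical.

```python
from collections import Counter


def total_variation(labels_a, labels_b):
    """Total variation distance between two empirical label distributions.

    Returns a value in [0, 1]: 0 for identical distributions,
    1 for distributions with no category in common.
    """
    if not labels_a or not labels_b:
        raise ValueError("both label lists must be non-empty")
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    n_a, n_b = len(labels_a), len(labels_b)
    categories = set(counts_a) | set(counts_b)
    return 0.5 * sum(
        abs(counts_a[c] / n_a - counts_b[c] / n_b) for c in categories
    )
```

A small distance would suggest the two sets share a similar mix of violation types, though it says nothing about whether individual violations look alike.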

Circularity Check

0 steps flagged

No circularity: empirical dataset construction and external benchmarking

The paper constructs the PID dataset by rewriting real-video captions to induce targeted physical violations, fine-tunes VLMs on the resulting paired data for detection and explanation, and then applies the fine-tuned model to benchmark external T2V generators. No mathematical derivation, fitted parameter, or prediction is presented; all quantitative results are obtained by running the detector on outputs from independent models. The central claims rest on the methodological assumption that caption rewriting produces representative violations, but this assumption is not derived from or equivalent to the paper's own outputs by construction. No self-citation chain, uniqueness theorem, or ansatz is invoked to justify the core pipeline. The work is therefore self-contained against external T2V models and does not reduce to its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that the caption-rewriting procedure produces videos whose physical violations are both genuine and representative, plus the assumption that fine-tuned VLMs can reliably generalize from the constructed pairs to outputs of other T2V models. No free parameters or invented entities are introduced.

axioms (2)
  • domain assumption: Vision-language models can be fine-tuned on paired plausible/implausible videos to detect physical violations and generate explanations of violated principles.
    This is the core modeling assumption that enables PhyDetEx; it is invoked when the authors describe the lightweight fine-tuning approach.
  • domain assumption: Caption rewriting of real videos induces T2V models to generate content that violates identifiable physical laws.
    Stated in the dataset construction paragraph of the abstract.

pith-pipeline@v0.9.0 · 5846 in / 1468 out tokens · 80870 ms · 2026-05-21T17:52:36.638030+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. PhyGround: Benchmarking Physical Reasoning in Generative World Models

    cs.CV 2026-05 accept novelty 7.0

    PhyGround is a new benchmark with curated prompts, a 13-law taxonomy, large-scale human annotations, and an open physics-specialized VLM judge for evaluating physical reasoning in generative video models.

  2. From Pixels to Concepts: Do Segmentation Models Understand What They Segment?

    cs.CV 2026-05 unverdicted novelty 7.0

    CAFE benchmark reveals that promptable segmentation models often produce correct masks for misleading prompts, showing a gap between localization accuracy and true concept understanding.


    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiao- han Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer.arXiv preprint arXiv:2408.06072, 2024. 1, 2, 7

  56. [56]

    MiniCPM-V: A GPT-4V Level MLLM on Your Phone

    Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. Minicpm-v: A gpt-4v level mllm on your phone.arXiv preprint arXiv:2408.01800, 2024. 7

  57. [57]

    AnyGPT: Unified multimodal LLM with discrete sequence modeling

    Jun Zhan, Junqi Dai, Jiasheng Ye, Yunhua Zhou, Dong Zhang, Zhigeng Liu, Xin Zhang, Ruibin Yuan, Ge Zhang, Linyang Li, Hang Yan, Jie Fu, Tao Gui, Tianxiang Sun, Yu- Gang Jiang, and Xipeng Qiu. AnyGPT: Unified multimodal LLM with discrete sequence modeling. InProceedings of the 62nd Annual Meeting of the Association for Computa- tional Linguistics (Volume 1...

  58. [58]

    MA VIS: Mathe- matical visual instruction tuning with an automatic data en- gine

    Renrui Zhang, Xinyu Wei, Dongzhi Jiang, Ziyu Guo, Yichi Zhang, Chengzhuo Tong, Jiaming Liu, Aojun Zhou, Shang- hang Zhang, Peng Gao, and Hongsheng Li. MA VIS: Mathe- matical visual instruction tuning with an automatic data en- gine. InThe Thirteenth International Conference on Learn- ing Representations, 2025. 2

  59. [59]

    Omg-llava: Bridging image-level, object-level, pixel-level reasoning and understanding

    Tao Zhang, Xiangtai Li, Hao Fei, Haobo Yuan, Shengqiong Wu, Shunping Ji, Change Loy Chen, and Shuicheng Yan. Omg-llava: Bridging image-level, object-level, pixel-level reasoning and understanding. InNeurIPS, 2024. 3

  60. [60]

    Vide- orepa: Learning physics for video generation through re- lational alignment with foundation models.arXiv preprint arXiv:2505.23656, 2025

    Xiangdong Zhang, Jiaqi Liao, Shaofeng Zhang, Fanqing Meng, Xiangpeng Wan, Junchi Yan, and Yu Cheng. Vide- orepa: Learning physics for video generation through re- lational alignment with foundation models.arXiv preprint arXiv:2505.23656, 2025. 2

  61. [61]

    VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness

    Dian Zheng, Ziqi Huang, Hongbo Liu, Kai Zou, Yinan He, Fan Zhang, Yuanhan Zhang, Jingwen He, Wei-Shi Zheng, Yu Qiao, et al. Vbench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness.arXiv preprint arXiv:2503.21755, 2025. 2

  62. [62]

    generated vs. non-generated

    Zheng Zhu, Xiaofeng Wang, Wangbo Zhao, Chen Min, Ni- anchen Deng, Min Dou, Yuqi Wang, Botian Shi, Kai Wang, Chi Zhang, et al. Is sora a world simulator? a comprehensive survey on general world models and beyond.arXiv preprint arXiv:2405.03520, 2024. 2 PhyDetEx: Detecting and Explaining the Physical Plausibility of T2V Models Supplementary Material We high...