PhyDetEx: Detecting and Explaining the Physical Plausibility of T2V Models
Pith reviewed 2026-05-21 17:52 UTC · model grok-4.3
The pith
Fine-tuned VLMs detect and explain physical implausibilities in text-to-video generations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We construct the PID dataset containing a test split of 500 manually annotated videos and a train split of 2,588 paired videos where each implausible video is produced by rewriting the caption of its corresponding real-world video. With this data we introduce a lightweight fine-tuning approach that turns VLMs into detectors and explainers of physical implausibility. The resulting PhyDetEx model is applied to benchmark state-of-the-art T2V systems, revealing that recent models have advanced toward physically plausible output yet still struggle to understand and adhere to physical laws, especially open-source variants.
What carries the argument
The PID dataset of caption-rewritten real-implausible video pairs used to fine-tune VLMs for joint detection of implausible events and generation of textual explanations for violated physical principles.
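As a rough illustration of the paired structure described above, the following sketch shows one way a real/implausible pair could be expanded into supervised records for joint detection and explanation. The class name, field names, and record layout are hypothetical and are not taken from the paper or its released code.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class PairedSample:
    """One hypothetical PID-style training pair (all field names are assumptions)."""
    real_video: str          # path to the real-world video
    real_caption: str        # original caption of the real video
    implausible_video: str   # path to the T2V output from the rewritten caption
    rewritten_caption: str   # caption rewritten to induce a physical violation
    explanation: str         # text describing the violated physical principle


def to_training_records(pair: PairedSample) -> List[Dict[str, str]]:
    """Expand one pair into a plausible record and an implausible record."""
    return [
        {"video": pair.real_video, "label": "plausible", "explanation": ""},
        {
            "video": pair.implausible_video,
            "label": "implausible",
            "explanation": pair.explanation,
        },
    ]
```

Keeping both halves of a pair together means each implausible example has a matched plausible counterpart, which is the property the paired train split is described as providing.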
If this is right
- State-of-the-art T2V models can be systematically ranked for physical adherence using an automated detector and explainer.
- Explanations of violated principles supply concrete signals that can guide targeted improvements in T2V training data or loss functions.
- Open-source T2V models exhibit larger gaps in physical understanding than closed-source counterparts and therefore require focused remediation.
- The lightweight fine-tuning recipe demonstrates that general VLMs can be specialized for physics evaluation without prohibitive compute.
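The ranking idea in the first point can be sketched as follows, assuming each T2V model's outputs have already been passed through a detector that returns a boolean flag per video. The function name and input layout are illustrative assumptions, not the paper's benchmarking protocol.

```python
from typing import Dict, List, Tuple


def rank_by_implausibility(verdicts: Dict[str, List[bool]]) -> List[Tuple[str, float]]:
    """Rank models from lowest to highest fraction of videos flagged implausible.

    `verdicts` maps a model name to per-video detector flags, where True
    means the detector judged the video physically implausible.
    """
    rates = []
    for model, flags in verdicts.items():
        if not flags:
            raise ValueError(f"no verdicts for model {model!r}")
        rates.append((model, sum(flags) / len(flags)))
    # A lower implausibility rate is treated as better physical adherence.
    return sorted(rates, key=lambda item: item[1])
```

Such a ranking is only as trustworthy as the detector that produces the flags, which is the concern raised in the referee report.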
Where Pith is reading between the lines
- The same caption-rewriting technique could be applied to create evaluation sets for text-to-image or text-to-3D generators.
- Persistent physical failures suggest that current T2V training corpora under-represent explicit physical constraints.
- PhyDetEx outputs could be fed back into T2V training loops to curate or re-weight data that emphasizes physical correctness.
Load-bearing premise
Rewriting captions of real videos reliably produces videos that violate specific physical principles in ways that are both detectable and representative of real T2V failure modes.
What would settle it
A held-out set of T2V outputs containing independently verified physical violations (such as broken object permanence or gravity) on which the fine-tuned VLM either misses the violation or gives incorrect explanations would falsify the method's reliability.
Original abstract
Driven by the growing capacity and training scale, Text-to-Video (T2V) generation models have recently achieved substantial progress in video quality, length, and instruction-following capability. However, whether these models can understand physics and generate physically plausible videos remains a question. While Vision-Language Models (VLMs) have been widely used as general-purpose evaluators in various applications, they struggle to identify the physically impossible content from generated videos. To investigate this issue, we construct a PID (Physical Implausibility Detection) dataset, which consists of a test split of 500 manually annotated videos and a train split of 2,588 paired videos, where each implausible video is generated by carefully rewriting the caption of its corresponding real-world video to induce T2V models producing physically implausible content. With the constructed dataset, we introduce a lightweight fine-tuning approach, enabling VLMs to not only detect physically implausible events but also generate textual explanations on the violated physical principles. Taking the fine-tuned VLM as a physical plausibility detector and explainer, namely PhyDetEx, we benchmark a series of state-of-the-art T2V models to assess their adherence to physical laws. Our findings show that although recent T2V models have made notable progress toward generating physically plausible content, understanding and adhering to physical laws remains a challenging issue, especially for open-source models. Our dataset, training code, and checkpoints are available at https://github.com/Zeqing-Wang/PhyDetEx.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PhyDetEx, a fine-tuned VLM-based detector and explainer for physical implausibilities in T2V video generations. It constructs the PID dataset (500 manually annotated test videos plus 2,588 paired train videos created by rewriting real-video captions to induce targeted physical violations such as breaches of gravity or momentum), fine-tunes VLMs on this data to both detect implausible events and generate textual explanations of violated principles, and benchmarks a range of state-of-the-art T2V models, concluding that physical-law adherence remains challenging, especially for open-source models.
Significance. If the detector proves reliable, the work supplies a practical evaluation tool and benchmark for physical plausibility in T2V systems, an area where current models still fall short. The public release of the PID dataset, training code, and checkpoints is a clear strength that supports reproducibility and follow-on research.
major comments (2)
- [PID dataset construction] PID dataset construction (train split): rewriting captions of real videos to induce specific violations (gravity, momentum, object permanence, etc.) is assumed to produce a distribution of failures representative of those that arise under natural-language prompting. No comparison of violation statistics between rewritten-prompt generations and organic generations, nor human validation of representativeness, is described; without this, the fine-tuned detector's scores on the benchmarked T2V outputs may not generalize, weakening the claim that open-source models adhere less well to physical laws.
- [Evaluation and benchmarking] Evaluation section: the central claim that T2V models (particularly open-source ones) still fail to adhere to physical laws rests on PhyDetEx being a trustworthy detector. The manuscript reports dataset sizes and a benchmarking outcome but provides no quantitative detector metrics (accuracy, precision/recall on the 500-video test split), error analysis, or ablation on the fine-tuning procedure; these are required to establish that the detector itself is not the source of the observed differences.
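The detector metrics requested in the second major comment can be computed with a few lines of code. The sketch below treats "implausible" as the positive class and assumes ground-truth and predicted labels are available as boolean sequences; the function name and the convention of returning 0.0 for an undefined ratio are illustrative choices, not the paper's evaluation code.

```python
from typing import Dict, Sequence


def detector_metrics(y_true: Sequence[bool], y_pred: Sequence[bool]) -> Dict[str, float]:
    """Accuracy, precision, and recall with 'implausible' (True) as the positive class."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    return {
        "accuracy": (tp + tn) / len(y_true),
        # Ratios with a zero denominator are reported as 0.0 by convention here.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }
```

Reporting precision and recall separately matters for this use case: a detector with low precision would overstate how often T2V models violate physics, while one with low recall would understate it.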
minor comments (2)
- [Abstract] Abstract: dataset sizes are stated but no key quantitative benchmarking results or detector performance numbers appear, making it harder for readers to gauge the strength of the findings at a glance.
- [Figures] Figure and table captions: ensure all figures showing example implausible videos or detector outputs are accompanied by explicit captions that list the violated physical principle and the model that produced the video.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments. We address each major comment point by point below, indicating planned revisions to strengthen the manuscript while maintaining the integrity of our contributions.
Referee: [PID dataset construction] PID dataset construction (train split): rewriting captions of real videos to induce specific violations (gravity, momentum, object permanence, etc.) is assumed to produce a distribution of failures representative of those that arise under natural-language prompting. No comparison of violation statistics between rewritten-prompt generations and organic generations, nor human validation of representativeness, is described; without this, the fine-tuned detector's scores on the benchmarked T2V outputs may not generalize, weakening the claim that open-source models adhere less well to physical laws.
Authors: We appreciate the referee highlighting this aspect of our dataset design. The train split of 2,588 paired videos was intentionally constructed by rewriting real-video captions to induce targeted physical violations, enabling controlled and diverse training examples for fine-tuning the VLM on specific principles such as gravity and momentum. This paired structure supports the explainer component as well. The independent test split of 500 manually annotated videos provides ground-truth evaluation. We acknowledge that an explicit statistical comparison to organic generations and additional human validation of representativeness would further support generalization claims. In the revised manuscript, we will add a dedicated discussion of the construction rationale and include results from a small-scale human validation study on violation representativeness. revision: yes
Referee: [Evaluation and benchmarking] Evaluation section: the central claim that T2V models (particularly open-source ones) still fail to adhere to physical laws rests on PhyDetEx being a trustworthy detector. The manuscript reports dataset sizes and a benchmarking outcome but provides no quantitative detector metrics (accuracy, precision/recall on the 500-video test split), error analysis, or ablation on the fine-tuning procedure; these are required to establish that the detector itself is not the source of the observed differences.
Authors: We agree that quantitative metrics are necessary to substantiate the trustworthiness of PhyDetEx and the validity of the benchmarking conclusions. The current manuscript prioritizes the overall findings on T2V model performance but does not include the requested detector-level evaluations. In the revised version, we will expand the evaluation section to report accuracy, precision, and recall on the 500-video test split, provide an error analysis, and include ablations on the fine-tuning procedure. These additions will directly address concerns about the detector as a potential source of observed differences. revision: yes
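The representativeness check raised in the first exchange could be approached by comparing how violation categories are distributed in rewritten-prompt generations versus organically prompted ones. The sketch below uses total variation distance over category frequencies; the choice of metric, the function names, and the category labels are assumptions made for illustration, since neither the report nor the rebuttal specifies a method.

```python
from collections import Counter
from typing import Dict, Iterable


def category_distribution(labels: Iterable[str]) -> Dict[str, float]:
    """Normalize violation-category counts into relative frequencies."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one label is required")
    return {category: n / total for category, n in counts.items()}


def total_variation(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Total variation distance between two discrete distributions (0 = identical)."""
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)
```

A value near 0 would suggest the rewritten prompts induce a similar mix of violations to organic prompting, while a value near 1 would indicate the two sets fail in largely different ways.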
Circularity Check
No circularity: empirical dataset construction and external benchmarking
The paper constructs the PID dataset by rewriting real-video captions to induce targeted physical violations, fine-tunes VLMs on the resulting paired data for detection and explanation, and then applies the fine-tuned model to benchmark external T2V generators. No mathematical derivation, fitted parameter, or prediction is presented; all quantitative results are obtained by running the detector on outputs from independent models. The central claims rest on the methodological assumption that caption rewriting produces representative violations, but this assumption is not derived from or equivalent to the paper's own outputs by construction. No self-citation chain, uniqueness theorem, or ansatz is invoked to justify the core pipeline. The work is therefore self-contained against external T2V models and does not reduce to its inputs.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Vision-language models can be fine-tuned on paired plausible/implausible videos to detect physical violations and generate explanations of violated principles.
- domain assumption: Caption rewriting of real videos induces T2V models to generate content that violates identifiable physical laws.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  unclear: Relation between the paper passage and the cited Recognition theorem.
  "each implausible video is generated by carefully rewriting the caption of its corresponding real-world video to induce T2V models producing physically implausible content"
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  unclear: Relation between the paper passage and the cited Recognition theorem.
  "lightweight fine-tuning approach, enabling VLMs to not only detect physically implausible events but also generate textual explanations on the violated physical principles"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
- PhyGround: Benchmarking Physical Reasoning in Generative World Models
  PhyGround is a new benchmark with curated prompts, a 13-law taxonomy, large-scale human annotations, and an open physics-specialized VLM judge for evaluating physical reasoning in generative video models.
- From Pixels to Concepts: Do Segmentation Models Understand What They Segment?
  CAFE benchmark reveals that promptable segmentation models often produce correct masks for misleading prompts, showing a gap between localization accuracy and true concept understanding.