PhyDetEx: Detecting and Explaining the Physical Plausibility of T2V Models

Keze Wang; Lei Zhang; Zeqing Wang

arxiv: 2512.01843 · v3 · pith:7DQBDYEJnew · submitted 2025-12-01 · 💻 cs.CV

Pith reviewed 2026-05-21 17:52 UTC · model grok-4.3

classification 💻 cs.CV

keywords text-to-video · physical plausibility · vision-language models · PID dataset · fine-tuning · model benchmarking · physics violations

The pith

Fine-tuned VLMs detect and explain physical implausibilities in text-to-video generations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a PID dataset of real videos paired with implausible versions created by rewriting their captions to induce specific physical violations. It applies a lightweight fine-tuning method to vision-language models so they can both identify these implausible events and describe the violated physical principles. This PhyDetEx system is then used to benchmark leading text-to-video models. The evaluation shows progress in visual quality but persistent failures to follow physical laws, particularly among open-source models. A sympathetic reader would care because video generation for practical use requires models that respect real-world physics rather than only produce visually convincing footage.

Core claim

We construct the PID dataset containing a test split of 500 manually annotated videos and a train split of 2,588 paired videos where each implausible video is produced by rewriting the caption of its corresponding real-world video. With this data we introduce a lightweight fine-tuning approach that turns VLMs into detectors and explainers of physical implausibility. The resulting PhyDetEx model is applied to benchmark state-of-the-art T2V systems, revealing that recent models have advanced toward physically plausible output yet still struggle to understand and adhere to physical laws, especially open-source variants.

What carries the argument

The PID dataset of caption-rewritten real-implausible video pairs used to fine-tune VLMs for joint detection of implausible events and generation of textual explanations for violated physical principles.
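As a rough illustration of the paired structure described above, one record and its expansion into supervision examples might look like the following. This is only a sketch under assumptions: the field names, file paths, captions, and target-text format are hypothetical and are not taken from the paper or its repository.

```python
from dataclasses import dataclass


@dataclass
class PidPair:
    """One hypothetical train-split record: a real video and its implausible twin."""
    real_video: str           # path to the real-world clip (assumed layout)
    real_caption: str         # caption of the real clip
    rewritten_caption: str    # caption edited to induce a physical violation
    implausible_video: str    # T2V output generated from the rewritten caption
    violated_principle: str   # short label for the violated principle


def to_examples(pair):
    """Expand one pair into two (video, target_text) supervision examples.

    The target strings are an assumed format: a verdict, plus an
    explanation stub for the implausible clip.
    """
    return [
        (pair.real_video, "plausible"),
        (pair.implausible_video, f"implausible: {pair.violated_principle}"),
    ]


# Toy record for illustration only; not data from the PID dataset.
pair = PidPair(
    real_video="real/0001.mp4",
    real_caption="A glass falls off a table and shatters on the floor.",
    rewritten_caption="A glass falls off a table and floats upward.",
    implausible_video="generated/0001.mp4",
    violated_principle="gravity",
)
examples = to_examples(pair)
```

Keeping both clips in one record makes it straightforward to build balanced batches, since every implausible example has a plausible counterpart with closely related content.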

If this is right

State-of-the-art T2V models can be systematically ranked for physical adherence using an automated detector and explainer.
Explanations of violated principles supply concrete signals that can guide targeted improvements in T2V training data or loss functions.
Open-source T2V models exhibit larger gaps in physical understanding than closed-source counterparts and therefore require focused remediation.
The lightweight fine-tuning recipe demonstrates that general VLMs can be specialized for physics evaluation without prohibitive compute.
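The ranking idea in the first point can be sketched as follows, assuming the detector emits one binary verdict per generated video. The function name, model names, and verdicts below are hypothetical placeholders, not the paper's protocol or results.

```python
from collections import defaultdict


def rank_models(verdicts):
    """Rank models by the share of their videos flagged as implausible.

    verdicts: iterable of (model_name, is_implausible) pairs, where
    is_implausible is the detector's binary verdict (assumed interface).
    Returns (model_name, implausible_rate) tuples, lowest rate first.
    """
    flagged = defaultdict(int)
    total = defaultdict(int)
    for model, is_implausible in verdicts:
        total[model] += 1
        flagged[model] += int(is_implausible)
    rates = {m: flagged[m] / total[m] for m in total}
    # Sort by rate, then by name so that ties are ordered deterministically.
    return sorted(rates.items(), key=lambda kv: (kv[1], kv[0]))


# Toy verdicts for two made-up models; not results from the paper.
toy_verdicts = [
    ("model_a", True), ("model_a", False), ("model_a", False), ("model_a", False),
    ("model_b", True), ("model_b", True), ("model_b", False), ("model_b", False),
]
ranking = rank_models(toy_verdicts)
```

A ranking of this kind is only as reliable as the detector behind it, which is why the referee report below asks for detector-level metrics.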

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same caption-rewriting technique could be applied to create evaluation sets for text-to-image or text-to-3D generators.
Persistent physical failures suggest that current T2V training corpora under-represent explicit physical constraints.
PhyDetEx outputs could be fed back into T2V training loops to curate or re-weight data that emphasizes physical correctness.

Load-bearing premise

Rewriting captions of real videos reliably produces videos that violate specific physical principles in ways that are both detectable and representative of real T2V failure modes.

What would settle it

A held-out set of T2V outputs containing independently verified physical violations (such as broken object permanence or gravity) on which the fine-tuned VLM either misses the violation or gives incorrect explanations would falsify the method's reliability.

Figures

Figures reproduced from arXiv: 2512.01843 by Keze Wang, Lei Zhang, Zeqing Wang.

**Figure 2.** Results of preliminary experiments in the Impossible …

**Figure 3.** Overview of the construction pipeline of the PID dataset and the training process of PhyDetEx. (a) The PID training split. The …

**Figure 4.** Qualitative comparison between our PhyDetEx and recent VLMs on detecting physical implausibility. We illustrate two rep…

Original abstract

Driven by the growing capacity and training scale, Text-to-Video (T2V) generation models have recently achieved substantial progress in video quality, length, and instruction-following capability. However, whether these models can understand physics and generate physically plausible videos remains a question. While Vision-Language Models (VLMs) have been widely used as general-purpose evaluators in various applications, they struggle to identify the physically impossible content from generated videos. To investigate this issue, we construct a PID (Physical Implausibility Detection) dataset, which consists of a test split of 500 manually annotated videos and a train split of 2,588 paired videos, where each implausible video is generated by carefully rewriting the caption of its corresponding real-world video to induce T2V models producing physically implausible content. With the constructed dataset, we introduce a lightweight fine-tuning approach, enabling VLMs to not only detect physically implausible events but also generate textual explanations on the violated physical principles. Taking the fine-tuned VLM as a physical plausibility detector and explainer, namely PhyDetEx, we benchmark a series of state-of-the-art T2V models to assess their adherence to physical laws. Our findings show that although recent T2V models have made notable progress toward generating physically plausible content, understanding and adhering to physical laws remains a challenging issue, especially for open-source models. Our dataset, training code, and checkpoints are available at https://github.com/Zeqing-Wang/PhyDetEx.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

PhyDetEx offers a paired dataset built by caption rewriting plus a fine-tuned VLM for detecting and explaining physical violations in T2V, but the method's value hinges on whether those induced errors match real model failures.

The punchline here is a new way to create training data for spotting physical implausibilities in text-to-video models. They take real videos, rewrite the captions to target specific violations such as gravity or momentum issues, generate the bad versions, and pair them up. Then they fine-tune a vision-language model to both flag the implausible events and explain which physical principles are violated. This PhyDetEx setup is used to check how current T2V models perform.

What works is the release of the full PID dataset along with training code and checkpoints. That makes the empirical part easy to check and extend. The idea of using caption rewriting for paired data is simple and avoids needing to collect implausible videos from scratch. Their conclusion that open-source models still have notable gaps in physical understanding fits with what many have seen in practice.

The softer part is the assumption in the dataset construction. Rewriting captions could lead to violations that are more systematic or obvious than the ones that show up in normal use of the models. If so, the fine-tuned detector might overfit to these artificial cases and give misleading scores when applied to typical generations. Also, without numbers on detection accuracy, false positive rates, or how the fine-tuning was tuned, it's difficult to gauge how strong the tool actually is.

This paper targets people working on video generation who want better ways to measure and fix physical errors. Anyone evaluating T2V outputs for realism could use the detector or the dataset. It has enough new elements and supporting materials to go through peer review rather than a desk reject.

Referee Report

2 major / 2 minor

Summary. The paper introduces PhyDetEx, a fine-tuned VLM-based detector and explainer for physical implausibilities in T2V video generations. It constructs the PID dataset (500 manually annotated test videos plus 2,588 paired train videos created by rewriting real-video captions to induce targeted physical violations such as breaches of gravity or momentum), fine-tunes VLMs on this data to both detect implausible events and generate textual explanations of violated principles, and benchmarks a range of state-of-the-art T2V models, concluding that physical-law adherence remains challenging, especially for open-source models.

Significance. If the detector proves reliable, the work supplies a practical evaluation tool and benchmark for physical plausibility in T2V systems, an area where current models still fall short. The public release of the PID dataset, training code, and checkpoints is a clear strength that supports reproducibility and follow-on research.

major comments (2)

[PID dataset construction] PID dataset construction (train split): rewriting captions of real videos to induce specific violations (gravity, momentum, object permanence, etc.) is assumed to produce a distribution of failures representative of those that arise under natural-language prompting. No comparison of violation statistics between rewritten-prompt generations and organic generations, nor human validation of representativeness, is described; without this, the fine-tuned detector's scores on the benchmarked T2V outputs may not generalize, weakening the claim that open-source models adhere less well to physical laws.
[Evaluation and benchmarking] Evaluation section: the central claim that T2V models (particularly open-source ones) still fail to adhere to physical laws rests on PhyDetEx being a trustworthy detector. The manuscript reports dataset sizes and a benchmarking outcome but provides no quantitative detector metrics (accuracy, precision/recall on the 500-video test split), error analysis, or ablation on the fine-tuning procedure; these are required to establish that the detector itself is not the source of the observed differences.
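The detector metrics requested in the second major comment can be computed from binary labels alone. The sketch below treats "implausible" as the positive class; the function name and toy labels are illustrative assumptions and do not reproduce any numbers from the paper.

```python
def detector_metrics(y_true, y_pred):
    """Binary metrics with 'implausible' (True) as the positive class.

    y_true: ground-truth labels (e.g. manual annotations on a test split).
    y_pred: detector verdicts for the same videos, in the same order.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    tn = sum(1 for t, p in pairs if not t and not p)
    n = len(pairs)
    return {
        "accuracy": (tp + tn) / n if n else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
    }


# Toy labels for illustration only; not drawn from the PID test split.
y_true = [True, True, True, False, False, False, False, True]
y_pred = [True, True, False, False, True, False, False, True]
metrics = detector_metrics(y_true, y_pred)
```

Reporting the false positive rate alongside precision and recall matters here because a detector that over-flags plausible videos would inflate the apparent failure rate of every benchmarked T2V model.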

minor comments (2)

[Abstract] Abstract: dataset sizes are stated but no key quantitative benchmarking results or detector performance numbers appear, making it harder for readers to gauge the strength of the findings at a glance.
[Figures] Figure and table captions: ensure all figures showing example implausible videos or detector outputs are accompanied by explicit captions that list the violated physical principle and the model that produced the video.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. We address each major comment point by point below, indicating planned revisions to strengthen the manuscript while maintaining the integrity of our contributions.

Referee: [PID dataset construction] PID dataset construction (train split): rewriting captions of real videos to induce specific violations (gravity, momentum, object permanence, etc.) is assumed to produce a distribution of failures representative of those that arise under natural-language prompting. No comparison of violation statistics between rewritten-prompt generations and organic generations, nor human validation of representativeness, is described; without this, the fine-tuned detector's scores on the benchmarked T2V outputs may not generalize, weakening the claim that open-source models adhere less well to physical laws.

Authors: We appreciate the referee highlighting this aspect of our dataset design. The train split of 2,588 paired videos was intentionally constructed by rewriting real-video captions to induce targeted physical violations, enabling controlled and diverse training examples for fine-tuning the VLM on specific principles such as gravity and momentum. This paired structure supports the explainer component as well. The independent test split of 500 manually annotated videos provides ground-truth evaluation. We acknowledge that an explicit statistical comparison to organic generations and additional human validation of representativeness would further support generalization claims. In the revised manuscript, we will add a dedicated discussion of the construction rationale and include results from a small-scale human validation study on violation representativeness. revision: yes
Referee: [Evaluation and benchmarking] Evaluation section: the central claim that T2V models (particularly open-source ones) still fail to adhere to physical laws rests on PhyDetEx being a trustworthy detector. The manuscript reports dataset sizes and a benchmarking outcome but provides no quantitative detector metrics (accuracy, precision/recall on the 500-video test split), error analysis, or ablation on the fine-tuning procedure; these are required to establish that the detector itself is not the source of the observed differences.

Authors: We agree that quantitative metrics are necessary to substantiate the trustworthiness of PhyDetEx and the validity of the benchmarking conclusions. The current manuscript prioritizes the overall findings on T2V model performance but does not include the requested detector-level evaluations. In the revised version, we will expand the evaluation section to report accuracy, precision, and recall on the 500-video test split, provide an error analysis, and include ablations on the fine-tuning procedure. These additions will directly address concerns about the detector as a potential source of observed differences. revision: yes
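The representativeness comparison raised in the first referee comment could be approached, in its simplest form, by comparing how often each violation category appears in rewritten-prompt generations versus organic generations. The sketch below uses total variation distance as one possible measure; the category labels and counts are hypothetical, and the choice of metric is an assumption rather than anything proposed in the paper or the rebuttal.

```python
from collections import Counter


def category_distribution(labels):
    """Normalize a list of violation-category labels into frequencies."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: count / total for category, count in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions (0 to 1)."""
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)


# Toy annotations for illustration only; category names are made up.
rewritten = ["gravity", "gravity", "momentum", "object_permanence"]
organic = ["gravity", "momentum", "momentum", "momentum"]
tv = total_variation(category_distribution(rewritten),
                     category_distribution(organic))
```

A distance near 0 would suggest the induced violations resemble organic ones at the category level, while a value near 1 would indicate a mismatch. Category frequencies alone say nothing about how subtle or obvious each violation is, so human validation would still be needed.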

Circularity Check

0 steps flagged

No circularity: empirical dataset construction and external benchmarking

The paper constructs the PID dataset by rewriting real-video captions to induce targeted physical violations, fine-tunes VLMs on the resulting paired data for detection and explanation, and then applies the fine-tuned model to benchmark external T2V generators. No mathematical derivation, fitted parameter, or prediction is presented; all quantitative results are obtained by running the detector on outputs from independent models. The central claims rest on the methodological assumption that caption rewriting produces representative violations, but this assumption is not derived from or equivalent to the paper's own outputs by construction. No self-citation chain, uniqueness theorem, or ansatz is invoked to justify the core pipeline. The work is therefore self-contained against external T2V models and does not reduce to its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that the caption-rewriting procedure produces videos whose physical violations are both genuine and representative, plus the assumption that fine-tuned VLMs can reliably generalize from the constructed pairs to outputs of other T2V models. No free parameters or invented entities are introduced.

axioms (2)

domain assumption: Vision-language models can be fine-tuned on paired plausible/implausible videos to detect physical violations and generate explanations of violated principles.
This is the core modeling assumption that enables PhyDetEx; it is invoked when the authors describe the lightweight fine-tuning approach.
domain assumption: Caption rewriting of real videos induces T2V models to generate content that violates identifiable physical laws.
Stated in the dataset construction paragraph of the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: "each implausible video is generated by carefully rewriting the caption of its corresponding real-world video to induce T2V models producing physically implausible content"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: "lightweight fine-tuning approach, enabling VLMs to not only detect physically implausible events but also generate textual explanations on the violated physical principles"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

PhyGround: Benchmarking Physical Reasoning in Generative World Models
cs.CV · 2026-05 · accept · novelty 7.0

PhyGround is a new benchmark with curated prompts, a 13-law taxonomy, large-scale human annotations, and an open physics-specialized VLM judge for evaluating physical reasoning in generative video models.

From Pixels to Concepts: Do Segmentation Models Understand What They Segment?
cs.CV · 2026-05 · unverdicted · novelty 7.0

CAFE benchmark reveals that promptable segmentation models often produce correct masks for misleading prompts, showing a gap between localization accuracy and true concept understanding.

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · cited by 2 Pith papers · 15 internal anchors

[1]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for un- derstanding, localization, text reading, and beyond.arXiv preprint arXiv:2308.12966, 2023. 7

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhao- hai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Jun- yang Lin. Qwen2.5-vl technical repor...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Impossible videos, 2025

Zechen Bai, Hai Ci, and Mike Zheng Shou. Impossible videos, 2025. 2, 3, 4, 6

work page 2025
[4]

VideoPhy: Evaluating Physical Commonsense for Video Generation

Hritik Bansal, Zongyu Lin, Tianyi Xie, Zeshun Zong, Michal Yarom, Yonatan Bitton, Chenfanfu Jiang, Yizhou Sun, Kai- Wei Chang, and Aditya Grover. Videophy: Evaluating physical commonsense for video generation.arXiv preprint arXiv:2406.03520, 2024. 1, 2

work page internal anchor Pith review Pith/arXiv arXiv 2024
[5]

Videophy-2: A challenging action-centric physical commonsense evaluation in video generation.arXiv preprint arXiv:2503.06800, 2025

Hritik Bansal, Clark Peng, Yonatan Bitton, Roman Golden- berg, Aditya Grover, and Kai-Wei Chang. Videophy-2: A challenging action-centric physical commonsense evaluation in video generation.arXiv preprint arXiv:2503.06800, 2025. 1, 2

work page arXiv 2025
[6]

T2vworldbench: A benchmark for evaluating world knowledge in text-to-video generation, 2025

Yubin Chen, Xuyang Guo, Zhenmei Shi, Zhao Song, and Jiahao Zhang. T2vworldbench: A benchmark for evaluating world knowledge in text-to-video generation, 2025. 2

work page 2025
[7]

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhang- wei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test- time scaling.arXiv preprint arXiv:2412.05271, 2024. 7

work page internal anchor Pith review Pith/arXiv arXiv 2024
[8]

How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites.Science China Information Sciences, 67(12):220101,

Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhang- wei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites.Science China Information Sciences, 67(12):220101,

work page
[9]

Internvl: Scaling up vision foundation mod- els and aligning for generic visual-linguistic tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation mod- els and aligning for generic visual-linguistic tasks. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024. 2

work page 2024
[10]

Sora as an agi world model? a complete survey on text-to-video generation.arXiv preprint arXiv:2403.05131, 2024

Joseph Cho, Fachrina Dewi Puspitasari, Sheng Zheng, Jingyao Zheng, Lik-Hang Lee, Tae-Ho Kim, Choong Seon Hong, and Chaoning Zhang. Sora as an agi world model? a complete survey on text-to-video generation.arXiv preprint arXiv:2403.05131, 2024. 2

work page arXiv 2024
[11]

Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic ca- pabilities, 2025

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, and Luke Marris et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic ca- pabilities, 2025. 2

work page 2025
[12]

Sharegpt-4o: Comprehensive mul- timodal annotations with gpt-4o, 2024

Erfei Cui, Yinan He, Zheng Ma, Zhe Chen, Hao Tian, Weiyun Wang, Kunchang Li, Yi Wang, Wenhai Wang, Xizhou Zhu, Lewei Lu, Tong Lu, Yali Wang, Limin Wang, Yu Qiao, and Jifeng Dai. Sharegpt-4o: Comprehensive mul- timodal annotations with gpt-4o, 2024. 2

work page 2024
[13]

Emerging Properties in Unified Multimodal Pretraining

Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, and Haoqi Fan. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025. 3

work page internal anchor Pith review Pith/arXiv arXiv 2025
[14]

Worldscore: A unified evaluation benchmark for world generation.arXiv preprint arXiv:2504.00983, 2025

Haoyi Duan, Hong-Xing Yu, Sirui Chen, Li Fei-Fei, and Ji- ajun Wu. Worldscore: A unified evaluation benchmark for world generation.arXiv preprint arXiv:2504.00983, 2025. 2

work page arXiv 2025
[15]

Veo - google deepmind, 2025.https: //deepmind.google/models/veo/[2025.09.08]

Google DeepMind. Veo - google deepmind, 2025.https: //deepmind.google/models/veo/[2025.09.08]. 1, 7

work page 2025
[16]

Making the v in vqa matter: Elevating the role of image understanding in visual question answer- ing

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Ba- tra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answer- ing. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017. 3

work page 2017
[17]

Visual program- ming: Compositional visual reasoning without training

Tanmay Gupta and Aniruddha Kembhavi. Visual program- ming: Compositional visual reasoning without training. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 14953–14962, 2023. 3

work page 2023
[18]

VideoScore: Building automatic metrics to simulate fine-grained human feedback for video generation

Xuan He, Dongfu Jiang, Ge Zhang, Max Ku, Achint Soni, Sherman Siu, Haonan Chen, Abhranil Chandra, Ziyan Jiang, Aaran Arulraj, Kai Wang, Quy Duc Do, Yuansheng Ni, Bo- han Lyu, Yaswanth Narsupalli, Rongqi Fan, Zhiheng Lyu, Bill Yuchen Lin, and Wenhu Chen. VideoScore: Building automatic metrics to simulate fine-grained human feedback for video generation. In...

work page 2024
[19]

Vid2world: Crafting video diffusion models to interactive world models, 2025

Siqiao Huang, Jialong Wu, Qixing Zhou, Shangchen Miao, and Mingsheng Long. Vid2world: Crafting video diffusion models to interactive world models, 2025. 1

work page 2025
[20]

Gqa: A new dataset for real-world visual reasoning and compositional question answering

Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 6700–6709, 2019. 3

work page 2019
[21]

Vlm-r 3: Region recognition, reasoning, and refinement for enhanced multimodal chain-of-thought.arXiv preprint arXiv:2505.16192, 2025

Chaoya Jiang, Yongrui Heng, Wei Ye, Han Yang, Haiyang Xu, Ming Yan, Ji Zhang, Fei Huang, and Shikun Zhang. Vlm-r 3: Region recognition, reasoning, and refinement for enhanced multimodal chain-of-thought.arXiv preprint arXiv:2505.16192, 2025. 2

work page arXiv 2025
[22]

Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs

Bingyi Kang, Yang Yue, Rui Lu, Zhijie Lin, Yang Zhao, Kaixin Wang, Gao Huang, and Jiashi Feng. How far is video generation from world model? – a physical law perspective. arXiv preprint arXiv:2406.16860, 2024. 1, 2

work page internal anchor Pith review Pith/arXiv arXiv 2024
[23]

Klingai: Image to video.https : / / app

Kling. Klingai: Image to video.https : / / app . klingai.com/global, 2025. Accessed: 2025-09-08. 1, 2

work page 2025
[24]

HunyuanVideo: A Systematic Framework For Large Video Generative Models

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024. 1, 2, 7

work page internal anchor Pith review Pith/arXiv arXiv 2024
[25]

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024. 3, 7

work page internal anchor Pith review Pith/arXiv arXiv 2024
[26]

Sora generates videos with stunning geometrical consistency.arXiv preprint arXiv:2402.17403, 2024

Xuanyi Li, Daquan Zhou, Chenxu Zhang, Shaodong Wei, Qibin Hou, and Ming-Ming Cheng. Sora generates videos with stunning geometrical consistency.arXiv preprint arXiv:2402.17403, 2024. 2

work page arXiv 2024
[27]

Microsoft coco: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll´ar, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer vision–ECCV 2014: 13th European conference, zurich, Switzerland, September 6-12, 2014, proceedings, part v 13, pages 740–755. Springer, 2014. 2

work page 2014
[28]

Visual instruction tuning, 2023

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023. 2, 3

work page 2023
[29]

Learn to explain: Multimodal reasoning via thought chains for science question answering.Advances in Neural Information Processing Systems, 35:2507–2521,

Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering.Advances in Neural Information Processing Systems, 35:2507–2521,

work page
[30]

Step-video-t2v technical report: The practice, challenges, and future of video foundation model, 2025

Guoqing Ma, Haoyang Huang, Kun Yan, Liangyu Chen, Nan Duan, Shengming Yin, Changyi Wan, Ranchen Ming, Xiao- niu Song, Xing Chen, Yu Zhou, Deshan Sun, Deyu Zhou, Jian Zhou, Kaijun Tan, Kang An, Mei Chen, Wei Ji, Qil- ing Wu, Wen Sun, Xin Han, Yanan Wei, Zheng Ge, Aojie Li, Bin Wang, Bizhu Huang, Bo Wang, Brian Li, Changxing Miao, Chen Xu, Chenfei Wu, Cheng...

work page 2025
[31]

Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation

Fanqing Meng, Jiaqi Liao, Xinyu Tan, Wenqi Shao, Quan- feng Lu, Kaipeng Zhang, Yu Cheng, Dianqi Li, Yu Qiao, and Ping Luo. Towards world simulator: Crafting physical commonsense-based benchmark for video generation.arXiv preprint arXiv:2410.05363, 2024. 1

work page internal anchor Pith review Pith/arXiv arXiv 2024
[32]

Towards world simulator: Crafting physical commonsense- based benchmark for video generation

Fanqing Meng, Jiaqi Liao, Xinyu Tan, Quanfeng Lu, Wenqi Shao, Kaipeng Zhang, Yu Cheng, Dianqi Li, and Ping Luo. Towards world simulator: Crafting physical commonsense- based benchmark for video generation. InForty-second In- ternational Conference on Machine Learning, 2025. 2

work page 2025
[33]

Hailuo ai: Transform idea to visual with ai, 2025

MiniMax. Hailuo ai: Transform idea to visual with ai, 2025. https://hailuoai.video/[2025.09.08]. 2

work page 2025
[34]

Sora-2, 2025

OpenAI. Sora-2, 2025. https://openai.com/index/sora-2/. 7

work page 2025
[35]

Pika, 2025

Pika Lab. Pika, 2025. https://pika.art/ [2025.09.09]. 2

work page 2025
[36]

Worldsimbench: Towards video generation models as world simulators

Yiran Qin, Zhelun Shi, Jiwen Yu, Xijun Wang, Enshen Zhou, Lijun Li, Zhenfei Yin, Xihui Liu, Lu Sheng, Jing Shao, LEI BAI, and Ruimao Zhang. Worldsimbench: Towards video generation models as world simulators. InForty-second In- ternational Conference on Machine Learning, 2025. 2

work page 2025
[37]

Direct preference optimization: Your language model is secretly a reward model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christo- pher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. InThirty-seventh Conference on Neural In- formation Processing Systems, 2023. 8

work page 2023
[38]

A-okvqa: A benchmark for visual question answering using world knowl- edge.arXiv, 2022

Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-okvqa: A benchmark for visual question answering using world knowl- edge.arXiv, 2022. 3

work page 2022
[39]

ViperGPT: Visual Inference via Python Execution for Reasoning

D ´ıdac Sur´ıs, Sachit Menon, and Carl V ondrick. Vipergpt: Visual inference via python execution for reasoning.arXiv preprint arXiv:2303.08128, 2023. 3

work page internal anchor Pith review Pith/arXiv arXiv 2023
[40]

Vidgen-1m: A large-scale dataset for text-to-video genera- tion.arXiv preprint arXiv:2408.02629, 2024

Zhiyu Tan, Xiaomeng Yang, Luozheng Qin, and Hao Li. Vidgen-1m: A large-scale dataset for text-to-video genera- tion.arXiv preprint arXiv:2408.02629, 2024. 3, 5, 7

[41]

Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3156–3164, 2015.

[42]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fan...

[43]

Wentao Wan, Nan Kang, Zeqing Wang, Zhuojie Yang, Liang Lin, and Keze Wang. A continual learning paradigm for non-differentiable visual programming frameworks on visual reasoning tasks. arXiv preprint arXiv:2309.09809, 2023.

[44]

Jing Wang, Ao Ma, Ke Cao, Jun Zheng, Zhanjie Zhang, Jiasong Feng, Shanyuan Liu, Yuhang Ma, Bo Cheng, Dawei Leng, et al. Wisa: World simulator assistant for physics-aware text-to-video generation. arXiv preprint arXiv:2503.08153, 2025.

[45]

Wenhao Wang and Yi Yang. Vidprom: A million-scale real prompt-gallery dataset for text-to-video diffusion models. In Thirty-eighth Conference on Neural Information Processing Systems, 2024.

[46]

Yibin Wang, Zhiyu Tan, Junyan Wang, Xiaomeng Yang, Cheng Jin, and Hao Li. Lift: Leveraging human feedback for text-to-video model alignment. arXiv preprint arXiv:2412.04814, 2024.

[47]

Zeqing Wang, Wentao Wan, Qiqing Lao, Runmeng Chen, Minjie Lang, Keze Wang, and Liang Lin. Towards top-down reasoning: An explainable multi-agent approach for visual question answering. arXiv preprint arXiv:2311.17331, 2023.

[48]

Zeqing Wang, Qingyang Ma, Wentao Wan, Haojie Li, Keze Wang, and Yonghong Tian. Is this generated person existed in real-world? fine-grained detecting and calibrating abnormal human-body. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 21226–21237, 2025.

[49]

Zeqing Wang, Xinyu Wei, Bairui Li, Zhen Guo, Jinrui Zhang, Hongyang Wei, Keze Wang, and Lei Zhang. Videoverse: How far is your t2v generator from a world model? arXiv preprint arXiv:2510.08398, 2025.

[50]

Zeqing Wang, Shiyuan Zhang, Chengpei Tang, and Keze Wang. Timecausality: Evaluating the causal ability in time dimension for vision language models. arXiv preprint arXiv:2505.15435, 2025.

[51]

Xinyu Wei, Jinrui Zhang, Zeqing Wang, Hongyang Wei, Zhen Guo, and Lei Zhang. Tiif-bench: How does your t2i model follow your instructions? arXiv preprint arXiv:2506.02161, 2025.

[52]

Jiannan Xiang, Guangyi Liu, Yi Gu, Qiyue Gao, Yuting Ning, Yuheng Zha, Zeyu Feng, Tianhua Tao, Shibo Hao, Yemin Shi, Zhengzhong Liu, Eric P. Xing, and Zhiting Hu. Pandora: Towards general world model with natural language actions and video states, 2024.

[53]

Junfei Xiao, Ceyuan Yang, Lvmin Zhang, Shengqu Cai, Yang Zhao, Yuwei Guo, Gordon Wetzstein, Maneesh Agrawala, Alan Yuille, and Lu Jiang. Captain cinema: Towards short movie generation. arXiv preprint arXiv:2507.18634, 2025.

[54]

Qiyao Xue, Xiangyu Yin, Boyuan Yang, and Wei Gao. Phyt2v: Llm-guided iterative self-refinement for physics-grounded text-to-video generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 18826–18836, 2025.

[55]

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024.

[56]

Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. Minicpm-v: A gpt-4v level mllm on your phone. arXiv preprint arXiv:2408.01800, 2024.

[57]

Jun Zhan, Junqi Dai, Jiasheng Ye, Yunhua Zhou, Dong Zhang, Zhigeng Liu, Xin Zhang, Ruibin Yuan, Ge Zhang, Linyang Li, Hang Yan, Jie Fu, Tao Gui, Tianxiang Sun, Yu-Gang Jiang, and Xipeng Qiu. AnyGPT: Unified multimodal LLM with discrete sequence modeling. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1...

[58]

Renrui Zhang, Xinyu Wei, Dongzhi Jiang, Ziyu Guo, Yichi Zhang, Chengzhuo Tong, Jiaming Liu, Aojun Zhou, Shanghang Zhang, Peng Gao, and Hongsheng Li. MAVIS: Mathematical visual instruction tuning with an automatic data engine. In The Thirteenth International Conference on Learning Representations, 2025.

[59]

Tao Zhang, Xiangtai Li, Hao Fei, Haobo Yuan, Shengqiong Wu, Shunping Ji, Change Loy Chen, and Shuicheng Yan. Omg-llava: Bridging image-level, object-level, pixel-level reasoning and understanding. In NeurIPS, 2024.

[60]

Xiangdong Zhang, Jiaqi Liao, Shaofeng Zhang, Fanqing Meng, Xiangpeng Wan, Junchi Yan, and Yu Cheng. Videorepa: Learning physics for video generation through relational alignment with foundation models. arXiv preprint arXiv:2505.23656, 2025.

[61]

Dian Zheng, Ziqi Huang, Hongbo Liu, Kai Zou, Yinan He, Fan Zhang, Yuanhan Zhang, Jingwen He, Wei-Shi Zheng, Yu Qiao, et al. Vbench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness. arXiv preprint arXiv:2503.21755, 2025.

[62]

Zheng Zhu, Xiaofeng Wang, Wangbo Zhao, Chen Min, Nianchen Deng, Min Dou, Yuqi Wang, Botian Shi, Kai Wang, Chi Zhang, et al. Is sora a world simulator? a comprehensive survey on general world models and beyond. arXiv preprint arXiv:2405.03520, 2024.

PhyDetEx: Detecting and Explaining the Physical Plausibility of T2V Models Supplementary Material We high...


