
arxiv: 2512.15693 · v2 · pith:GTDIOGX7new · submitted 2025-12-17 · 💻 cs.CV

Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning

Pith reviewed 2026-05-21 16:38 UTC · model grok-4.3

classification 💻 cs.CV
keywords AI-generated video detection · multimodal large language model · visual artifacts · spatio-temporal reasoning · explainable detection · video generation

The pith

Skyra detects AI-generated videos by spotting and explaining human-perceivable visual artifacts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a multimodal large language model that learns to find specific visual flaws in AI-made videos and uses those flaws as direct evidence for both deciding whether a video is fake and explaining why. It builds this capability through a dataset of human-annotated artifact examples and a two-stage training process that improves the model's ability to notice inconsistencies across space and time. A new benchmark with videos from many different generators is used to show that the resulting system outperforms prior detection methods. If the approach holds, detection tools would move from opaque binary labels to interpretable outputs that highlight concrete visual reasons. This shift matters for building trust in systems meant to counter the spread of misleading AI videos.

Core claim

Skyra is a multimodal large language model that identifies human-perceivable spatio-temporal visual artifacts in AI-generated videos and treats those artifacts as grounded evidence for accurate detection together with human-readable explanations.

What carries the argument

Two-stage training on a large-scale dataset of fine-grained human annotations for visual artifacts, which strengthens the model's perception of inconsistencies and its ability to verbalize them as detection rationale.

If this is right

  • Detection outputs include specific visual reasons that humans can verify instead of a single yes-or-no label.
  • Performance gains appear across benchmarks that include videos from more than ten current generators.
  • The evaluation process surfaces patterns in artifact types that can inform refinements to future detectors.
  • Explainable outputs become available for applications that require human oversight of AI video content.
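One way to picture an output that carries verifiable reasons rather than a bare label is a small record type. The sketch below is only an illustration: the class names, field names, and the example explanation are hypothetical and are not taken from the paper; only the "Shape Distortion" label comes from the figure captions on this page.

```python
from dataclasses import dataclass, field


@dataclass
class ArtifactEvidence:
    """One human-checkable reason (all field names are hypothetical)."""
    artifact_type: str   # a taxonomy label, e.g. "Shape Distortion"
    explanation: str     # free-text description of what looks wrong
    start_s: float       # start of the affected time span, in seconds
    end_s: float         # end of the affected time span, in seconds
    box: tuple           # (x1, y1, x2, y2), normalized to [0, 1]


@dataclass
class DetectionReport:
    """A verdict plus the evidence that is supposed to justify it."""
    verdict: str                                  # "real" or "fake"
    evidence: list = field(default_factory=list)  # list of ArtifactEvidence

    def summary(self) -> str:
        # Render the verdict followed by one line per piece of evidence.
        lines = [f"Verdict: {self.verdict}"]
        for e in self.evidence:
            lines.append(
                f"- {e.artifact_type} at {e.start_s:.1f}-{e.end_s:.1f}s, "
                f"box {e.box}: {e.explanation}"
            )
        return "\n".join(lines)


# Made-up example; the explanation text is invented for illustration.
report = DetectionReport(
    verdict="fake",
    evidence=[
        ArtifactEvidence(
            artifact_type="Shape Distortion",
            explanation="The cup handle merges into the hand.",
            start_s=1.0,
            end_s=2.5,
            box=(0.1, 0.2, 0.4, 0.6),
        )
    ],
)
print(report.summary())
```

Keeping the time span and box next to each explanation is what would let a reviewer jump to the cited frames and check the claim directly.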

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same artifact-grounding idea could be tested on other media such as AI-generated images to see whether explanations remain effective.
  • Deployment in practice would likely need periodic retraining whenever new video generators introduce previously unseen artifact patterns.
  • Collecting ongoing human annotations might prove more scalable if automated proposals for candidate artifacts are first generated by the model itself.

Load-bearing premise

The human annotations on the training videos accurately identify the artifacts that will appear and remain useful in videos made by generators never seen during development.

What would settle it

Run Skyra on a fresh collection of videos produced by an entirely new video generator outside the training set and the evaluation benchmark, then measure whether artifact identification accuracy and overall detection performance stay high.
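A minimal sketch of how such a check could be scored, assuming predictions have already been collected as (source, true label, predicted label) triples. The source names and the toy predictions are placeholders, not results from the paper.

```python
from collections import defaultdict


def per_source_accuracy(records):
    """Return detection accuracy for each video source.

    records: iterable of (source, true_label, predicted_label) triples,
    where labels are the strings "real" or "fake".
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for source, truth, predicted in records:
        totals[source] += 1
        hits[source] += int(truth == predicted)
    return {source: hits[source] / totals[source] for source in totals}


# Toy, made-up predictions for an unseen generator and for real footage.
records = [
    ("new_generator", "fake", "fake"),
    ("new_generator", "fake", "real"),
    ("new_generator", "fake", "fake"),
    ("new_generator", "fake", "fake"),
    ("camera_footage", "real", "real"),
    ("camera_footage", "real", "real"),
]
scores = per_source_accuracy(records)
print(scores)
```

Reporting accuracy per source, rather than one pooled number, keeps a weak result on the new generator from being masked by strong results elsewhere.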

Figures

Figures reproduced from arXiv: 2512.15693 by Jie Zhou, Jiwen Lu, Lei Chen, Runze Sun, Wenzhao Zheng, Yanran Zhang, Yifei Li, Yu Zheng.

Figure 1: Performance on ViF-Bench. Our method outperforms …
Figure 2: Skyra leverages human-perceivable artifacts in AI-generated videos as grounded evidence for detection and explanation. Compared to off-the-shelf MLLMs and previous MLLM-based detectors, Skyra demonstrates superior artifact perception and detection capabilities.
Figure 3: Overview of the ViF-CoT-4K dataset. (a) The hierarchical taxonomy of AI-generated video artifacts. (b) Visual examples of artifacts under our taxonomy. (c) Construction pipeline of ViF-CoT-4K dataset, including authentic data collection and AI-generated video collection, manual annotation, and the step-by-step chain-of-thought explanation data construction process.
Figure 4: Statistics of the ViF-CoT-4K and ViF-Bench.
Figure 5: Overview of Skyra. We leverage a two-stage training pipeline to improve Skyra’s artifacts perception and detection capabilities: …
Figure 6: Case Study. More examples are provided in the appendix.
Figure 7: Annotation platform UI.
Figure 9: Visualization of Class Activation Maps (CAMs) produced …
Figure 10: Response examples of off-the-shelf MLLMs.
Figure 11: Response examples of existing MLLM-based detector, BusterX++ […]
Figure 12: System prompt and user prompt design.
Figure 13: Chain-of-Thought Annotation Prompt.
Figure 14: ViF-Bench Video Sample Examples-I
Figure 15: ViF-Bench Video Sample Examples-II
Figure 16: Skyra’s Response Example on Real Videos, I
Figure 17: Skyra’s Response Example on Real Videos, II
Figure 18: Skyra’s Response Example on Fake Videos, Texture Anomaly-Structure Anomaly
Figure 19: Skyra’s Response Example on Fake Videos, Color & Lighting Anomaly-Color Over-Saturation
Figure 20: Skyra’s Response Example on Fake Videos, Move Forgery-Camera Motion Inconsistency
Figure 21: Skyra’s Response Example on Fake Videos, Object Inconsistency-Shape Distortion
Figure 22: Skyra’s Response Example on Fake Videos, Interaction Inconsistency-Abnormal Rigid-Body Crossing
Figure 23: Skyra’s Response Example on Fake Videos, Unnatural Movement-Unnatural Human Movement
Figure 24: Skyra’s Response Example on Fake Videos, Violation of Causality Law-Violation of Physical Law
Figure 25: Skyra’s Response Example on Fake Videos, Violation of Commonsense-Text Distortion
Original abstract

The misuse of AI-driven video generation technologies has raised serious social concerns, highlighting the urgent need for reliable AI-generated video detectors. However, most existing methods are limited to binary classification and lack the necessary explanations for human interpretation. In this paper, we present Skyra, a specialized multimodal large language model (MLLM) that identifies human-perceivable visual artifacts in AI-generated videos and leverages them as grounded evidence for both detection and explanation. To support this objective, we construct ViF-CoT-4K for Supervised Fine-Tuning (SFT), which represents the first large-scale AI-generated video artifact dataset with fine-grained human annotations. We then develop a two-stage training strategy that systematically enhances our model's spatio-temporal artifact perception, explanation capability, and detection accuracy. To comprehensively evaluate Skyra, we introduce ViF-Bench, a benchmark comprising 3K high-quality samples generated by over ten state-of-the-art video generators. Extensive experiments demonstrate that Skyra surpasses existing methods across multiple benchmarks, while our evaluation yields valuable insights for advancing explainable AI-generated video detection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Skyra, a multimodal large language model (MLLM) for AI-generated video detection that identifies human-perceivable spatio-temporal visual artifacts and uses them as grounded evidence for both binary detection and natural-language explanations. It constructs the ViF-CoT-4K dataset (4K samples with fine-grained human annotations) for supervised fine-tuning, applies a two-stage training procedure to improve artifact perception and reasoning, and evaluates on the newly introduced ViF-Bench (3K high-quality samples from >10 generators), claiming superior detection accuracy and explanation quality over prior methods.

Significance. If the central claims hold, the work advances explainable detection of synthetic video, a timely problem given rapid progress in generative models. The emphasis on human-perceivable artifacts and the release of annotated datasets plus a multi-generator benchmark could support more interpretable and robust detectors than current binary classifiers. The two-stage training and grounded CoT reasoning represent a concrete methodological direction worth further exploration.

major comments (2)
  1. [Section 5] ViF-Bench evaluation (Section 5): The central claim that fine-grained annotations in ViF-CoT-4K capture transferable spatio-temporal artifacts relies on generalization to videos from >10 unseen generators. No cross-generator ablation or leave-one-generator-out analysis is reported; without it, reported gains in accuracy and explanation quality could arise from the base MLLM visual encoder or generic reasoning rather than the intended artifact-grounding mechanism.
  2. [Section 4.2] Two-stage training description (Section 4.2): The first stage is described as enhancing spatio-temporal artifact perception, yet the precise supervision signals, loss terms, and how artifact labels are converted into training targets are not specified. This makes it impossible to determine whether the second-stage detection and CoT improvements are attributable to the proposed grounded reasoning or to standard SFT effects.
minor comments (2)
  1. [Section 5] The exact list of generators and generation parameters used for ViF-Bench should be tabulated for reproducibility; the abstract's phrase 'over ten' is insufficient.
  2. Figure captions and axis labels in the qualitative results should explicitly indicate which frames or regions correspond to the cited artifacts to aid reader verification.
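The leave-one-generator-out analysis asked for in major comment 1 could be set up along the following lines. This is a sketch under stated assumptions, not the authors' protocol: the sample keys ("video", "generator") and the generator names are hypothetical, and it covers only the generated videos. Real footage would need its own disjoint train/test split.

```python
def leave_one_generator_out(fake_samples):
    """Yield (held_out, train, test) folds over generated videos only.

    Each sample is a dict with the hypothetical keys "video" and
    "generator". In every fold, all videos from one generator are
    withheld from training and used only for testing.
    """
    generators = sorted({s["generator"] for s in fake_samples})
    for held_out in generators:
        train = [s for s in fake_samples if s["generator"] != held_out]
        test = [s for s in fake_samples if s["generator"] == held_out]
        yield held_out, train, test


# Placeholder data; generator names are invented.
fake_samples = [
    {"video": "a.mp4", "generator": "gen_A"},
    {"video": "b.mp4", "generator": "gen_B"},
    {"video": "c.mp4", "generator": "gen_A"},
    {"video": "d.mp4", "generator": "gen_C"},
]
folds = list(leave_one_generator_out(fake_samples))
for held_out, train, test in folds:
    print(held_out, len(train), len(test))
```

If accuracy on each held-out generator stays close to the in-distribution figure, that would support the claim that the annotated artifacts transfer rather than being memorized per generator.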

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which has helped us identify areas for improvement in our manuscript. Below, we provide point-by-point responses to the major comments and indicate the revisions we plan to implement.

Point-by-point responses
  1. Referee: [Section 5] ViF-Bench evaluation (Section 5): The central claim that fine-grained annotations in ViF-CoT-4K capture transferable spatio-temporal artifacts relies on generalization to videos from >10 unseen generators. No cross-generator ablation or leave-one-generator-out analysis is reported; without it, reported gains in accuracy and explanation quality could arise from the base MLLM visual encoder or generic reasoning rather than the intended artifact-grounding mechanism.

    Authors: We appreciate this observation. Our ViF-Bench does evaluate on videos generated by more than 10 state-of-the-art models, many of which were not used in creating the ViF-CoT-4K training set, providing evidence of generalization to unseen generators. However, we acknowledge that an explicit cross-generator ablation would more rigorously isolate the contribution of our artifact-grounding approach. In the revised manuscript, we will add a leave-one-generator-out analysis to demonstrate that the performance gains persist when excluding specific generators from training. revision: yes

  2. Referee: [Section 4.2] Two-stage training description (Section 4.2): The first stage is described as enhancing spatio-temporal artifact perception, yet the precise supervision signals, loss terms, and how artifact labels are converted into training targets are not specified. This makes it impossible to determine whether the second-stage detection and CoT improvements are attributable to the proposed grounded reasoning or to standard SFT effects.

    Authors: We agree that additional details on the training procedure are necessary for reproducibility and to clarify the contributions. In the revised version of the paper, we will expand Section 4.2 to include the specific supervision signals used in the first stage (such as artifact localization and description tasks derived from the human annotations), the loss functions employed (including the primary language modeling loss and any auxiliary objectives), and how the annotations are processed into training targets. revision: yes
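For orientation, the "primary language modeling loss" mentioned in the second response usually refers to the standard next-token objective used in supervised fine-tuning. The form below is that generic objective, written under assumed notation; the page does not give the paper's exact loss or any auxiliary terms. Here V stands for the sampled video frames, q for the text prompt, and y_1, …, y_T for the target tokens derived from the annotations.

```latex
\mathcal{L}_{\mathrm{LM}}(\theta)
  = -\frac{1}{T} \sum_{t=1}^{T}
    \log p_\theta\left( y_t \mid y_{<t},\, V,\, q \right)
```

Under this reading, the referee's question amounts to asking what the target sequence y contains in each stage, since the loss itself is the same as in ordinary SFT.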

Circularity Check

0 steps flagged

No circularity; standard data collection, training, and benchmarking pipeline

Full rationale

The paper constructs a new human-annotated dataset ViF-CoT-4K for supervised fine-tuning, applies a two-stage training process to improve artifact perception and detection, and evaluates on a separately introduced benchmark ViF-Bench containing videos from over ten generators. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the derivation. The central claims rest on empirical training and external human annotations rather than reducing outputs to inputs by construction. This is a self-contained ML pipeline with independent evaluation data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 3 invented entities

The central claim depends on the quality of human artifact annotations and the effectiveness of the two-stage training procedure. No free parameters are explicitly fitted in the abstract. The main axioms are domain assumptions about annotation reliability and generalization. The new model and datasets are introduced entities without independent external validation in the provided text.

axioms (1)
  • domain assumption: Human annotations in ViF-CoT-4K accurately and comprehensively identify the human-perceivable visual artifacts that distinguish AI-generated videos.
    The supervised fine-tuning and subsequent performance claims rest directly on these annotations.
invented entities (3)
  • Skyra MLLM (no independent evidence)
    purpose: Multimodal model specialized for spatio-temporal artifact perception and grounded explanation.
    New model presented in the paper.
  • ViF-CoT-4K dataset (no independent evidence)
    purpose: Large-scale training set with fine-grained artifact annotations for SFT.
    Newly constructed dataset described as the first of its kind.
  • ViF-Bench (no independent evidence)
    purpose: Evaluation benchmark containing videos from over ten state-of-the-art generators.
    New benchmark introduced for comprehensive testing.

pith-pipeline@v0.9.0 · 5737 in / 1485 out tokens · 64280 ms · 2026-05-21T16:38:47.483353+00:00 · methodology



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating

    cs.CV 2026-05 unverdicted novelty 7.0

    CaC is a hierarchical spatiotemporal concentrating reward model for video anomalies that reports 25.7% accuracy gains on fine-grained benchmarks and 11.7% anomaly reduction in generated videos via a new dataset and GR...

Reference graph

Works this paper leans on

124 extracted references · 124 canonical work pages · cited by 1 Pith paper · 31 internal anchors

  1. [1]

    Ai-generated video detection via spatial-temporal anomaly learning

    Jianfa Bai, Man Lin, Gang Cao, and Zijie Lou. Ai-generated video detection via spatial-temporal anomaly learning. In Chinese Conference on Pattern Recognition and Computer Vision (PRCV), pages 460–470. Springer, 2024. 1, 2, 5, 7

  2. [2]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 1, 2, 4, 5, 7

  3. [3]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023. 1, 4

  4. [4]

    SkyReels-V2: Infinite-length Film Generative Model

    Guibin Chen, Dixuan Lin, Jiangping Yang, Chunze Lin, Junchen Zhu, Mingyuan Fan, Hao Zhang, Sheng Chen, Zheng Chen, Chengcheng Ma, et al. Skyreels-v2: Infinite-length film generative model.arXiv preprint arXiv:2504.13074, 2025. 4

  5. [5]

    Demamba: Ai-generated video detec- tion on million-scale genvideo benchmark.arXiv preprint arXiv:2405.19707, 2024

    Haoxing Chen, Yan Hong, Zizheng Huang, Zhuoer Xu, Zhangxuan Gu, Yaohui Li, Jun Lan, Huijia Zhu, Jianfu Zhang, Weiqiang Wang, et al. Demamba: Ai-generated video detec- tion on million-scale genvideo benchmark.arXiv preprint arXiv:2405.19707, 2024. 1, 2, 3, 4, 5, 6, 7, 16

  6. [6]

    Panda-70m: Captioning 70m videos with multiple cross- modality teachers

    Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Eka- terina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, et al. Panda-70m: Captioning 70m videos with multiple cross- modality teachers. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 13320–13331, 2024. 4, 19

  7. [7]

    Genworld: Towards detect- ing ai-generated real-world simulation videos.arXiv preprint arXiv:2506.10975, 2025

    Weiliang Chen, Wenzhao Zheng, Yu Zheng, Lei Chen, Jie Zhou, Jiwen Lu, and Yueqi Duan. Genworld: Towards detect- ing ai-generated real-world simulation videos.arXiv preprint arXiv:2506.10975, 2025. 1, 2, 3, 5, 7

  8. [8]

    X2-dfd: A framework for explainable and extendable deepfake detection.arXiv preprint arXiv:2410.06126, 2024

    Yize Chen, Zhiyuan Yan, Guangliang Cheng, Kangran Zhao, Siwei Lyu, and Baoyuan Wu. X2-dfd: A framework for explainable and extendable deepfake detection.arXiv preprint arXiv:2410.06126, 2024. 1, 2

  9. [9]

    Scaling rl to long videos.arXiv preprint arXiv:2507.07966, 2025

    Yukang Chen, Wei Huang, Baifeng Shi, Qinghao Hu, Han- rong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, et al. Scaling rl to long videos.arXiv preprint arXiv:2507.07966, 2025. 3

  10. [10]

    Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. InProceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24185–24198, 2024. 2, 4

  11. [11]

    Gemini 2.5: Our most intelligent ai model

    Google DeepMind. Gemini 2.5: Our most intelligent ai model. https://blog.google/technology/ google - deepmind / gemini - model - thinking - updates-march-2025/ , 2025. Accessed: 2025-11-14. 1, 4, 5, 7, 17

  12. [12]

    Veo 3: Advanced generative video model

    Google DeepMind. Veo 3: Advanced generative video model. https://aistudio.google.com/models/veo- 3, 2025. Accessed: 2025-11-14. 1, 3

  13. [13]

    Video-R1: Reinforcing Video Reasoning in MLLMs

    Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in mllms.arXiv preprint arXiv:2503.21776,

  14. [14]

    Love-r1: Advancing long video understanding with an adaptive zoom-in mechanism via multi- step reasoning.arXiv preprint arXiv:2509.24786, 2025

    Shenghao Fu, Qize Yang, Yuan-Ming Li, Xihan Wei, Xiaohua Xie, and Wei-Shi Zheng. Love-r1: Advancing long video understanding with an adaptive zoom-in mechanism via multi- step reasoning.arXiv preprint arXiv:2509.24786, 2025. 3

  15. [15]

    David-xr1: Detecting ai-generated videos with explain- able reasoning.arXiv preprint arXiv:2506.14827, 2025

    Yifeng Gao, Yifan Ding, Hongyu Su, Juncheng Li, Yunhan Zhao, Lin Luo, Zixing Chen, Li Wang, Xin Wang, Yixu Wang, et al. David-xr1: Detecting ai-generated videos with explain- able reasoning.arXiv preprint arXiv:2506.14827, 2025. 1, 2, 4

  16. [16]

    Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1. 5-vl technical report.arXiv preprint arXiv:2505.07062, 2025. 3

  17. [17]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 6, 9

  18. [18]

    Geovlmath: Enhancing geometry reasoning in vision-language models via cross-modal reward for auxiliary line creation.arXiv preprint arXiv:2510.11020,

    Shasha Guo, Liang Pang, Xi Wang, Yanling Wang, Huawei Shen, and Jing Zhang. Geovlmath: Enhancing geometry reasoning in vision-language models via cross-modal reward for auxiliary line creation.arXiv preprint arXiv:2510.11020,

  19. [19]

    Rethinking vision-language model in face forensics: Multi-modal interpretable forged face detector

    Xiao Guo, Xiufeng Song, Yue Zhang, Xiaohong Liu, and Xiaoming Liu. Rethinking vision-language model in face forensics: Multi-modal interpretable forged face detector. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 105–116, 2025. 1, 2

  20. [20]

    LTX-Video: Realtime Video Latent Diffusion

    Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, et al. Ltx-video: Realtime video latent diffusion.arXiv preprint arXiv:2501.00103, 2024. 4

  21. [21]

    Framethinker: Learning to think with long videos via multi-turn frame spotlighting.arXiv preprint arXiv:2509.24304, 2025

    Zefeng He, Xiaoye Qu, Yafu Li, Siyuan Huang, Daizong Liu, and Yu Cheng. Framethinker: Learning to think with long videos via multi-turn frame spotlighting.arXiv preprint arXiv:2509.24304, 2025. 3

  22. [22]

    CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

    Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video gen- eration via transformers.arXiv preprint arXiv:2205.15868,

  23. [23]

    Visual sketchpad: Sketching as a visual chain of thought for mul- timodal language models.Advances in Neural Information Processing Systems, 37:139348–139379, 2024

    Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, and Ranjay Krishna. Visual sketchpad: Sketching as a visual chain of thought for mul- timodal language models.Advances in Neural Information Processing Systems, 37:139348–139379, 2024. 3

  24. [24]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024. 1, 2, 4

  25. [25]

    Ai-generated 10 video detection via perceptual straightening.arXiv preprint arXiv:2507.00583, 2025

    Christian Internò, Robert Geirhos, Markus Olhofer, Sunny Liu, Barbara Hammer, and David Klindt. Ai-generated 10 video detection via perceptual straightening.arXiv preprint arXiv:2507.00583, 2025. 1, 2, 5

  26. [26]

    VACE: All-in-One Video Creation and Editing

    Zeyinzi Jiang, Zhen Han, Chaojie Mao, Jingfeng Zhang, Yulin Pan, and Yu Liu. Vace: All-in-one video creation and editing. arXiv preprint arXiv:2503.07598, 2025. 4

  27. [27]

    Legion: Learning to ground and explain for synthetic image detection.arXiv preprint arXiv:2503.15264,

    Hengrui Kang, Siwei Wen, Zichen Wen, Junyan Ye, Weijia Li, Peilin Feng, Baichuan Zhou, Bin Wang, Dahua Lin, Linfeng Zhang, et al. Legion: Learning to ground and explain for synthetic image detection.arXiv preprint arXiv:2503.15264,

  28. [28]

    The Kinetics Human Action Video Dataset

    Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics hu- man action video dataset.arXiv preprint arXiv:1705.06950,

  29. [29]

    Text2video-zero: Text- to-image diffusion models are zero-shot video generators

    Levon Khachatryan, Andranik Movsisyan, Vahram Tade- vosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Text2video-zero: Text- to-image diffusion models are zero-shot video generators. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15954–15964, 2023. 4

  30. [30]

    Klingai: Creative video generation platform

    KlingAI. Klingai: Creative video generation platform. https://klingai.com/ , 2025. Accessed: 2025-11-

  31. [31]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024. 1, 3, 4

  32. [32]

    Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search

    Xin Lai, Junyi Li, Wei Li, Tao Liu, Tianjian Li, and Heng- shuang Zhao. Mini-o3: Scaling up reasoning patterns and interaction turns for visual search.arXiv preprint arXiv:2509.07969, 2025. 3

  33. [33]

    Blip- 2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip- 2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InInternational conference on machine learning, pages 19730–19742. PMLR,

  34. [34]

    Mvbench: A comprehensive multi-modal video understand- ing benchmark

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understand- ing benchmark. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195– 22206, 2024. 3

  35. [35]

    VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning

    Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, and Limin Wang. Videochat-r1: Enhancing spatio-temporal perception via re- inforcement fine-tuning.arXiv preprint arXiv:2504.06958,

  36. [36]

    Fakebench: Probing explainable fake image detection via large multimodal models

    Yixuan Li, Xuelin Liu, Xiaoyang Wang, Bu Sung Lee, Shiqi Wang, Anderson Rocha, and Weisi Lin. Fakebench: Probing explainable fake image detection via large multimodal models. IEEE Transactions on Information Forensics and Security,

  37. [37]

    Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023. 1, 2

  38. [38]

    Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement

    Yuqi Liu, Bohao Peng, Zhisheng Zhong, Zihao Yue, Fanbin Lu, Bei Yu, and Jiaya Jia. Seg-zero: Reasoning-chain guided segmentation via cognitive reinforcement.arXiv preprint arXiv:2503.06520, 2025. 3

  39. [39]

    Decof: Generated video detection via frame consistency: The first benchmark dataset.arXiv e-prints, pages arXiv–2402,

    Long Ma, Jiajia Zhang, Hongping Deng, Ningyu Zhang, Qinglang Guo, Haiyang Yu, Yong Liao, and Pengyuan Zhou. Decof: Generated video detection via frame consistency: The first benchmark dataset.arXiv e-prints, pages arXiv–2402,

  40. [40]

    Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

    Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models.arXiv preprint arXiv:2306.05424, 2023. 2

  41. [41]

    Open-o3 video: Grounded video reasoning with explicit spatio-temporal evidence.arXiv preprint arXiv:2510.20579, 2025

    Jiahao Meng, Xiangtai Li, Haochen Wang, Yue Tan, Tao Zhang, Lingdong Kong, Yunhai Tong, Anran Wang, Zhiyang Teng, Yujing Wang, et al. Open-o3 video: Grounded video reasoning with explicit spatio-temporal evidence.arXiv preprint arXiv:2510.20579, 2025. 6

  42. [42]

    Hailuo 02: Global ai video generation model by minimax

    MiniMax. Hailuo 02: Global ai video generation model by minimax. https://hailuo-02.com/, 2025. Accessed: 2025-11-14. 4

  43. [43]

    Genvidbench: A challenging benchmark for detecting ai-generated video

    Zhenliang Ni, Qiangyu Yan, Mouxiao Huang, Tianning Yuan, Yehui Tang, Hailin Hu, Xinghao Chen, and Yunhe Wang. Genvidbench: A challenging benchmark for detecting ai-generated video. arXiv preprint arXiv:2501.11340, 2025. 1, 2, 3, 4

  44. [44]

    Ovo-bench: How far is your video-llms from real-world online video understanding?

    Junbo Niu, Yifei Li, Ziyang Miao, Chunjiang Ge, Yuanhang Zhou, Qihao He, Xiaoyi Dong, Haodong Duan, Shuangrui Ding, Rui Qian, et al. Ovo-bench: How far is your video-llms from real-world online video understanding? In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 18902–18913, 2025. 3

  45. [45]

    Gpt-4o mini: Advancing cost-efficient intelligence

    OpenAI. Gpt-4o mini: Advancing cost-efficient intelligence. https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/, Accessed: 2025-11-14. 4

  47. [47]

    Sora 2 is here: Next-generation video-and-audio generation model

    OpenAI. Sora 2 is here: Next-generation video-and-audio generation model. https://openai.com/index/sora-2/, 2025. Accessed: 2025-11-14. 1, 3, 4

  48. [48]

    Introducing gpt-4.1 in the api

    OpenAI. Introducing gpt-4.1 in the api. https://openai.com/index/gpt-4-1/, 2025. Accessed: 2025-11-14. 1, 4, 5, 6, 7

  49. [49]

    Vidguard-r1: Ai-generated video detection and explanation via reasoning mllms and rl

    Kyoungjun Park, Yifan Yang, Juheon Yi, Shicheng Zheng, Yifei Shen, Dongqi Han, Caihua Shan, Muhammad Muaz, and Lili Qiu. Vidguard-r1: Ai-generated video detection and explanation via reasoning mllms and rl. arXiv preprint arXiv:2510.02282, 2025. 1, 2, 3, 4

  50. [50]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205,

  51. [51]

    Pika.art

    Pika Art. Pika.art. https://pika.art/, 2025. Accessed: 2025-11-14. 4

  52. [52]

    Pixverse – ai video generator from text & photos

    PixVerse AI. Pixverse – ai video generator from text & photos. https://app.pixverse.ai/, 2025. Accessed: 2025-11-14. 4

  53. [53]

    Qwen3-vl: Sharper vision, deeper thought, broader action

    Qwen Team. Qwen3-vl: Sharper vision, deeper thought, broader action. https://qwen.ai/blog?id=99f0335c4ad9ff6153e517418d48535ab6d8afef&from=research.latest-advancements-list, Accessed: 2025-10-23. 1, 3

  55. [55]

    Introducing runway gen-4

    Runway AI, Inc. Introducing runway gen-4. https://runwayml.com/research/introducing-runway-gen-4, 2025. Accessed: 2025-11-14. 4

  56. [56]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024. 6

  57. [57]

    Perceptual-cognitive universals as reflections of the world

    Roger N Shepard. Perceptual-cognitive universals as reflections of the world. Psychonomic Bulletin & Review, 1(1):2–28, 1994. 5

  58. [58]

    Deepfakes, misinformation, and disinformation in the era of frontier ai, generative ai, and large ai models

    Mohamed R Shoaib, Zefan Wang, Milad Taleby Ahvanooey, and Jun Zhao. Deepfakes, misinformation, and disinformation in the era of frontier ai, generative ai, and large ai models. In 2023 international conference on computer and applications (ICCA), pages 1–7. IEEE, 2023. 1

  59. [59]

    Initial knowledge: Six suggestions

    Elizabeth Spelke. Initial knowledge: Six suggestions. Cognition, 50(1-3):431–445, 1994. 5

  60. [60]

    Core knowledge

    Elizabeth S Spelke and Katherine D Kinzler. Core knowledge. Developmental science, 10(1):89–96, 2007. 5

  61. [61]

    OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning

    Zhaochen Su, Linjie Li, Mingyang Song, Yunzhuo Hao, Zhengyuan Yang, Jun Zhang, Guanjie Chen, Jiawei Gu, Juntao Li, Xiaoye Qu, et al. Openthinkimg: Learning to think with images via visual tool reinforcement learning. arXiv preprint arXiv:2505.08617, 2025. 3

  62. [62]

    Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers

    Zhaochen Su, Peng Xia, Hangyu Guo, Zhenhua Liu, Yan Ma, Xiaoye Qu, Jiaqi Liu, Yanshu Li, Kaide Zeng, Zhengyuan Yang, et al. Thinking with images for multimodal reasoning: Foundations, methods, and future frontiers. arXiv preprint arXiv:2506.23918, 2025. 3, 9

  63. [63]

    Forgerysleuth: Empowering multimodal large language models for image manipulation detection

    Zhihao Sun, Haoran Jiang, Haoran Chen, Yixin Cao, Xipeng Qiu, Zuxuan Wu, and Yu-Gang Jiang. Forgerysleuth: Empowering multimodal large language models for image manipulation detection. arXiv preprint arXiv:2411.19466, 2024. 2

  64. [64]

    Veritas: Generalizable deepfake detection via pattern-aware reasoning

    Hao Tan, Jun Lan, Zichang Tan, Ajian Liu, Chuanbiao Song, Senyuan Shi, Huijia Zhu, Weiqiang Wang, Jun Wan, and Zhen Lei. Veritas: Generalizable deepfake detection via pattern-aware reasoning. arXiv preprint arXiv:2508.21048, 2025. 1, 2

  65. [65]

    Video-lmm post-training: A deep dive into video reasoning with large multimodal models

    Yunlong Tang, Jing Bi, Pinxin Liu, Zhenyu Pan, Zhangyun Tan, Qianxiang Shen, Jiani Liu, Hang Hua, Junjia Guo, Yunzhong Xiao, et al. Video-lmm post-training: A deep dive into video reasoning with large multimodal models. arXiv preprint arXiv:2510.05034, 2025. 3, 9

  66. [66]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023. 2

  67. [67]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025. 1, 3, 4

  68. [68]

    ModelScope Text-to-Video Technical Report

    Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571, 2023. 4

  69. [69]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024. 2, 3

  70. [70]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025. 3, 5, 7

  71. [71]

    Lvbench: An extreme long video understanding benchmark

    Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Ming Ding, Xiaotao Gu, Shiyu Huang, Bin Xu, et al. Lvbench: An extreme long video understanding benchmark. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22958–22967, 2025. 3

  72. [72]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022. 1, 5

  73. [73]

    Busterx: Mllm-powered ai-generated video forgery detection and explanation

    Haiquan Wen, Yiwei He, Zhenglin Huang, Tianxiao Li, Zihan Yu, Xingru Huang, Lu Qi, Baoyuan Wu, Xiangtai Li, and Guangliang Cheng. Busterx: Mllm-powered ai-generated video forgery detection and explanation. arXiv preprint arXiv:2505.12620, 2025. 1, 2, 3, 4, 17

  74. [74]

    Busterx++: Towards unified cross-modal ai-generated content detection and explanation with mllm

    Haiquan Wen, Tianxiao Li, Zhenglin Huang, Yiwei He, and Guangliang Cheng. Busterx++: Towards unified cross-modal ai-generated content detection and explanation with mllm. arXiv preprint arXiv:2507.14632, 2025. 1, 2, 4, 6, 7, 17, 18

  75. [75]

    Spot the fake: Large multimodal model-based synthetic image detection with artifact explanation

    Siwei Wen, Junyan Ye, Peilin Feng, Hengrui Kang, Zichen Wen, Yize Chen, Jiang Wu, Wenjun Wu, Conghui He, and Weijia Li. Spot the fake: Large multimodal model-based synthetic image detection with artifact explanation. arXiv preprint arXiv:2503.14905, 2025. 2

  76. [76]

    Longvideobench: A benchmark for long-context interleaved video-language understanding

    Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. Longvideobench: A benchmark for long-context interleaved video-language understanding. Advances in Neural Information Processing Systems, 37:28828–28857, 2024. 3

  77. [77]

    Combating misinformation in the era of generative ai models

    Danni Xu, Shaojing Fan, and Mohan Kankanhalli. Combating misinformation in the era of generative ai models. In Proceedings of the 31st ACM International Conference on Multimedia, pages 9291–9298, 2023. 1

  78. [78]

    Fakeshield: Explainable image forgery detection and localization via multi-modal large language models

    Zhipei Xu, Xuanyu Zhang, Runyi Li, Zecheng Tang, Qing Huang, and Jian Zhang. Fakeshield: Explainable image forgery detection and localization via multi-modal large language models. arXiv preprint arXiv:2410.02761, 2024. 2

  79. [79]

    Advancing high-resolution video-language representation with large-scale video transcriptions

    Hongwei Xue, Tiankai Hang, Yanhong Zeng, Yuchong Sun, Bei Liu, Huan Yang, Jianlong Fu, and Baining Guo. Advancing high-resolution video-language representation with large-scale video transcriptions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5036–5045, 2022. 4, 19

  80. [80]

    Videochat-r1.5: Visual test-time scaling to reinforce multimodal reasoning by iterative perception

    Ziang Yan, Xinhao Li, Yinan He, Zhengrong Yue, Xiangyu Zeng, Yali Wang, Yu Qiao, Limin Wang, and Yi Wang. Videochat-r1.5: Visual test-time scaling to reinforce multimodal reasoning by iterative perception. arXiv preprint arXiv:2509.21100, 2025. 3

Showing first 80 references.