Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning

Jie Zhou; Jiwen Lu; Lei Chen; Runze Sun; Wenzhao Zheng; Yanran Zhang; Yifei Li; Yu Zheng

arxiv: 2512.15693 · v2 · pith:GTDIOGX7new · submitted 2025-12-17 · 💻 cs.CV

Yifei Li, Wenzhao Zheng, Yanran Zhang, Runze Sun, Yu Zheng, Lei Chen, Jie Zhou, Jiwen Lu

Pith reviewed 2026-05-21 16:38 UTC · model grok-4.3

classification 💻 cs.CV

keywords: AI-generated video detection · multimodal large language model · visual artifacts · spatio-temporal reasoning · explainable detection · video generation

The pith

Skyra detects AI-generated videos by spotting and explaining human-perceivable visual artifacts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a multimodal large language model that learns to find specific visual flaws in AI-made videos and uses those flaws as direct evidence for both deciding whether a video is fake and explaining why. It builds this capability through a dataset of human-annotated artifact examples and a two-stage training process that improves the model's ability to notice inconsistencies across space and time. A new benchmark with videos from many different generators is used to show that the resulting system outperforms prior detection methods. If the approach holds, detection tools would move from opaque binary labels to interpretable outputs that highlight concrete visual reasons. This shift matters for building trust in systems meant to counter the spread of misleading AI videos.

Core claim

Skyra is a multimodal large language model that identifies human-perceivable spatio-temporal visual artifacts in AI-generated videos and treats those artifacts as grounded evidence for accurate detection together with human-readable explanations.

What carries the argument

Two-stage training on a large-scale dataset of fine-grained human annotations for visual artifacts, which strengthens the model's perception of inconsistencies and its ability to verbalize them as detection rationale.

If this is right

Detection outputs include specific visual reasons that humans can verify instead of a single yes-or-no label.
Performance gains appear across benchmarks that include videos from more than ten current generators.
The evaluation process surfaces patterns in artifact types that can inform refinements to future detectors.
Explainable outputs become available for applications that require human oversight of AI video content.
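
As a minimal sketch of what such a verifiable output could look like as a data structure: the class names, the fields (`category`, `explanation`, `start_s`, `end_s`, `box`), and the `summarize` helper are all hypothetical and chosen only for illustration. This summary does not specify Skyra's actual output format.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ArtifactEvidence:
    """One human-checkable visual reason (all field names are hypothetical)."""
    category: str       # artifact type, e.g. "Shape Distortion"
    explanation: str    # free-text rationale a reviewer can check
    start_s: float      # start of the time span, in seconds
    end_s: float        # end of the time span, in seconds
    box: Optional[Tuple[float, float, float, float]] = None  # normalized x1, y1, x2, y2


@dataclass
class DetectionResult:
    """A verdict together with the evidence cited for it."""
    is_fake: bool
    evidence: List[ArtifactEvidence] = field(default_factory=list)

    def summarize(self) -> str:
        verdict = "fake" if self.is_fake else "real"
        if not self.evidence:
            return f"Verdict: {verdict} (no artifacts cited)"
        reasons = "; ".join(
            f"{e.category} at {e.start_s:.1f}-{e.end_s:.1f}s" for e in self.evidence
        )
        return f"Verdict: {verdict} ({reasons})"


# Toy example, not an actual model output.
result = DetectionResult(
    is_fake=True,
    evidence=[ArtifactEvidence("Shape Distortion", "The cup changes shape mid-shot.", 1.0, 2.5)],
)
print(result.summarize())
```

Keeping the time span and optional box next to each explanation is what would let a human reviewer jump to the cited frames and confirm or reject the reason.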

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same artifact-grounding idea could be tested on other media such as AI-generated images to see whether explanations remain effective.
Deployment in practice would likely need periodic retraining whenever new video generators introduce previously unseen artifact patterns.
Collecting ongoing human annotations might prove more scalable if automated proposals for candidate artifacts are first generated by the model itself.

Load-bearing premise

The human annotations on the training videos accurately identify the artifacts that will appear and remain useful in videos made by generators never seen during development.

What would settle it

Run Skyra on a fresh collection of videos produced by an entirely new video generator outside the training set and the evaluation benchmark, then measure whether artifact identification accuracy and overall detection performance stay high.
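
The test described above can be sketched as a per-generator accuracy comparison. This is an illustrative outline only: the record layout, the function names, and the generator names are assumptions, and the numbers are toy values rather than results from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# Each record: (generator name, ground-truth is_fake, predicted is_fake).
Record = Tuple[str, bool, bool]


def accuracy_by_generator(records: Iterable[Record]) -> Dict[str, float]:
    """Fraction of correct predictions, grouped by the generator of each video."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for generator, truth, pred in records:
        total[generator] += 1
        correct[generator] += int(truth == pred)
    return {g: correct[g] / total[g] for g in total}


def seen_unseen_gap(per_gen: Dict[str, float], seen: Set[str]) -> float:
    """Mean accuracy on seen generators minus mean accuracy on unseen ones."""
    seen_acc = [a for g, a in per_gen.items() if g in seen]
    unseen_acc = [a for g, a in per_gen.items() if g not in seen]
    if not seen_acc or not unseen_acc:
        raise ValueError("need at least one seen and one unseen generator")
    return sum(seen_acc) / len(seen_acc) - sum(unseen_acc) / len(unseen_acc)


# Toy records with placeholder generator names.
records = [
    ("gen_seen", True, True), ("gen_seen", True, True),
    ("gen_new", True, True), ("gen_new", True, False),
]
per_gen = accuracy_by_generator(records)
gap = seen_unseen_gap(per_gen, seen={"gen_seen"})
print(per_gen, gap)
```

A gap near zero on a truly new generator would support the premise that the annotated artifacts transfer; a large positive gap would count against it.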

Figures

Figures reproduced from arXiv: 2512.15693 by Jie Zhou, Jiwen Lu, Lei Chen, Runze Sun, Wenzhao Zheng, Yanran Zhang, Yifei Li, Yu Zheng.

**Figure 2.** Skyra leverages human-perceivable artifacts in AI-generated videos as grounded evidence for detection and explanation. Compared to off-the-shelf MLLMs and previous MLLM-based detectors, Skyra demonstrates superior artifact perception and detection capabilities.

**Figure 3.** Overview of the ViF-CoT-4K dataset. (a) The hierarchical taxonomy of AI-generated video artifacts. (b) Visual examples of artifacts under our taxonomy. (c) Construction pipeline of ViF-CoT-4K dataset, including authentic data collection and AI-generated video collection, manual annotation, and the step-by-step chain-of-thought explanation data construction process.

**Figure 4.** Statistics of the ViF-CoT-4K and ViF-Bench.

**Figure 5.** Overview of Skyra. We leverage a two-stage training pipeline to improve Skyra’s artifact perception and detection capabilities:

**Figure 6.** Case Study. More examples are provided in the appendix.

**Figure 7.** Annotation platform UI.

**Figure 9.** Visualization of Class Activation Maps (CAMs) produced

**Figure 10.** Response examples of off-the-shelf MLLMs.

**Figure 11.** Response examples of existing MLLM-based detector, BusterX++ [

**Figure 12.** System prompt and user prompt design.

**Figure 13.** Chain-of-Thought Annotation Prompt.

**Figure 14.** ViF-Bench Video Sample Examples-I

**Figure 15.** ViF-Bench Video Sample Examples-II

**Figure 16.** Skyra’s Response Example on Real Videos, I

**Figure 17.** Skyra’s Response Example on Real Videos, II

**Figure 18.** Skyra’s Response Example on Fake Videos, Texture Anomaly-Structure Anomaly

**Figure 19.** Skyra’s Response Example on Fake Videos, Color & Lighting Anomaly-Color Over-Saturation

**Figure 20.** Skyra’s Response Example on Fake Videos, Move Forgery-Camera Motion Inconsistency

**Figure 21.** Skyra’s Response Example on Fake Videos, Object Inconsistency-Shape Distortion

**Figure 22.** Skyra’s Response Example on Fake Videos, Interaction Inconsistency-Abnormal Rigid-Body Crossing

**Figure 23.** Skyra’s Response Example on Fake Videos, Unnatural Movement-Unnatural Human Movement

**Figure 24.** Skyra’s Response Example on Fake Videos, Violation of Causality Law-Violation of Physical Law

**Figure 25.** Skyra’s Response Example on Fake Videos, Violation of Commonsense-Text Distortion

Original abstract

The misuse of AI-driven video generation technologies has raised serious social concerns, highlighting the urgent need for reliable AI-generated video detectors. However, most existing methods are limited to binary classification and lack the necessary explanations for human interpretation. In this paper, we present Skyra, a specialized multimodal large language model (MLLM) that identifies human-perceivable visual artifacts in AI-generated videos and leverages them as grounded evidence for both detection and explanation. To support this objective, we construct ViF-CoT-4K for Supervised Fine-Tuning (SFT), which represents the first large-scale AI-generated video artifact dataset with fine-grained human annotations. We then develop a two-stage training strategy that systematically enhances our model's spatio-temporal artifact perception, explanation capability, and detection accuracy. To comprehensively evaluate Skyra, we introduce ViF-Bench, a benchmark comprising 3K high-quality samples generated by over ten state-of-the-art video generators. Extensive experiments demonstrate that Skyra surpasses existing methods across multiple benchmarks, while our evaluation yields valuable insights for advancing explainable AI-generated video detection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Skyra adds a new artifact dataset and multi-generator benchmark that could help explainable detection, but the transfer of human annotations to unseen generators is the unverified core assumption.

The main takeaway is that this paper supplies two concrete new resources: ViF-CoT-4K, a 4K-scale dataset with fine-grained human annotations of visual artifacts in AI videos, and ViF-Bench, a 3K-sample test set drawn from more than ten generators. The authors train an MLLM called Skyra with a two-stage process that first builds artifact perception and then adds chain-of-thought explanation and detection. That setup lets the model point to specific, human-visible flaws rather than just outputting a binary label, which is a step beyond most current detectors.

Referee Report

2 major / 2 minor

Summary. The paper introduces Skyra, a multimodal large language model (MLLM) for AI-generated video detection that identifies human-perceivable spatio-temporal visual artifacts and uses them as grounded evidence for both binary detection and natural-language explanations. It constructs the ViF-CoT-4K dataset (4K samples with fine-grained human annotations) for supervised fine-tuning, applies a two-stage training procedure to improve artifact perception and reasoning, and evaluates on the newly introduced ViF-Bench (3K high-quality samples from >10 generators), claiming superior detection accuracy and explanation quality over prior methods.

Significance. If the central claims hold, the work advances explainable detection of synthetic video, a timely problem given rapid progress in generative models. The emphasis on human-perceivable artifacts and the release of annotated datasets plus a multi-generator benchmark could support more interpretable and robust detectors than current binary classifiers. The two-stage training and grounded CoT reasoning represent a concrete methodological direction worth further exploration.

major comments (2)

[Section 5] ViF-Bench evaluation (Section 5): The central claim that fine-grained annotations in ViF-CoT-4K capture transferable spatio-temporal artifacts relies on generalization to videos from >10 unseen generators. No cross-generator ablation or leave-one-generator-out analysis is reported; without it, reported gains in accuracy and explanation quality could arise from the base MLLM visual encoder or generic reasoning rather than the intended artifact-grounding mechanism.
[Section 4.2] Two-stage training description (Section 4.2): The first stage is described as enhancing spatio-temporal artifact perception, yet the precise supervision signals, loss terms, and how artifact labels are converted into training targets are not specified. This makes it impossible to determine whether the second-stage detection and CoT improvements are attributable to the proposed grounded reasoning or to standard SFT effects.
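
For concreteness, the leave-one-generator-out analysis requested in the first major comment could be organized as below. This is a minimal sketch under stated assumptions: the `(video id, generator name)` sample layout, the function name, and the generator names are hypothetical, and nothing here reflects how the authors actually partition their data.

```python
from typing import Iterator, List, Tuple

Sample = Tuple[str, str]  # (video id, generator name); layout is an assumption


def leave_one_generator_out(
    samples: List[Sample],
) -> Iterator[Tuple[str, List[str], List[str]]]:
    """Yield (held_out_generator, train_ids, test_ids), one fold per generator.

    Every video from the held-out generator goes to the test side, so no
    generator contributes to both sides of a fold.
    """
    generators = sorted({gen for _, gen in samples})
    for held_out in generators:
        train = [vid for vid, gen in samples if gen != held_out]
        test = [vid for vid, gen in samples if gen == held_out]
        yield held_out, train, test


# Toy sample list with placeholder generator names.
samples = [("v1", "gen_a"), ("v2", "gen_a"), ("v3", "gen_b"), ("v4", "gen_c")]
folds = list(leave_one_generator_out(samples))
for held_out, train, test in folds:
    print(held_out, train, test)
```

Retraining once per fold and reporting accuracy on each held-out generator would separate gains that come from artifact grounding from gains that come from having seen a generator during training.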

minor comments (2)

[Section 5] The exact list of generators and generation parameters used for ViF-Bench should be tabulated for reproducibility; the abstract's phrase 'over ten' is insufficient.
Figure captions and axis labels in the qualitative results should explicitly indicate which frames or regions correspond to the cited artifacts to aid reader verification.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which has helped us identify areas for improvement in our manuscript. Below, we provide point-by-point responses to the major comments and indicate the revisions we plan to implement.

Referee: [Section 5] ViF-Bench evaluation (Section 5): The central claim that fine-grained annotations in ViF-CoT-4K capture transferable spatio-temporal artifacts relies on generalization to videos from >10 unseen generators. No cross-generator ablation or leave-one-generator-out analysis is reported; without it, reported gains in accuracy and explanation quality could arise from the base MLLM visual encoder or generic reasoning rather than the intended artifact-grounding mechanism.

Authors: We appreciate this observation. Our ViF-Bench does evaluate on videos generated by more than 10 state-of-the-art models, many of which were not used in creating the ViF-CoT-4K training set, providing evidence of generalization to unseen generators. However, we acknowledge that an explicit cross-generator ablation would more rigorously isolate the contribution of our artifact-grounding approach. In the revised manuscript, we will add a leave-one-generator-out analysis to demonstrate that the performance gains persist when excluding specific generators from training.
Revision: yes

Referee: [Section 4.2] Two-stage training description (Section 4.2): The first stage is described as enhancing spatio-temporal artifact perception, yet the precise supervision signals, loss terms, and how artifact labels are converted into training targets are not specified. This makes it impossible to determine whether the second-stage detection and CoT improvements are attributable to the proposed grounded reasoning or to standard SFT effects.

Authors: We agree that additional details on the training procedure are necessary for reproducibility and to clarify the contributions. In the revised version of the paper, we will expand Section 4.2 to include the specific supervision signals used in the first stage (such as artifact localization and description tasks derived from the human annotations), the loss functions employed (including the primary language modeling loss and any auxiliary objectives), and how the annotations are processed into training targets.
Revision: yes
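
For context, a language modeling loss in supervised fine-tuning is commonly computed as the mean negative log-likelihood over response tokens only, with prompt tokens masked out. The sketch below illustrates that common practice in plain Python. It is an assumption about what such a loss usually looks like, not a description of Skyra's training objective, which this page does not specify; the function name and toy probabilities are hypothetical.

```python
import math
from typing import Sequence


def masked_lm_loss(token_probs: Sequence[float], is_target: Sequence[bool]) -> float:
    """Mean negative log-likelihood over positions flagged as targets.

    token_probs[i] is the probability the model assigned to the correct token
    at position i. is_target[i] is False for prompt tokens, which are excluded
    from the average.
    """
    if len(token_probs) != len(is_target):
        raise ValueError("token_probs and is_target must have the same length")
    losses = [-math.log(p) for p, keep in zip(token_probs, is_target) if keep]
    if not losses:
        raise ValueError("at least one target position is required")
    return sum(losses) / len(losses)


# Toy values: two prompt positions (masked) followed by two response positions.
probs = [0.9, 0.2, 0.5, 0.25]
mask = [False, False, True, True]
loss = masked_lm_loss(probs, mask)
print(loss)
```

Under this formulation, the annotation-to-target conversion the referee asks about determines which tokens sit on the response side of the mask, which is why that detail matters for attributing the gains.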

Circularity Check

0 steps flagged

No circularity; standard data collection, training, and benchmarking pipeline

The paper constructs a new human-annotated dataset ViF-CoT-4K for supervised fine-tuning, applies a two-stage training process to improve artifact perception and detection, and evaluates on a separately introduced benchmark ViF-Bench containing videos from over ten generators. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the derivation. The central claims rest on empirical training and external human annotations rather than reducing outputs to inputs by construction. This is a self-contained ML pipeline with independent evaluation data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

The central claim depends on the quality of human artifact annotations and the effectiveness of the two-stage training procedure. No free parameters are explicitly fitted in the abstract. The main axiom is a domain assumption about annotation reliability and generalization. The new model, dataset, and benchmark are introduced entities without independent external validation in the provided text.

axioms (1)

Domain assumption: Human annotations in ViF-CoT-4K accurately and comprehensively identify the human-perceivable visual artifacts that distinguish AI-generated videos.
The supervised fine-tuning and subsequent performance claims rest directly on these annotations.

invented entities (3)

Skyra MLLM (no independent evidence)
purpose: Multimodal model specialized for spatio-temporal artifact perception and grounded explanation.
New model presented in the paper.

ViF-CoT-4K dataset (no independent evidence)
purpose: Large-scale training set with fine-grained artifact annotations for SFT.
Newly constructed dataset described as the first of its kind.

ViF-Bench (no independent evidence)
purpose: Evaluation benchmark containing videos from over ten state-of-the-art generators.
New benchmark introduced for comprehensive testing.

pith-pipeline@v0.9.0 · 5737 in / 1485 out tokens · 64280 ms · 2026-05-21T16:38:47.483353+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: We propose a hierarchical taxonomy... Layer 1 (L1) defines two high-level categories: Low-level forgery... and Violation of Laws (physical and logical inconsistencies).

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Skyra... identifies human-perceivable visual artifacts... two-stage training strategy... ViF-Bench... over ten state-of-the-art video generators.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating
cs.CV · 2026-05 · unverdicted · novelty 7.0

CaC is a hierarchical spatiotemporal concentrating reward model for video anomalies that reports 25.7% accuracy gains on fine-grained benchmarks and 11.7% anomaly reduction in generated videos via a new dataset and GR...

Reference graph

Works this paper leans on

124 extracted references · 124 canonical work pages · cited by 1 Pith paper · 31 internal anchors

[1] Jianfa Bai, Man Lin, Gang Cao, and Zijie Lou. Ai-generated video detection via spatial-temporal anomaly learning. In Chinese Conference on Pattern Recognition and Computer Vision (PRCV), pages 460–470. Springer, 2024.

[2] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025.

[3] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.

[4] Guibin Chen, Dixuan Lin, Jiangping Yang, Chunze Lin, Junchen Zhu, Mingyuan Fan, Hao Zhang, Sheng Chen, Zheng Chen, Chengcheng Ma, et al. Skyreels-v2: Infinite-length film generative model. arXiv preprint arXiv:2504.13074, 2025.

[5] Haoxing Chen, Yan Hong, Zizheng Huang, Zhuoer Xu, Zhangxuan Gu, Yaohui Li, Jun Lan, Huijia Zhu, Jianfu Zhang, Weiqiang Wang, et al. Demamba: Ai-generated video detection on million-scale genvideo benchmark. arXiv preprint arXiv:2405.19707, 2024.

[6] Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, et al. Panda-70m: Captioning 70m videos with multiple cross-modality teachers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13320–13331, 2024.

[7] Weiliang Chen, Wenzhao Zheng, Yu Zheng, Lei Chen, Jie Zhou, Jiwen Lu, and Yueqi Duan. Genworld: Towards detecting ai-generated real-world simulation videos. arXiv preprint arXiv:2506.10975, 2025.

[8] Yize Chen, Zhiyuan Yan, Guangliang Cheng, Kangran Zhao, Siwei Lyu, and Baoyuan Wu. X2-dfd: A framework for explainable and extendable deepfake detection. arXiv preprint arXiv:2410.06126, 2024.

[9] Yukang Chen, Wei Huang, Baifeng Shi, Qinghao Hu, Hanrong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, et al. Scaling rl to long videos. arXiv preprint arXiv:2507.07966, 2025.

[10] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24185–24198, 2024.
[11] Google DeepMind. Gemini 2.5: Our most intelligent ai model. https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/, 2025. Accessed: 2025-11-14.

[12] Google DeepMind. Veo 3: Advanced generative video model. https://aistudio.google.com/models/veo-3, 2025. Accessed: 2025-11-14.

[13] Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in mllms. arXiv preprint arXiv:2503.21776,

[14] Shenghao Fu, Qize Yang, Yuan-Ming Li, Xihan Wei, Xiaohua Xie, and Wei-Shi Zheng. Love-r1: Advancing long video understanding with an adaptive zoom-in mechanism via multi-step reasoning. arXiv preprint arXiv:2509.24786, 2025.

[15] Yifeng Gao, Yifan Ding, Hongyu Su, Juncheng Li, Yunhan Zhao, Lin Luo, Zixing Chen, Li Wang, Xin Wang, Yixu Wang, et al. David-xr1: Detecting ai-generated videos with explainable reasoning. arXiv preprint arXiv:2506.14827, 2025.

[16] Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1.5-vl technical report. arXiv preprint arXiv:2505.07062, 2025.

[17] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.

[18] Shasha Guo, Liang Pang, Xi Wang, Yanling Wang, Huawei Shen, and Jing Zhang. Geovlmath: Enhancing geometry reasoning in vision-language models via cross-modal reward for auxiliary line creation. arXiv preprint arXiv:2510.11020,

[19] Xiao Guo, Xiufeng Song, Yue Zhang, Xiaohong Liu, and Xiaoming Liu. Rethinking vision-language model in face forensics: Multi-modal interpretable forged face detector. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 105–116, 2025.

[20] Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, et al. Ltx-video: Realtime video latent diffusion. arXiv preprint arXiv:2501.00103, 2024.
[21] Zefeng He, Xiaoye Qu, Yafu Li, Siyuan Huang, Daizong Liu, and Yu Cheng. Framethinker: Learning to think with long videos via multi-turn frame spotlighting. arXiv preprint arXiv:2509.24304, 2025.

[22] Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868,

[23] Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, and Ranjay Krishna. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models. Advances in Neural Information Processing Systems, 37:139348–139379, 2024.

[24] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024.

[25] Christian Internò, Robert Geirhos, Markus Olhofer, Sunny Liu, Barbara Hammer, and David Klindt. Ai-generated video detection via perceptual straightening. arXiv preprint arXiv:2507.00583, 2025.

[26] Zeyinzi Jiang, Zhen Han, Chaojie Mao, Jingfeng Zhang, Yulin Pan, and Yu Liu. Vace: All-in-one video creation and editing. arXiv preprint arXiv:2503.07598, 2025.

[27] Hengrui Kang, Siwei Wen, Zichen Wen, Junyan Ye, Weijia Li, Peilin Feng, Baichuan Zhou, Bin Wang, Dahua Lin, Linfeng Zhang, et al. Legion: Learning to ground and explain for synthetic image detection. arXiv preprint arXiv:2503.15264,

[28] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950,

[29] Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Text2video-zero: Text-to-image diffusion models are zero-shot video generators. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15954–15964, 2023.

[30] KlingAI. Klingai: Creative video generation platform. https://klingai.com/, 2025. Accessed: 2025-11-
[31]

HunyuanVideo: A Systematic Framework For Large Video Generative Models

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024. 1, 3, 4

work page internal anchor Pith review Pith/arXiv arXiv 2024
[32]

Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search

Xin Lai, Junyi Li, Wei Li, Tao Liu, Tianjian Li, and Heng- shuang Zhao. Mini-o3: Scaling up reasoning patterns and interaction turns for visual search.arXiv preprint arXiv:2509.07969, 2025. 3

work page internal anchor Pith review Pith/arXiv arXiv 2025
[33]

Blip- 2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip- 2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InInternational conference on machine learning, pages 19730–19742. PMLR,

work page
[34]

Mvbench: A comprehensive multi-modal video understand- ing benchmark

Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understand- ing benchmark. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195– 22206, 2024. 3

work page 2024
[35]

Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, and Limin Wang. Videochat-r1: Enhancing spatio-temporal perception via reinforcement fine-tuning. arXiv preprint arXiv:2504.06958,
[36]

Yixuan Li, Xuelin Liu, Xiaoyang Wang, Bu Sung Lee, Shiqi Wang, Anderson Rocha, and Weisi Lin. Fakebench: Probing explainable fake image detection via large multimodal models. IEEE Transactions on Information Forensics and Security,
[37]

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36:34892–34916, 2023. 1, 2
[38]

Yuqi Liu, Bohao Peng, Zhisheng Zhong, Zihao Yue, Fanbin Lu, Bei Yu, and Jiaya Jia. Seg-zero: Reasoning-chain guided segmentation via cognitive reinforcement. arXiv preprint arXiv:2503.06520, 2025. 3
[39]

Long Ma, Jiajia Zhang, Hongping Deng, Ningyu Zhang, Qinglang Guo, Haiyang Yu, Yong Liao, and Pengyuan Zhou. Decof: Generated video detection via frame consistency: The first benchmark dataset. arXiv e-prints, pages arXiv–2402,
[40]

Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. arXiv preprint arXiv:2306.05424, 2023. 2
[41]

Jiahao Meng, Xiangtai Li, Haochen Wang, Yue Tan, Tao Zhang, Lingdong Kong, Yunhai Tong, Anran Wang, Zhiyang Teng, Yujing Wang, et al. Open-o3 video: Grounded video reasoning with explicit spatio-temporal evidence. arXiv preprint arXiv:2510.20579, 2025. 6
[42]

MiniMax. Hailuo 02: Global ai video generation model by minimax. https://hailuo-02.com/, 2025. Accessed: 2025-11-14. 4
[43]

Zhenliang Ni, Qiangyu Yan, Mouxiao Huang, Tianning Yuan, Yehui Tang, Hailin Hu, Xinghao Chen, and Yunhe Wang. Genvidbench: A challenging benchmark for detecting ai-generated video. arXiv preprint arXiv:2501.11340, 2025. 1, 2, 3, 4
[44]

Junbo Niu, Yifei Li, Ziyang Miao, Chunjiang Ge, Yuanhang Zhou, Qihao He, Xiaoyi Dong, Haodong Duan, Shuangrui Ding, Rui Qian, et al. Ovo-bench: How far is your video-llms from real-world online video understanding? In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 18902–18913, 2025. 3
[45]

OpenAI. Gpt-4o mini: Advancing cost-efficient intelligence. https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/, Accessed: 2025-11-14. 4
[47]

OpenAI. Sora 2 is here: Next-generation video-and-audio generation model. https://openai.com/index/sora-2/, 2025. Accessed: 2025-11-14. 1, 3, 4
[48]

OpenAI. Introducing gpt-4.1 in the api. https://openai.com/index/gpt-4-1/, 2025. Accessed: 2025-11-14. 1, 4, 5, 6, 7
[49]

Kyoungjun Park, Yifan Yang, Juheon Yi, Shicheng Zheng, Yifei Shen, Dongqi Han, Caihua Shan, Muhammad Muaz, and Lili Qiu. Vidguard-r1: Ai-generated video detection and explanation via reasoning mllms and rl. arXiv preprint arXiv:2510.02282, 2025. 1, 2, 3, 4
[50]

William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205,
[51]

Pika Art. Pika.art. https://pika.art/, 2025. Accessed: 2025-11-14. 4
[52]

PixVerse AI. Pixverse – ai video generator from text & photos. https://app.pixverse.ai/, 2025. Accessed: 2025-11-14. 4
[53]

Qwen Team. Qwen3-vl: Sharper vision, deeper thought, broader action. https://qwen.ai/blog?id=99f0335c4ad9ff6153e517418d48535ab6d8afef&from=research.latest-advancements-list, Accessed: 2025-10-23. 1, 3
[55]

Runway AI, Inc. Introducing runway gen-4. https://runwayml.com/research/introducing-runway-gen-4, 2025. Accessed: 2025-11-14. 4
[56]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024. 6
[57]

Roger N Shepard. Perceptual-cognitive universals as reflections of the world. Psychonomic Bulletin & Review, 1(1):2–28, 1994. 5
[58]

Mohamed R Shoaib, Zefan Wang, Milad Taleby Ahvanooey, and Jun Zhao. Deepfakes, misinformation, and disinformation in the era of frontier ai, generative ai, and large ai models. In 2023 international conference on computer and applications (ICCA), pages 1–7. IEEE, 2023. 1
[59]

Elizabeth Spelke. Initial knowledge: Six suggestions. Cognition, 50(1-3):431–445, 1994. 5
[60]

Elizabeth S Spelke and Katherine D Kinzler. Core knowledge. Developmental science, 10(1):89–96, 2007. 5
[61]

Zhaochen Su, Linjie Li, Mingyang Song, Yunzhuo Hao, Zhengyuan Yang, Jun Zhang, Guanjie Chen, Jiawei Gu, Juntao Li, Xiaoye Qu, et al. Openthinkimg: Learning to think with images via visual tool reinforcement learning. arXiv preprint arXiv:2505.08617, 2025. 3
[62]

Zhaochen Su, Peng Xia, Hangyu Guo, Zhenhua Liu, Yan Ma, Xiaoye Qu, Jiaqi Liu, Yanshu Li, Kaide Zeng, Zhengyuan Yang, et al. Thinking with images for multimodal reasoning: Foundations, methods, and future frontiers. arXiv preprint arXiv:2506.23918, 2025. 3, 9
[63]

Zhihao Sun, Haoran Jiang, Haoran Chen, Yixin Cao, Xipeng Qiu, Zuxuan Wu, and Yu-Gang Jiang. Forgerysleuth: Empowering multimodal large language models for image manipulation detection. arXiv preprint arXiv:2411.19466, 2024. 2
[64]

Hao Tan, Jun Lan, Zichang Tan, Ajian Liu, Chuanbiao Song, Senyuan Shi, Huijia Zhu, Weiqiang Wang, Jun Wan, and Zhen Lei. Veritas: Generalizable deepfake detection via pattern-aware reasoning. arXiv preprint arXiv:2508.21048, 2025. 1, 2
[65]

Yunlong Tang, Jing Bi, Pinxin Liu, Zhenyu Pan, Zhangyun Tan, Qianxiang Shen, Jiani Liu, Hang Hua, Junjia Guo, Yunzhong Xiao, et al. Video-lmm post-training: A deep dive into video reasoning with large multimodal models. arXiv preprint arXiv:2510.05034, 2025. 3, 9
[66]

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023. 2
[67]

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025. 1, 3, 4
[68]

Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571, 2023. 4
[69]

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024. 2, 3
[70]

Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025. 3, 5, 7
[71]

Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Ming Ding, Xiaotao Gu, Shiyu Huang, Bin Xu, et al. Lvbench: An extreme long video understanding benchmark. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22958–22967, 2025. 3
[72]

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022. 1, 5
[73]

Haiquan Wen, Yiwei He, Zhenglin Huang, Tianxiao Li, Zihan Yu, Xingru Huang, Lu Qi, Baoyuan Wu, Xiangtai Li, and Guangliang Cheng. Busterx: Mllm-powered ai-generated video forgery detection and explanation. arXiv preprint arXiv:2505.12620, 2025. 1, 2, 3, 4, 17
[74]

Haiquan Wen, Tianxiao Li, Zhenglin Huang, Yiwei He, and Guangliang Cheng. Busterx++: Towards unified cross-modal ai-generated content detection and explanation with mllm. arXiv preprint arXiv:2507.14632, 2025. 1, 2, 4, 6, 7, 17, 18
[75]

Siwei Wen, Junyan Ye, Peilin Feng, Hengrui Kang, Zichen Wen, Yize Chen, Jiang Wu, Wenjun Wu, Conghui He, and Weijia Li. Spot the fake: Large multimodal model-based synthetic image detection with artifact explanation. arXiv preprint arXiv:2503.14905, 2025. 2
[76]

Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. Longvideobench: A benchmark for long-context interleaved video-language understanding. Advances in Neural Information Processing Systems, 37:28828–28857, 2024. 3
[77]

Danni Xu, Shaojing Fan, and Mohan Kankanhalli. Combating misinformation in the era of generative ai models. In Proceedings of the 31st ACM International Conference on Multimedia, pages 9291–9298, 2023. 1
[78]

Zhipei Xu, Xuanyu Zhang, Runyi Li, Zecheng Tang, Qing Huang, and Jian Zhang. Fakeshield: Explainable image forgery detection and localization via multi-modal large language models. arXiv preprint arXiv:2410.02761, 2024. 2
[79]

Hongwei Xue, Tiankai Hang, Yanhong Zeng, Yuchong Sun, Bei Liu, Huan Yang, Jianlong Fu, and Baining Guo. Advancing high-resolution video-language representation with large-scale video transcriptions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5036–5045, 2022. 4, 19
[80]

Ziang Yan, Xinhao Li, Yinan He, Zhengrong Yue, Xiangyu Zeng, Yali Wang, Yu Qiao, Limin Wang, and Yi Wang. Videochat-r1.5: Visual test-time scaling to reinforce multimodal reasoning by iterative perception. arXiv preprint arXiv:2509.21100, 2025. 3

Showing first 80 references.

[1] [1]

Ai-generated video detection via spatial-temporal anomaly learning

Jianfa Bai, Man Lin, Gang Cao, and Zijie Lou. Ai-generated video detection via spatial-temporal anomaly learning. In Chinese Conference on Pattern Recognition and Computer Vision (PRCV), pages 460–470. Springer, 2024. 1, 2, 5, 7

work page 2024

[2] [2]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 1, 2, 4, 5, 7

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023. 1, 4

work page internal anchor Pith review Pith/arXiv arXiv 2023

[4] [4]

SkyReels-V2: Infinite-length Film Generative Model

Guibin Chen, Dixuan Lin, Jiangping Yang, Chunze Lin, Junchen Zhu, Mingyuan Fan, Hao Zhang, Sheng Chen, Zheng Chen, Chengcheng Ma, et al. Skyreels-v2: Infinite-length film generative model.arXiv preprint arXiv:2504.13074, 2025. 4

work page internal anchor Pith review Pith/arXiv arXiv 2025

[5] [5]

Demamba: Ai-generated video detec- tion on million-scale genvideo benchmark.arXiv preprint arXiv:2405.19707, 2024

Haoxing Chen, Yan Hong, Zizheng Huang, Zhuoer Xu, Zhangxuan Gu, Yaohui Li, Jun Lan, Huijia Zhu, Jianfu Zhang, Weiqiang Wang, et al. Demamba: Ai-generated video detec- tion on million-scale genvideo benchmark.arXiv preprint arXiv:2405.19707, 2024. 1, 2, 3, 4, 5, 6, 7, 16

work page arXiv 2024

[6] [6]

Panda-70m: Captioning 70m videos with multiple cross- modality teachers

Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Eka- terina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, et al. Panda-70m: Captioning 70m videos with multiple cross- modality teachers. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 13320–13331, 2024. 4, 19

work page 2024

[7] [7]

Genworld: Towards detect- ing ai-generated real-world simulation videos.arXiv preprint arXiv:2506.10975, 2025

Weiliang Chen, Wenzhao Zheng, Yu Zheng, Lei Chen, Jie Zhou, Jiwen Lu, and Yueqi Duan. Genworld: Towards detect- ing ai-generated real-world simulation videos.arXiv preprint arXiv:2506.10975, 2025. 1, 2, 3, 5, 7

work page arXiv 2025

[8] [8]

X2-dfd: A framework for explainable and extendable deepfake detection.arXiv preprint arXiv:2410.06126, 2024

Yize Chen, Zhiyuan Yan, Guangliang Cheng, Kangran Zhao, Siwei Lyu, and Baoyuan Wu. X2-dfd: A framework for explainable and extendable deepfake detection.arXiv preprint arXiv:2410.06126, 2024. 1, 2

work page arXiv 2024

[9] [9]

Scaling rl to long videos.arXiv preprint arXiv:2507.07966, 2025

Yukang Chen, Wei Huang, Baifeng Shi, Qinghao Hu, Han- rong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, et al. Scaling rl to long videos.arXiv preprint arXiv:2507.07966, 2025. 3

work page arXiv 2025

[10] [10]

Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. InProceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24185–24198, 2024. 2, 4

work page 2024

[11] [11]

Gemini 2.5: Our most intelligent ai model

Google DeepMind. Gemini 2.5: Our most intelligent ai model. https://blog.google/technology/ google - deepmind / gemini - model - thinking - updates-march-2025/ , 2025. Accessed: 2025-11-14. 1, 4, 5, 7, 17

work page 2025

[12] [12]

Veo 3: Advanced generative video model

Google DeepMind. Veo 3: Advanced generative video model. https://aistudio.google.com/models/veo- 3, 2025. Accessed: 2025-11-14. 1, 3

work page 2025

[13] [13]

Video-R1: Reinforcing Video Reasoning in MLLMs

Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in mllms.arXiv preprint arXiv:2503.21776,

work page internal anchor Pith review Pith/arXiv arXiv

[14] [14]

Love-r1: Advancing long video understanding with an adaptive zoom-in mechanism via multi- step reasoning.arXiv preprint arXiv:2509.24786, 2025

Shenghao Fu, Qize Yang, Yuan-Ming Li, Xihan Wei, Xiaohua Xie, and Wei-Shi Zheng. Love-r1: Advancing long video understanding with an adaptive zoom-in mechanism via multi- step reasoning.arXiv preprint arXiv:2509.24786, 2025. 3

work page arXiv 2025

[15] [15]

David-xr1: Detecting ai-generated videos with explain- able reasoning.arXiv preprint arXiv:2506.14827, 2025

Yifeng Gao, Yifan Ding, Hongyu Su, Juncheng Li, Yunhan Zhao, Lin Luo, Zixing Chen, Li Wang, Xin Wang, Yixu Wang, et al. David-xr1: Detecting ai-generated videos with explain- able reasoning.arXiv preprint arXiv:2506.14827, 2025. 1, 2, 4

work page arXiv 2025

[16] [16]

Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1. 5-vl technical report.arXiv preprint arXiv:2505.07062, 2025. 3

work page internal anchor Pith review Pith/arXiv arXiv 2025

[17] [17]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 6, 9

work page internal anchor Pith review Pith/arXiv arXiv 2025

[18] [18]

Geovlmath: Enhancing geometry reasoning in vision-language models via cross-modal reward for auxiliary line creation.arXiv preprint arXiv:2510.11020,

Shasha Guo, Liang Pang, Xi Wang, Yanling Wang, Huawei Shen, and Jing Zhang. Geovlmath: Enhancing geometry reasoning in vision-language models via cross-modal reward for auxiliary line creation.arXiv preprint arXiv:2510.11020,

work page arXiv

[19] [19]

Rethinking vision-language model in face forensics: Multi-modal interpretable forged face detector

Xiao Guo, Xiufeng Song, Yue Zhang, Xiaohong Liu, and Xiaoming Liu. Rethinking vision-language model in face forensics: Multi-modal interpretable forged face detector. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 105–116, 2025. 1, 2

work page 2025

[20] [20]

LTX-Video: Realtime Video Latent Diffusion

Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, et al. Ltx-video: Realtime video latent diffusion.arXiv preprint arXiv:2501.00103, 2024. 4

work page internal anchor Pith review Pith/arXiv arXiv 2024

[21] [21]

Framethinker: Learning to think with long videos via multi-turn frame spotlighting.arXiv preprint arXiv:2509.24304, 2025

Zefeng He, Xiaoye Qu, Yafu Li, Siyuan Huang, Daizong Liu, and Yu Cheng. Framethinker: Learning to think with long videos via multi-turn frame spotlighting.arXiv preprint arXiv:2509.24304, 2025. 3

work page arXiv 2025

[22] [22]

CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video gen- eration via transformers.arXiv preprint arXiv:2205.15868,

work page internal anchor Pith review Pith/arXiv arXiv

[23] [23]

Visual sketchpad: Sketching as a visual chain of thought for mul- timodal language models.Advances in Neural Information Processing Systems, 37:139348–139379, 2024

Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, and Ranjay Krishna. Visual sketchpad: Sketching as a visual chain of thought for mul- timodal language models.Advances in Neural Information Processing Systems, 37:139348–139379, 2024. 3

work page 2024

[24] [24]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024. 1, 2, 4

work page internal anchor Pith review Pith/arXiv arXiv 2024

[25] [25]

Ai-generated 10 video detection via perceptual straightening.arXiv preprint arXiv:2507.00583, 2025

Christian Internò, Robert Geirhos, Markus Olhofer, Sunny Liu, Barbara Hammer, and David Klindt. Ai-generated 10 video detection via perceptual straightening.arXiv preprint arXiv:2507.00583, 2025. 1, 2, 5

work page arXiv 2025

[26] [26]

VACE: All-in-One Video Creation and Editing

Zeyinzi Jiang, Zhen Han, Chaojie Mao, Jingfeng Zhang, Yulin Pan, and Yu Liu. Vace: All-in-one video creation and editing. arXiv preprint arXiv:2503.07598, 2025. 4

work page internal anchor Pith review Pith/arXiv arXiv 2025

[27] [27]

Legion: Learning to ground and explain for synthetic image detection.arXiv preprint arXiv:2503.15264,

Hengrui Kang, Siwei Wen, Zichen Wen, Junyan Ye, Weijia Li, Peilin Feng, Baichuan Zhou, Bin Wang, Dahua Lin, Linfeng Zhang, et al. Legion: Learning to ground and explain for synthetic image detection.arXiv preprint arXiv:2503.15264,

work page arXiv

[28] [28]

The Kinetics Human Action Video Dataset

Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics hu- man action video dataset.arXiv preprint arXiv:1705.06950,

work page internal anchor Pith review Pith/arXiv arXiv

[29] [29]

Text2video-zero: Text- to-image diffusion models are zero-shot video generators

Levon Khachatryan, Andranik Movsisyan, Vahram Tade- vosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Text2video-zero: Text- to-image diffusion models are zero-shot video generators. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15954–15964, 2023. 4

work page 2023

[30] [30]

Klingai: Creative video generation platform

KlingAI. Klingai: Creative video generation platform. https://klingai.com/ , 2025. Accessed: 2025-11-

work page 2025

[31] [31]

HunyuanVideo: A Systematic Framework For Large Video Generative Models

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024. 1, 3, 4

work page internal anchor Pith review Pith/arXiv arXiv 2024

[32] [32]

Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search

Xin Lai, Junyi Li, Wei Li, Tao Liu, Tianjian Li, and Heng- shuang Zhao. Mini-o3: Scaling up reasoning patterns and interaction turns for visual search.arXiv preprint arXiv:2509.07969, 2025. 3

work page internal anchor Pith review Pith/arXiv arXiv 2025

[33] [33]

Blip- 2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip- 2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InInternational conference on machine learning, pages 19730–19742. PMLR,

work page

[34] [34]

Mvbench: A comprehensive multi-modal video understand- ing benchmark

Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understand- ing benchmark. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195– 22206, 2024. 3

work page 2024

[35] [35]

VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning

Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, and Limin Wang. Videochat-r1: Enhancing spatio-temporal perception via re- inforcement fine-tuning.arXiv preprint arXiv:2504.06958,

work page internal anchor Pith review Pith/arXiv arXiv

[36] [36]

Fakebench: Probing explainable fake image detection via large multimodal models

Yixuan Li, Xuelin Liu, Xiaoyang Wang, Bu Sung Lee, Shiqi Wang, Anderson Rocha, and Weisi Lin. Fakebench: Probing explainable fake image detection via large multimodal models. IEEE Transactions on Information Forensics and Security,

work page

[37] [37]

Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023. 1, 2

work page 2023

[38] [38]

Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement

Yuqi Liu, Bohao Peng, Zhisheng Zhong, Zihao Yue, Fanbin Lu, Bei Yu, and Jiaya Jia. Seg-zero: Reasoning-chain guided segmentation via cognitive reinforcement.arXiv preprint arXiv:2503.06520, 2025. 3

work page internal anchor Pith review Pith/arXiv arXiv 2025

[39] [39]

Decof: Generated video detection via frame consistency: The first benchmark dataset.arXiv e-prints, pages arXiv–2402,

Long Ma, Jiajia Zhang, Hongping Deng, Ningyu Zhang, Qinglang Guo, Haiyang Yu, Yong Liao, and Pengyuan Zhou. Decof: Generated video detection via frame consistency: The first benchmark dataset.arXiv e-prints, pages arXiv–2402,

work page

[40] [40]

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models.arXiv preprint arXiv:2306.05424, 2023. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023

[41] [41]

Open-o3 video: Grounded video reasoning with explicit spatio-temporal evidence.arXiv preprint arXiv:2510.20579, 2025

Jiahao Meng, Xiangtai Li, Haochen Wang, Yue Tan, Tao Zhang, Lingdong Kong, Yunhai Tong, Anran Wang, Zhiyang Teng, Yujing Wang, et al. Open-o3 video: Grounded video reasoning with explicit spatio-temporal evidence.arXiv preprint arXiv:2510.20579, 2025. 6

work page arXiv 2025

[42] [42]

Hailuo 02: Global ai video generation model by minimax

MiniMax. Hailuo 02: Global ai video generation model by minimax. https://hailuo-02.com/, 2025. Accessed: 2025-11-14. 4

work page 2025

[43] [43]

Genvidbench: A challenging benchmark for detecting ai- generated video.arXiv preprint arXiv:2501.11340, 2025

Zhenliang Ni, Qiangyu Yan, Mouxiao Huang, Tianning Yuan, Yehui Tang, Hailin Hu, Xinghao Chen, and Yunhe Wang. Genvidbench: A challenging benchmark for detecting ai- generated video.arXiv preprint arXiv:2501.11340, 2025. 1, 2, 3, 4

work page arXiv 2025

[44] [44]

Ovo-bench: How far is your video-llms from real-world online video understanding? InProceedings of the Computer Vision and Pattern Recognition Conference, pages 18902–18913, 2025

Junbo Niu, Yifei Li, Ziyang Miao, Chunjiang Ge, Yuanhang Zhou, Qihao He, Xiaoyi Dong, Haodong Duan, Shuangrui Ding, Rui Qian, et al. Ovo-bench: How far is your video-llms from real-world online video understanding? InProceedings of the Computer Vision and Pattern Recognition Conference, pages 18902–18913, 2025. 3

work page 2025

[45] [45]

Gpt-4o mini: Advancing cost-efficient intelligence

OpenAI. Gpt-4o mini: Advancing cost-efficient intelligence. https : / / openai . com / index / gpt - 4o - mini - advancing - cost - efficient - intelligence/,

work page

[46] [46]

Accessed: 2025-11-14. 4

work page 2025

[47] [47]

Sora 2 is here: Next-generation video-and-audio gen- eration model

OpenAI. Sora 2 is here: Next-generation video-and-audio gen- eration model. https://openai.com/index/sora- 2/, 2025. Accessed: 2025-11-14. 1, 3, 4

work page 2025

[48] [48]

Introducing gpt-4.1 in the api

OpenAI. Introducing gpt-4.1 in the api. https://openai. com/index/gpt-4-1/, 2025. Accessed: 2025-11-14. 1, 4, 5, 6, 7

work page 2025

[49] [49]

Vidguard-r1: Ai-generated video detection and explanation via reasoning mllms and rl.arXiv preprint arXiv:2510.02282, 2025

Kyoungjun Park, Yifan Yang, Juheon Yi, Shicheng Zheng, Yifei Shen, Dongqi Han, Caihua Shan, Muhammad Muaz, and Lili Qiu. Vidguard-r1: Ai-generated video detection and explanation via reasoning mllms and rl.arXiv preprint arXiv:2510.02282, 2025. 1, 2, 3, 4

work page arXiv 2025

[50] [50]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF inter- national conference on computer vision, pages 4195–4205,

work page

[51] [51]

Pika.art

Pika Art. Pika.art. https://pika.art/, 2025. Accessed: 2025-11-14. 4

work page 2025

[52] [52]

Pixverse – ai video generator from text & photos

PixVerse AI. Pixverse – ai video generator from text & photos. https://app.pixverse.ai/, 2025. Accessed: 2025- 11-14. 4

work page 2025

[53] [53]

Qwen3-vl: Sharper vision, deeper thought, broader action

Qwen Team. Qwen3-vl: Sharper vision, deeper thought, broader action. https : / / qwen . ai / blog ? id = 11 99f0335c4ad9ff6153e517418d48535ab6d8afef& from = research . latest - advancements - list,

work page

[54] [54]

Accessed: 2025-10-23. 1, 3

work page 2025

[55]

Runway AI, Inc. Introducing runway gen-4. https://runwayml.com/research/introducing-runway-gen-4, 2025. Accessed: 2025-11-14. 4

[56]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024. 6

[57]

Roger N Shepard. Perceptual-cognitive universals as reflections of the world. Psychonomic Bulletin & Review, 1(1):2–28, 1994. 5

[58]

Mohamed R Shoaib, Zefan Wang, Milad Taleby Ahvanooey, and Jun Zhao. Deepfakes, misinformation, and disinformation in the era of frontier ai, generative ai, and large ai models. In 2023 international conference on computer and applications (ICCA), pages 1–7. IEEE, 2023. 1

[59]

Elizabeth Spelke. Initial knowledge: Six suggestions. Cognition, 50(1-3):431–445, 1994. 5

[60]

Elizabeth S Spelke and Katherine D Kinzler. Core knowledge. Developmental science, 10(1):89–96, 2007. 5

[61]

Zhaochen Su, Linjie Li, Mingyang Song, Yunzhuo Hao, Zhengyuan Yang, Jun Zhang, Guanjie Chen, Jiawei Gu, Juntao Li, Xiaoye Qu, et al. Openthinkimg: Learning to think with images via visual tool reinforcement learning. arXiv preprint arXiv:2505.08617, 2025. 3

[62]

Zhaochen Su, Peng Xia, Hangyu Guo, Zhenhua Liu, Yan Ma, Xiaoye Qu, Jiaqi Liu, Yanshu Li, Kaide Zeng, Zhengyuan Yang, et al. Thinking with images for multimodal reasoning: Foundations, methods, and future frontiers. arXiv preprint arXiv:2506.23918, 2025. 3, 9

[63]

Zhihao Sun, Haoran Jiang, Haoran Chen, Yixin Cao, Xipeng Qiu, Zuxuan Wu, and Yu-Gang Jiang. Forgerysleuth: Empowering multimodal large language models for image manipulation detection. arXiv preprint arXiv:2411.19466, 2024. 2

[64]

Hao Tan, Jun Lan, Zichang Tan, Ajian Liu, Chuanbiao Song, Senyuan Shi, Huijia Zhu, Weiqiang Wang, Jun Wan, and Zhen Lei. Veritas: Generalizable deepfake detection via pattern-aware reasoning. arXiv preprint arXiv:2508.21048, 2025. 1, 2

[65]

Yunlong Tang, Jing Bi, Pinxin Liu, Zhenyu Pan, Zhangyun Tan, Qianxiang Shen, Jiani Liu, Hang Hua, Junjia Guo, Yunzhong Xiao, et al. Video-lmm post-training: A deep dive into video reasoning with large multimodal models. arXiv preprint arXiv:2510.05034, 2025. 3, 9

[66]

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023. 2

[67]

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025. 1, 3, 4

[68]

Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571, 2023. 4

[69]

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024. 2, 3

[70]

Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025. 3, 5, 7

[71]

Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Ming Ding, Xiaotao Gu, Shiyu Huang, Bin Xu, et al. Lvbench: An extreme long video understanding benchmark. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22958–22967, 2025. 3

[72]

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022. 1, 5

[73]

Haiquan Wen, Yiwei He, Zhenglin Huang, Tianxiao Li, Zihan Yu, Xingru Huang, Lu Qi, Baoyuan Wu, Xiangtai Li, and Guangliang Cheng. Busterx: Mllm-powered ai-generated video forgery detection and explanation. arXiv preprint arXiv:2505.12620, 2025. 1, 2, 3, 4, 17

[74]

Haiquan Wen, Tianxiao Li, Zhenglin Huang, Yiwei He, and Guangliang Cheng. Busterx++: Towards unified cross-modal ai-generated content detection and explanation with mllm. arXiv preprint arXiv:2507.14632, 2025. 1, 2, 4, 6, 7, 17, 18

[75]

Siwei Wen, Junyan Ye, Peilin Feng, Hengrui Kang, Zichen Wen, Yize Chen, Jiang Wu, Wenjun Wu, Conghui He, and Weijia Li. Spot the fake: Large multimodal model-based synthetic image detection with artifact explanation. arXiv preprint arXiv:2503.14905, 2025. 2

[76]

Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. Longvideobench: A benchmark for long-context interleaved video-language understanding. Advances in Neural Information Processing Systems, 37:28828–28857, 2024. 3

[77]

Danni Xu, Shaojing Fan, and Mohan Kankanhalli. Combating misinformation in the era of generative ai models. In Proceedings of the 31st ACM International Conference on Multimedia, pages 9291–9298, 2023. 1

[78]

Zhipei Xu, Xuanyu Zhang, Runyi Li, Zecheng Tang, Qing Huang, and Jian Zhang. Fakeshield: Explainable image forgery detection and localization via multi-modal large language models. arXiv preprint arXiv:2410.02761, 2024. 2

[79]

Hongwei Xue, Tiankai Hang, Yanhong Zeng, Yuchong Sun, Bei Liu, Huan Yang, Jianlong Fu, and Baining Guo. Advancing high-resolution video-language representation with large-scale video transcriptions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5036–5045, 2022. 4, 19

[80]

Ziang Yan, Xinhao Li, Yinan He, Zhengrong Yue, Xiangyu Zeng, Yali Wang, Yu Qiao, Limin Wang, and Yi Wang. Videochat-r1.5: Visual test-time scaling to reinforce multimodal reasoning by iterative perception. arXiv preprint arXiv:2509.21100, 2025. 3