Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning
Pith reviewed 2026-05-21 16:38 UTC · model grok-4.3
The pith
Skyra detects AI-generated videos by spotting and explaining human-perceivable visual artifacts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Skyra is a multimodal large language model that identifies human-perceivable spatio-temporal visual artifacts in AI-generated videos and treats those artifacts as grounded evidence for accurate detection together with human-readable explanations.
What carries the argument
Two-stage training on a large-scale dataset of fine-grained human annotations for visual artifacts, which strengthens the model's perception of inconsistencies and its ability to verbalize them as detection rationale.
If this is right
- Detection outputs include specific visual reasons that humans can verify instead of a single yes-or-no label.
- Performance gains appear across benchmarks that include videos from more than ten current generators.
- The evaluation process surfaces patterns in artifact types that can inform refinements to future detectors.
- Explainable outputs become available for applications that require human oversight of AI video content.
Where Pith is reading between the lines
- The same artifact-grounding idea could be tested on other media such as AI-generated images to see whether explanations remain effective.
- Deployment in practice would likely need periodic retraining whenever new video generators introduce previously unseen artifact patterns.
- Collecting ongoing human annotations might prove more scalable if automated proposals for candidate artifacts are first generated by the model itself.
Load-bearing premise
The human annotations on the training videos accurately identify the artifacts that will appear and remain useful in videos made by generators never seen during development.
What would settle it
Run Skyra on a fresh collection of videos produced by an entirely new video generator outside the training set and the evaluation benchmark, then measure whether artifact identification accuracy and overall detection performance stay high.
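A minimal sketch of how that check could be scored, assuming a hypothetical list of (generator, ground-truth label, predicted label) records. None of the names below come from the paper. Reporting accuracy per generator keeps a drop on the unseen generator visible instead of averaging it away.

```python
from collections import defaultdict

def per_generator_accuracy(records):
    """Compute detection accuracy separately for each generator.

    records: iterable of (generator, is_fake_truth, is_fake_pred) tuples.
    The record layout is an illustrative assumption, not the paper's format.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for generator, truth, pred in records:
        total[generator] += 1
        correct[generator] += int(truth == pred)
    return {g: correct[g] / total[g] for g in total}

# Hypothetical predictions; "new_gen" stands for a generator unseen in training.
records = [
    ("seen_gen", True, True),
    ("seen_gen", True, True),
    ("new_gen", True, False),
    ("new_gen", True, True),
]
scores = per_generator_accuracy(records)
```

Real (non-generated) videos would need their own group in such a tally, since accuracy on fakes alone says nothing about false alarms.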
Original abstract
The misuse of AI-driven video generation technologies has raised serious social concerns, highlighting the urgent need for reliable AI-generated video detectors. However, most existing methods are limited to binary classification and lack the necessary explanations for human interpretation. In this paper, we present Skyra, a specialized multimodal large language model (MLLM) that identifies human-perceivable visual artifacts in AI-generated videos and leverages them as grounded evidence for both detection and explanation. To support this objective, we construct ViF-CoT-4K for Supervised Fine-Tuning (SFT), which represents the first large-scale AI-generated video artifact dataset with fine-grained human annotations. We then develop a two-stage training strategy that systematically enhances our model's spatio-temporal artifact perception, explanation capability, and detection accuracy. To comprehensively evaluate Skyra, we introduce ViF-Bench, a benchmark comprising 3K high-quality samples generated by over ten state-of-the-art video generators. Extensive experiments demonstrate that Skyra surpasses existing methods across multiple benchmarks, while our evaluation yields valuable insights for advancing explainable AI-generated video detection.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Skyra, a multimodal large language model (MLLM) for AI-generated video detection that identifies human-perceivable spatio-temporal visual artifacts and uses them as grounded evidence for both binary detection and natural-language explanations. It constructs the ViF-CoT-4K dataset (4K samples with fine-grained human annotations) for supervised fine-tuning, applies a two-stage training procedure to improve artifact perception and reasoning, and evaluates on the newly introduced ViF-Bench (3K high-quality samples from >10 generators), claiming superior detection accuracy and explanation quality over prior methods.
Significance. If the central claims hold, the work advances explainable detection of synthetic video, a timely problem given rapid progress in generative models. The emphasis on human-perceivable artifacts and the release of annotated datasets plus a multi-generator benchmark could support more interpretable and robust detectors than current binary classifiers. The two-stage training and grounded CoT reasoning represent a concrete methodological direction worth further exploration.
major comments (2)
- [Section 5] ViF-Bench evaluation (Section 5): The central claim that fine-grained annotations in ViF-CoT-4K capture transferable spatio-temporal artifacts relies on generalization to videos from >10 unseen generators. No cross-generator ablation or leave-one-generator-out analysis is reported; without it, reported gains in accuracy and explanation quality could arise from the base MLLM visual encoder or generic reasoning rather than the intended artifact-grounding mechanism.
- [Section 4.2] Two-stage training description (Section 4.2): The first stage is described as enhancing spatio-temporal artifact perception, yet the precise supervision signals, loss terms, and how artifact labels are converted into training targets are not specified. This makes it impossible to determine whether the second-stage detection and CoT improvements are attributable to the proposed grounded reasoning or to standard SFT effects.
minor comments (2)
- [Section 5] The exact list of generators and generation parameters used for ViF-Bench should be tabulated for reproducibility; the abstract's phrase 'over ten' is insufficient.
- Figure captions and axis labels in the qualitative results should explicitly indicate which frames or regions correspond to the cited artifacts to aid reader verification.
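The leave-one-generator-out analysis requested in the first major comment can be sketched as follows. This is an illustration under stated assumptions, not the authors' protocol: the `generator` and `video` keys are hypothetical, and only generated videos are split here.

```python
def leave_one_generator_out(samples):
    """Yield (held_out, train, test) once per generator.

    samples: list of dicts with hypothetical keys "generator" and "video".
    Every sample from the held-out generator goes to test; the rest go to train.
    Real videos would need a separate split, which this sketch omits.
    """
    generators = sorted({s["generator"] for s in samples})
    for held_out in generators:
        train = [s for s in samples if s["generator"] != held_out]
        test = [s for s in samples if s["generator"] == held_out]
        yield held_out, train, test

# Toy data with made-up generator names and file names.
samples = [
    {"generator": "A", "video": "a1.mp4"},
    {"generator": "B", "video": "b1.mp4"},
    {"generator": "A", "video": "a2.mp4"},
]
splits = list(leave_one_generator_out(samples))
```

Retraining once per split and comparing held-out accuracy against the full-data model is what would separate artifact grounding from gains that come with the base model.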
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which has helped us identify areas for improvement in our manuscript. Below, we provide point-by-point responses to the major comments and indicate the revisions we plan to implement.
- Referee: [Section 5] ViF-Bench evaluation (Section 5): The central claim that fine-grained annotations in ViF-CoT-4K capture transferable spatio-temporal artifacts relies on generalization to videos from >10 unseen generators. No cross-generator ablation or leave-one-generator-out analysis is reported; without it, reported gains in accuracy and explanation quality could arise from the base MLLM visual encoder or generic reasoning rather than the intended artifact-grounding mechanism.
  Authors: We appreciate this observation. Our ViF-Bench does evaluate on videos generated by more than 10 state-of-the-art models, many of which were not used in creating the ViF-CoT-4K training set, providing evidence of generalization to unseen generators. However, we acknowledge that an explicit cross-generator ablation would more rigorously isolate the contribution of our artifact-grounding approach. In the revised manuscript, we will add a leave-one-generator-out analysis to demonstrate that the performance gains persist when excluding specific generators from training.
  Revision: yes
- Referee: [Section 4.2] Two-stage training description (Section 4.2): The first stage is described as enhancing spatio-temporal artifact perception, yet the precise supervision signals, loss terms, and how artifact labels are converted into training targets are not specified. This makes it impossible to determine whether the second-stage detection and CoT improvements are attributable to the proposed grounded reasoning or to standard SFT effects.
  Authors: We agree that additional details on the training procedure are necessary for reproducibility and to clarify the contributions. In the revised version of the paper, we will expand Section 4.2 to include the specific supervision signals used in the first stage (such as artifact localization and description tasks derived from the human annotations), the loss functions employed (including the primary language modeling loss and any auxiliary objectives), and how the annotations are processed into training targets.
  Revision: yes
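For reference, the "standard SFT" baseline that the second comment contrasts against usually computes the language modeling loss on response tokens only. The sketch below shows that masking step. It assumes already-tokenized prompt and response ids and uses -100, the default `ignore_index` of PyTorch's `CrossEntropyLoss`. Nothing here is taken from the paper, whose actual targets and auxiliary objectives remain unspecified.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def build_sft_labels(prompt_ids, response_ids):
    """Return (input_ids, labels) with prompt positions excluded from the loss.

    The one-position shift for next-token prediction is assumed to happen
    inside the loss computation, as many training frameworks do.
    """
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

# Toy token ids, purely illustrative.
input_ids, labels = build_sft_labels([11, 12, 13], [21, 22])
```

Any artifact-specific supervision the authors add in stage one would show up as a difference in what goes into the response side of such targets, which is the detail the referee asks them to document.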
Circularity Check
No circularity; standard data collection, training, and benchmarking pipeline
The paper constructs a new human-annotated dataset ViF-CoT-4K for supervised fine-tuning, applies a two-stage training process to improve artifact perception and detection, and evaluates on a separately introduced benchmark ViF-Bench containing videos from over ten generators. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the derivation. The central claims rest on empirical training and external human annotations rather than reducing outputs to inputs by construction. This is a self-contained ML pipeline with independent evaluation data.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Human annotations in ViF-CoT-4K accurately and comprehensively identify the human-perceivable visual artifacts that distinguish AI-generated videos.
invented entities (3)
- Skyra MLLM: no independent evidence
- ViF-CoT-4K dataset: no independent evidence
- ViF-Bench: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: We propose a hierarchical taxonomy... Layer 1 (L1) defines two high-level categories: Low-level forgery... and Violation of Laws (physical and logical inconsistencies).
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: Skyra... identifies human-perceivable visual artifacts... two-stage training strategy... ViF-Bench... over ten state-of-the-art video generators.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating
  CaC is a hierarchical spatiotemporal concentrating reward model for video anomalies that reports 25.7% accuracy gains on fine-grained benchmarks and 11.7% anomaly reduction in generated videos via a new dataset and GR...