Video as Natural Augmentation: Towards Unified AI-Generated Image and Video Detection
Pith reviewed 2026-05-22 07:56 UTC · model grok-4.3
The pith
Treating video frames as natural augmentations plus a cross-modal contrastive loss lets one detector handle both AI-generated images and videos with better robustness.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VINA jointly trains on image and video data by using video frames as physically grounded natural augmentations and introduces a cross-modal supervised contrastive objective to align image and video representations under a shared real/fake decision boundary, delivering bidirectional gains, improved robustness and transferability, and state-of-the-art performance across 14 image, video, and in-the-wild benchmarks without complex augmentation or dataset-specific tuning.
What carries the argument
VINA framework that treats video frames as physically grounded natural augmentations combined with a cross-modal supervised contrastive objective to align representations.
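The alignment idea can be made concrete with a small sketch. The page does not give the paper's exact loss, so the block below assumes the standard supervised contrastive form of Khosla et al. (reference [59]), applied to one batch that mixes image and video-frame embeddings. The function name `supcon_loss`, the `temperature` value, and the toy embeddings are illustrative assumptions, not the authors' implementation.

```python
import math


def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss over one mixed image/video-frame batch.

    embeddings: list of equal-length float vectors (images and frames together)
    labels:     0 = real, 1 = fake; modality is deliberately NOT part of the label,
                so same-label samples attract each other across modalities.
    """
    # L2-normalise so that dot products are cosine similarities.
    normed = []
    for v in embeddings:
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        normed.append([x / norm for x in v])

    n = len(normed)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor with no same-label partner contributes nothing
        logits = {
            j: sum(a * b for a, b in zip(normed[i], normed[j])) / temperature
            for j in range(n)
            if j != i
        }
        # Log-sum-exp over all other samples, stabilised by subtracting the max.
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in logits.values()))
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Toy batch (hypothetical): [image, frame, image, frame] with 2-D embeddings.
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_label_aligned = supcon_loss(toy, [0, 0, 1, 1])     # clusters follow real/fake
loss_modality_aligned = supcon_loss(toy, [0, 1, 0, 1])  # clusters follow modality
```

In this toy batch the loss is far lower when the embedding clusters follow the real/fake label than when they follow modality, which is the behaviour a shared decision boundary is meant to reward.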
If this is right
- A single model achieves improved detection accuracy on both image and video content.
- Robustness increases against common video processing variations such as compression and resizing.
- Transfer performance rises across different generation sources and pipelines without extra tuning.
- State-of-the-art results hold on nearly all tested image, video, and mixed real-world benchmarks.
Where Pith is reading between the lines
- The work implies that many current detectors are overly sensitive to modality-specific pipelines rather than learning core generative traces.
- The same natural-augmentation idea could be tested on other shifting domains such as social-media compressed images or audio clips.
- A unified detector reduces the engineering overhead of maintaining separate systems for mixed image-and-video environments.
Load-bearing premise
The cross-modal gap stems mainly from synthesis-agnostic processing shifts and model-specific video fingerprints, and video frames used as augmentations, together with contrastive alignment, will close that gap without creating new failure modes.
What would settle it
Apply the trained VINA model to frames from a previously unseen video generator that introduces new compression artifacts or fingerprints and measure whether the accuracy gap between image and video inputs reappears or grows larger than baseline methods.
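A minimal sketch of the measurement this test calls for, assuming hard 0/1 predictions (0 = real, 1 = fake) are already available for both images and extracted frames. The helper names and the `tolerance` parameter are hypothetical and exist only for illustration.

```python
def accuracy(predictions, labels):
    """Fraction of correct real/fake predictions."""
    if not labels:
        raise ValueError("need at least one labelled sample")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def cross_modal_gap(image_preds, image_labels, frame_preds, frame_labels):
    """Image accuracy minus video-frame accuracy; positive means frames are harder."""
    return accuracy(image_preds, image_labels) - accuracy(frame_preds, frame_labels)


def gap_reappears(candidate_gap, baseline_gap, tolerance=0.0):
    """True if the candidate's gap on the unseen generator is at least the baseline's."""
    return candidate_gap >= baseline_gap - tolerance
```

Running `cross_modal_gap` for both the unified detector and a baseline on the same unseen generator, then comparing the two values with `gap_reappears`, gives a single yes/no answer to the question posed above.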
Original abstract
AI-generated content (AIGC) is rapidly improving, creating an urgent need for detectors that generalize across data sources, deployment pipelines, and visual modalities. A strongly generalizable detector should remain robust under distributional variations. However, we identify a consistent failure mode: SOTA AI-generated image detectors often collapse when applied to frames extracted from videos. Through systematic analysis, we show that this cross-modal gap arises from both entangled synthesis-agnostic video processing shifts, including color conversion, codec compression, resizing, and blur, and model-specific fingerprints introduced by modern video generators. Motivated by these findings, we propose VINA (Video as Natural Augmentation), a unified AIGC detection framework that jointly trains on image and video data. VINA uses video frames as physically grounded natural augmentations and further introduces a cross-modal supervised contrastive objective to align image and video representations under a shared real/fake decision boundary. Extensive experiments on 14 image, video, and in-the-wild benchmarks show that VINA delivers bidirectional gains, improves robustness and transferability, and achieves state-of-the-art performance across nearly all evaluated settings without complex augmentation or dataset-specific tuning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript identifies a consistent cross-modal failure mode where SOTA AI-generated image detectors collapse on frames extracted from AI-generated videos. It attributes this gap to synthesis-agnostic processing shifts (color conversion, codec compression, resizing, blur) together with model-specific video-generator fingerprints. To close the gap, the authors propose VINA, a joint image-video training framework that treats video frames as physically grounded natural augmentations and adds a cross-modal supervised contrastive objective to enforce a shared real/fake decision boundary. Experiments on 14 image, video, and in-the-wild benchmarks are reported to yield bidirectional gains, improved robustness/transferability, and SOTA performance without complex augmentations or dataset-specific tuning.
Significance. If the reported gains prove reproducible, the work would be a useful practical contribution to AIGC detection by offering a lightweight, modality-bridging recipe that exploits naturally occurring video variations rather than hand-crafted augmentations. The bidirectional improvement and emphasis on in-the-wild settings address a real deployment need.
major comments (3)
- [Experiments section] Results on 14 benchmarks: the central claim of consistent SOTA performance and bidirectional gains rests on the reported numbers, yet no error bars, standard deviations across runs, exact train/val/test splits, or confirmation that post-hoc model selection was avoided are supplied. This information is load-bearing for assessing whether the gains are reliable and generalizable.
- [§3.3] Cross-modal supervised contrastive objective: the design assumes that explicit alignment under a shared boundary will close the identified gap without suppressing modality-specific cues (e.g., generator fingerprints that remain useful on certain video synthesizers or on in-the-wild data whose shift distribution differs from the training videos). No ablation or analysis tests this assumption or quantifies whether over-alignment introduces new failure modes.
- [§2] Gap analysis: the systematic attribution of the cross-modal collapse to the listed processing shifts plus fingerprints is plausible, but the paper provides no controlled decomposition (e.g., isolating each shift's contribution via synthetic pipelines) that would justify why video frames as natural augmentations plus contrastive alignment is the targeted remedy rather than other interventions.
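The controlled decomposition requested in the last comment can be sketched as a small harness that applies one shift at a time and records the accuracy lost. Everything below is a toy stand-in under stated assumptions: the 1-D "images", the detector, and both shifts are hypothetical, and `coarse_quantize` is only a crude proxy for codec compression, not a real codec.

```python
def accuracy_under(detector, samples, labels, shift=None):
    """Accuracy of `detector` after applying an optional shift to every sample."""
    inputs = [shift(s) if shift is not None else s for s in samples]
    correct = sum(detector(x) == y for x, y in zip(inputs, labels))
    return correct / len(labels)


def per_shift_drop(detector, samples, labels, shifts):
    """Accuracy lost when each shift is applied on its own, relative to clean inputs."""
    clean = accuracy_under(detector, samples, labels)
    return {
        name: clean - accuracy_under(detector, samples, labels, fn)
        for name, fn in shifts.items()
    }


# Toy stand-ins (hypothetical): 1-D signals, a high-frequency detector, two shifts.
def box_blur(signal):
    """3-tap moving average with edge clamping."""
    n = len(signal)
    return [
        (signal[max(i - 1, 0)] + signal[i] + signal[min(i + 1, n - 1)]) / 3
        for i in range(n)
    ]


def coarse_quantize(signal, step=4):
    """Crude stand-in for lossy compression: snap values down to a multiple of `step`."""
    return [(v // step) * step for v in signal]


def toy_detector(signal, threshold=3.0):
    """Predict 1 (fake) when the mean absolute neighbour difference is large."""
    diffs = [abs(a - b) for a, b in zip(signal, signal[1:])]
    return int(sum(diffs) / len(diffs) > threshold)


samples = [[0, 10, 0, 10, 0, 10], [5, 5, 5, 5, 5, 5]]
labels = [1, 0]
drops = per_shift_drop(
    toy_detector, samples, labels,
    {"blur": box_blur, "quantize": coarse_quantize},
)
```

On this toy data the blur removes the high-frequency cue the detector relies on, while the quantization leaves it intact, so the harness attributes the whole drop to blur. A real study would swap in the actual detector and real color-conversion, codec, resize, and blur operations.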
minor comments (2)
- Figure captions describing the processing shifts or the contrastive loss could be expanded with explicit annotations or step-by-step illustrations to improve clarity for readers unfamiliar with the video pipeline.
- Notation for the supervised contrastive loss (positive/negative pair construction across modalities) would benefit from a small diagram or explicit pseudocode to avoid ambiguity.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments, which help strengthen the paper. We address each major comment below and indicate the revisions we will make.
- Referee: [Experiments section] Results on 14 benchmarks: the central claim of consistent SOTA performance and bidirectional gains rests on the reported numbers, yet no error bars, standard deviations across runs, exact train/val/test splits, or confirmation that post-hoc model selection was avoided are supplied. This information is load-bearing for assessing whether the gains are reliable and generalizable.
  Authors: We agree that statistical reliability measures are essential. In the revised manuscript we will add error bars and standard deviations computed over multiple independent runs (different random seeds) for all reported results on the 14 benchmarks. We will also document the exact train/val/test splits and confirm that hyperparameter and model selection were performed exclusively on validation data with test sets strictly held out.
  Revision: yes
- Referee: [§3.3] Cross-modal supervised contrastive objective: the design assumes that explicit alignment under a shared boundary will close the identified gap without suppressing modality-specific cues (e.g., generator fingerprints that remain useful on certain video synthesizers or on in-the-wild data whose shift distribution differs from the training videos). No ablation or analysis tests this assumption or quantifies whether over-alignment introduces new failure modes.
  Authors: We acknowledge the need to verify that the contrastive objective does not erase useful modality-specific signals. We will add an ablation that removes the cross-modal contrastive loss and directly compares performance on video-generator-specific and in-the-wild subsets. We will also include representation analysis (e.g., t-SNE or fingerprint retention metrics) to quantify whether generator-specific cues are preserved and to check for any new failure modes introduced by over-alignment.
  Revision: yes
- Referee: [§2] Gap analysis: the systematic attribution of the cross-modal collapse to the listed processing shifts plus fingerprints is plausible, but the paper provides no controlled decomposition (e.g., isolating each shift's contribution via synthetic pipelines) that would justify why video frames as natural augmentations plus contrastive alignment is the targeted remedy rather than other interventions.
  Authors: We agree a more granular decomposition would be informative. While constructing full synthetic pipelines for every individual shift is beyond the scope of the current work, we will expand Section 2 with quantitative measurements of detector degradation under each observed processing shift (color conversion, compression, resizing, blur) drawn from our existing data. This will provide stronger empirical grounding for why video frames serve as effective natural augmentations and why the contrastive alignment is a suitable remedy.
  Revision: partial
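The error bars promised in the first response amount to a mean and standard deviation over per-seed scores for each benchmark. A minimal sketch follows, assuming one scalar score per seed; the function name and the example numbers are placeholders and are not results from the paper.

```python
import statistics


def summarize_runs(scores_by_benchmark):
    """Map benchmark name -> (mean, sample standard deviation) over per-seed scores."""
    summary = {}
    for name, scores in scores_by_benchmark.items():
        if len(scores) < 2:
            raise ValueError(f"{name}: need at least two seeds for a standard deviation")
        summary[name] = (statistics.mean(scores), statistics.stdev(scores))
    return summary


# Placeholder scores for illustration only; these are not results from the paper.
example = summarize_runs({"toy_benchmark": [0.90, 0.92, 0.94]})
```

`statistics.stdev` is the sample (n-1) standard deviation, which is the usual choice when only a handful of seeds are available.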
Circularity Check
No significant circularity; derivation remains self-contained
The paper identifies a cross-modal performance gap through systematic analysis of processing shifts and generator fingerprints, then proposes VINA as a training framework that treats video frames as natural augmentations and adds a supervised contrastive loss to align representations. No load-bearing step reduces the claimed bidirectional gains or SOTA results to a fitted parameter defined inside the paper, a self-citation chain, or a re-expression of the evaluation metric. The contrastive objective is introduced as an independent alignment signal rather than a tautological restatement of the real/fake boundary, and the method is presented without invoking uniqueness theorems or ansatzes from prior author work that would collapse the argument. The overall chain is therefore independent of its own outputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Video frames extracted from AI-generated videos can be treated as physically grounded natural augmentations of the corresponding image content.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Paper passage: VINA uses video frames as physically grounded natural augmentations and further introduces a cross-modal supervised contrastive objective to align image and video representations under a shared real/fake decision boundary.
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_strictMono_of_one_lt · unclear
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Paper passage: We model video frames as image signals subjected to a complex, non-differentiable degradation function T(·).
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis, 2024. arXiv:2403.03206
- [2] Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators, 2024. URL https://openai.com/index/sora/
- [3] Tao Wang, Yushu Zhang, Shuren Qi, Ruoyu Zhao, Zhihua Xia, and Jian Weng. Security and privacy on generative data in aigc: A survey. ACM Computing Surveys, 57(4):1–34, 2024
- [4] Chuangchuang Tan, Yao Zhao, Shikui Wei, Guanghua Gu, Ping Liu, and Yunchao Wei. Rethinking the up-sampling operations in cnn-based generative network for generalizable deepfake detection. In CVPR, pages 28130–28139, 2024
- [5] Ouxiang Li, Jiayin Cai, Yanbin Hao, Xiaolong Jiang, Yao Hu, and Fuli Feng. Improving synthetic image detection towards generalization: An image transformation perspective. In KDD, 2025
- [6] Siyuan Cheng, Lingjuan Lyu, Zhenting Wang, Xiangyu Zhang, and Vikash Sehwag. Co-spy: Combining semantic and pixel features to detect synthetic images by ai. In CVPR, 2025
- [7] Utkarsh Ojha, Yuheng Li, and Yong Jae Lee. Towards universal fake image detectors that generalize across generative models. In CVPR, pages 24480–24489, 2023
- [8] Zhiyuan Yan, Jiangming Wang, Peng Jin, Ke-Yue Zhang, Chengchun Liu, Shen Chen, Taiping Yao, Shouhong Ding, Baoyuan Wu, and Li Yuan. Orthogonal subspace decomposition for generalizable ai-generated image detection. In ICML, 2025
- [9] Ruoxin Chen, Junwei Xi, Zhiyuan Yan, Ke-Yue Zhang, Shuang Wu, Jingyi Xie, Xu Chen, Lei Xu, Isabel Guan, Taiping Yao, and Shouhong Ding. Dual data alignment makes ai-generated image detector easier generalizable. In NeurIPS, 2025
- [10] Fabrizio Guillaro, Giada Zingarini, Ben Usman, Avneesh Sud, Davide Cozzolino, and Luisa Verdoliva. A bias-free training paradigm for more general ai-generated image detection. In CVPR, 2025
- [11] Riccardo Corvi, Davide Cozzolino, Ekta Prashnani, Shalini De Mello, Koki Nagano, and Luisa Verdoliva. Seeing what matters: Generalizable ai-generated video detection with forensic-oriented augmentation. In NeurIPS, 2025
- [12] Davide Cozzolino, Giovanni Poggi, Matthias Nießner, and Luisa Verdoliva. Zero-shot detection of ai-generated images. In ECCV, 2024
- [13] Shilin Yan, Ouxiang Li, Jiayin Cai, Yanbin Hao, Xiaolong Jiang, Yao Hu, and Weidi Xie. A sanity check for ai-generated image detection. In ICLR, 2025
- [14] Chunxiao Li, Xiaoxiao Wang, Meiling Li, Boming Miao, Peng Sun, Yunjian Zhang, Xiangyang Ji, and Yao Zhu. Bridging the gap between ideal and real-world evaluation: Benchmarking ai-generated image detection in challenging scenarios. In ICCV, 2025
- [15] Haoxing Chen, Yan Hong, Zizheng Huang, Zhuoer Xu, Zhangxuan Gu, Yaohui Li, Jun Lan, Huijia Zhu, Jianfu Zhang, Weiqiang Wang, and Huaxiong Li. Demamba: Ai-generated video detection on million-scale genvideo benchmark, 2024. arXiv:2405.19707
- [16] Shuhai Zhang, ZiHao Lian, Jiahao Yang, Daiyuan Li, Guoxuan Pang, Feng Liu, Bo Han, Shutao Li, and Mingkui Tan. Physics-driven spatiotemporal modeling for ai-generated video detection. In NeurIPS, 2025
- [17] Weiyun Wang, Min Shi, Qingyun Li, Wenhai Wang, Zhenhang Huang, Linjie Xing, Zhe Chen, Hao Li, Xizhou Zhu, Zhiguo Cao, Yushi Chen, Tong Lu, Jifeng Dai, and Yu Qiao. The all-seeing project: Towards panoptic visual recognition and understanding of the open world, 2023. arXiv:2308.01907
- [18] Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew Owens, and Alexei A. Efros. Cnn-generated images are surprisingly easy to spot... for now. In CVPR, pages 8695–8704, 2020
- [19] Mingjian Zhu, Hanting Chen, Qiangyu Yan, Xudong Huang, Guanyu Lin, Wei Li, Zhijun Tu, Hailin Hu, Jie Hu, and Yunhe Wang. Genimage: A million-scale benchmark for detecting ai-generated image. In NeurIPS, 2023
- [20] Zhengcen Li, Chenyang Jiang, Hang Zhao, Shiyang Zhou, Yunyang Mo, Feng Gao, Fan Yang, Qiben Shan, Shaocong Wu, and Jingyong Su. Preserving forgery artifacts: AI-generated video detection at native scale. In ICLR, 2026. URL https://openreview.net/forum?id=XD43lfRCg6
- [21] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. In NeurIPS, 2014
- [22] Diederik P. Kingma and Max Welling. Auto-encoding variational bayes, 2022. arXiv:1312.6114
- [23] Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning, 2018. arXiv:1711.00937
- [24] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020
- [25] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, pages 10684–10695, 2022
- [26] William Peebles and Saining Xie. Scalable diffusion models with transformers. In ICCV, 2023
- [27] Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction. In NeurIPS, 2024
- [28] Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, and Haoqi Fan. Emerging properties in unified multimodal pretraining, 2025. arXiv:2505.14683
- [29] Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, and Yaniv Taigman. Make-a-video: Text-to-video generation without text-video data, 2022. arXiv:2209.14792
- [30] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, Varun Jampani, and Robin Rombach. Stable video diffusion: Scaling latent video diffusion models to large datasets, 2023. arXiv:2311.15127
- [31] WanTeam, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, Ti...
- [32] Kuaishou. https://klingai.kuaishou.com, 2024. URL https://klingai.kuaishou.com
- [33] Yu Gao, Haoyuan Guo, Tuyen Hoang, Weilin Huang, Lu Jiang, Fangyuan Kong, Huixia Li, Jiashi Li, Liang Li, Xiaojie Li, Xunsong Li, Yifu Li, Shanchuan Lin, Zhijie Lin, Jiawei Liu, Shu Liu, Xiaonan Nie, Zhiwu Qing, Yuxi Ren, Li Sun, Zhi Tian, Rui Wang, Sen Wang, Guoqiang Wei, Guohong Wu, Jie Wu, Ruiqi Xia, Fei Xiao, Xuefeng Xiao, Jiangqiao Yan, Ceyuan Yang, J...
- [34] Zhendong Wang, Jianmin Bao, Wengang Zhou, Weilun Wang, Hezhen Hu, Hong Chen, and Houqiang Li. Dire for diffusion-generated image detection. In ICCV, pages 22445–22455, 2023
- [35] Baoying Chen, Jishen Zeng, Jianquan Yang, and Rui Yang. Drct: Diffusion reconstruction contrastive training towards universal detection of diffusion generated images. In ICML, 2024
- [36] Yanran Zhang, Bingyao Yu, Yu Zheng, Wenzhao Zheng, Yueqi Duan, Lei Chen, Jie Zhou, and Jiwen Lu. $\bf{d^3}$qe: Learning discrete distribution discrepancy-aware quantization error for autoregressive-generated image detection. In ICCV, 2025
- [37] Huan Liu, Zichang Tan, Chuangchuang Tan, Yunchao Wei, Jingdong Wang, and Yao Zhao. Forgery-aware adaptive transformer for generalizable synthetic image detection. In CVPR, pages 10770–10780, 2024
- [38] Christos Koutlis and Symeon Papadopoulos. Leveraging representations from intermediate encoder-blocks for synthetic image detection. In ECCV, 2024
- [39] Chuangchuang Tan, Renshuai Tao, Huan Liu, Guanghua Gu, Baoyuan Wu, Yao Zhao, and Yunchao Wei. C2p-clip: Injecting category common prompt in clip to enhance generalization in deepfake detection. In AAAI, volume 39, pages 7184–7192, 2025
- [40] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. In ICLR, 2018
- [41] Rohit Kundu, Hao Xiong, Vishal Mohanty, Athula Balachandran, and Amit K. Roy-Chowdhury. Towards a universal synthetic video detector: From face or background manipulations to fully ai-generated content. In CVPR, 2025
- [42] Zhen-Liang Ni, Qiangyu Yan, Tianning Yuan, Mouxiao Huang, Hailin Hu, Xinghao Chen, and Yunhe Wang. Genvidbench: A challenging benchmark for detecting ai-generated video, 2024
- [43] Chende Zheng, Ruiqi Suo, Chenhao Lin, Zhengyu Zhao, Le Yang, Shuai Liu, Minghui Yang, Cong Wang, and Chao Shen. D3: Training-free ai-generated video detection using second-order features. In ICCV, 2025
- [44] Kyoungjun Park, Yifan Yang, Juheon Yi, Shicheng Zheng, Yifei Shen, Dongqi Han, Caihua Shan, Muhammad Muaz, and Lili Qiu. Vidguard-r1: Ai-generated video detection and explanation via reasoning mllms and rl, 2025. arXiv:2510.02282
- [45] Yifei Li, Wenzhao Zheng, Yanran Zhang, Runze Sun, Yu Zheng, Lei Chen, Jie Zhou, and Jiwen Lu. Skyra: Ai-generated video detection via grounded artifact reasoning, 2025. arXiv:2512.15693
- [46] Hao Tan, Jun Lan, Senyuan Shi, Zichang Tan, Zijian Yu, Huijia Zhu, Weiqiang Wang, Jun Wan, and Zhen Lei. Videoveritas: Ai-generated video detection via perception pretext reinforcement learning, 2026. arXiv:2602.08828
- [47] Haiquan Wen, Tianxiao Li, Zhenglin Huang, Yiwei He, and Guangliang Cheng. Busterx++: Towards unified cross-modal ai-generated content detection and explanation with mllm, 2025. arXiv:2507.14632
- [48] Junyan Ye, Baichuan Zhou, Zilong Huang, Junan Zhang, Tianyi Bai, Hengrui Kang, Jun He, Honglin Lin, Zihao Wang, Tong Wu, Zhizheng Wu, Yiping Chen, Dahua Lin, Conghui He, and Weijia Li. Loki: A comprehensive synthetic data detection benchmark using large multimodal models, 2025. arXiv:2410.09732
- [49] Wayne Zhang, Changjiang Jiang, Zhonghao Zhang, Chenyang Si, Fengchang Yu, and Wei Peng. Ivy-fake: A unified explainable framework and benchmark for image and video aigc detection,
- [50] Ruoxin Chen, Jiahui Gao, Kaiqing Lin, Keyue Zhang, Yandan Zhao, Isabel Guan, Taiping Yao, and Shouhong Ding. Aligngemini: Generalizable ai-generated image detection through task-model alignment, 2026
- [51] Anirudh Sundara Rajan, Utkarsh Ojha, Jedidiah Schloesser, and Yong Jae Lee. Aligned datasets improve detection of latent diffusion-generated images. In ICLR, 2025
- [52] Ruiqi Liu, Yi Han, Zhengbo Zhang, Liwei Yao, Zhiyuan Yan, Jialiang Shen, ZhiJin Chen, Boyi Sun, Lubin Weng, Jing Dong, Yan Wang, and Shu Wu. Beyond artifacts: Real-centric envelope modeling for reliable ai-generated image detection, 2025. arXiv:2512.20937
- [53] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and Andrew Zisserman. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, pages 1–22, 2017
- [54] Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. Vbench: Comprehensive benchmark suite for video generative models, 2023. arXiv:2311.17982
- [55] Ricard Durall, Margret Keuper, and Janis Keuper. Watch your up-convolution: Cnn based generative deep neural networks are failing to reproduce spectral distributions. In CVPR, 2020
- [56] Zexi Jia, Chuanwei Huang, Yeshuang Zhu, Hongyan Fei, Xiaoyue Duan, Zhiqiang Yuan, Ying Deng, Jiapei Zhang, Jinchao Zhang, and Jie Zhou. Secret lies in color: Enhancing ai-generated images detection with color distribution analysis. In CVPR, pages 13445–13454, 2025
- [57] Rohit Kundu, Vishal Mohanty, Hao Xiong, Shan Jia, Athula Balachandran, and Amit K. Roy-Chowdhury. Saga: Source attribution of generative ai videos, 2025. arXiv:2511.12834
- [58] Xingyu Fu, Siyi Liu, Yinuo Xu, Pan Lu, Guangqiuse Hu, Tianbo Yang, Taran Anantasagar, Christopher Shen, Yikai Mao, Yuanzhe Liu, Keyush Shah, Chung Un Lee, Yejin Choi, James Zou, Dan Roth, and Chris Callison-Burch. Learning human-perceived fakeness in ai-generated videos via multimodal llms, 2025. arXiv:2509.22646
- [59] Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. In NeurIPS, volume 33, pages 18661–18673, 2020
- [60]
- [61] Davide Cozzolino, Giovanni Poggi, Riccardo Corvi, Matthias Nießner, and Luisa Verdoliva. Raising the bar of ai-generated image detection with clip. In CVPRW, 2024
- [62] Bar Cavia, Eliahu Horwitz, Tal Reiss, and Yedid Hoshen. Real-time deepfake detection in the real-world, 2024. arXiv:2406.09398
- [63] Ziqiang Li, Jiazhen Yan, Ziwen He, Kai Zeng, Weiwei Jiang, Lizhi Xiong, and Zhangjie Fu. Is artificial intelligence generated image detection a solved problem? In NeurIPS, 2025
- [64] Ruiqi Liu, Manni Cui, Ziheng Qin, Zhiyuan Yan, Ruoxin Chen, Yi Han, Zhiheng Li, Junkai Chen, ZhiJin Chen, Kaiqing Lin, Jialiang Shen, Lubin Weng, Jing Dong, Yan Wang, and Shu Wu. Mirror: Manifold ideal reference reconstructor for generalizable ai-generated image detection, 2026. arXiv:2602.02222
- [65] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021. arXiv:2103.00020
- [66] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Jegou, Julien Mairal, Patrick L...
- [67] Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, Julie...
- [68] Effort [8] is a parameter-efficient method that leverages SVD to construct orthogonal subspaces. This design prevents catastrophic forgetting of pre-trained semantic knowledge while efficiently learning forgery patterns. Our implementation uses their released weights trained on SD1.4 with identical preprocessing.
- [69] NPR [4] distinguishes synthetic images by analyzing low-level neighboring pixel relationships, which are characteristic of upsampling patterns in AI-generated content. It implements this approach by training a ResNet-50 model for identification. We use their official checkpoint for evaluation.
- [70] RINE [38] improves AI-generated image detection by mapping intermediate CLIP features to a forgery-aware vector space. We utilize the author-released weights trained on the ...
  Table 9: Overview of the evaluation benchmarks. Our evaluation covers 14 benchmarks (4 video benchmarks, 5 image benchmarks, and 5 in-the-wild benchmarks), spanning a diverse range of...
- [71] B-Free [10] employs a self-conditioned diffusion reconstruction paradigm to enforce semantic alignment between real and synthetic images, thereby isolating differences to generation artifacts. The method incorporates content augmentation via inpainting and fine-tunes a DINOv2+reg ViT using large crops to preserve forensic signals. We utilize their provi...
- [72] CO-SPY [6] enhances semantic and artifact features for robust synthetic image detection. Following the official code, we evaluate the two author-released implementations trained on SD1.4 and ProGAN.
- [73] DDA [9] aligns synthetic and real images in both pixel and frequency domains to mitigate spurious correlations and improve detector generalization. For evaluation, we utilize the author-released weights and identical settings, with the sole exception that we forgo applying the same JPEG Q=96 compression on the test sets to ensure a fair comparison with ot...