
arxiv: 2605.21977 · v1 · pith:AYRHEXY6new · submitted 2026-05-21 · 💻 cs.CV · cs.AI

Video as Natural Augmentation: Towards Unified AI-Generated Image and Video Detection

Pith reviewed 2026-05-22 07:56 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords AI-generated content detection · unified image video detection · natural augmentation · cross-modal contrastive learning · deepfake detection · generalization in detectors · robustness to processing shifts

The pith

Treating video frames as natural augmentations plus a cross-modal contrastive loss lets one detector handle both AI-generated images and videos with better robustness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies that top image detectors collapse on video frames due to everyday processing changes like compression, resizing, and color shifts, along with extra artifacts from video generators. It responds by training jointly on images and videos, using extracted frames as built-in physical augmentations and adding a contrastive objective that forces image and video features to share the same real-versus-fake boundary. This produces gains in both directions, stronger transfer to new sources, and top results on a wide range of benchmarks. A reader cares because the approach removes the need for separate image-only and video-only detectors while avoiding heavy hand-crafted augmentations or per-dataset retuning.

Core claim

VINA trains jointly on image and video data, using video frames as physically grounded natural augmentations, and introduces a cross-modal supervised contrastive objective that aligns image and video representations under a shared real/fake decision boundary. The paper reports bidirectional gains, improved robustness and transferability, and state-of-the-art performance in nearly all settings across 14 image, video, and in-the-wild benchmarks, without complex augmentation or dataset-specific tuning.

What carries the argument

The VINA framework carries the argument: it treats video frames as physically grounded natural augmentations and adds a cross-modal supervised contrastive objective to align image and video representations.
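
The exact loss is not reproduced on this page. As a hedged sketch, assuming VINA uses the standard supervised contrastive form and restricts positives to samples of the opposite modality with the same real/fake label (as the Figure 8 caption describes), the training objective could be written as below. The symbols $z_i$ (normalized embedding), $y_i$ (real/fake label), $m_i$ (modality), $\tau$ (temperature), and $\mathcal{L}_{\mathrm{cls}}$ (classification loss) are notation introduced here for illustration, not taken from the paper; $\lambda$ is the loss weight varied in Figure 8.

```latex
\mathcal{L} = \mathcal{L}_{\mathrm{cls}} + \lambda\,\mathcal{L}_{\mathrm{CM}},
\qquad
\mathcal{L}_{\mathrm{CM}} = \sum_{i} \frac{-1}{|P(i)|} \sum_{p \in P(i)}
\log \frac{\exp(z_i \cdot z_p / \tau)}{\sum_{a \neq i} \exp(z_i \cdot z_a / \tau)},
\qquad
P(i) = \{\, p : y_p = y_i,\ m_p \neq m_i \,\}
```

Under this reading, vanilla SupCon would differ only in dropping the condition $m_p \neq m_i$ from $P(i)$.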

If this is right

  • A single model achieves improved detection accuracy on both image and video content.
  • Robustness increases against common video processing variations such as compression and resizing.
  • Transfer performance rises across different generation sources and pipelines without extra tuning.
  • State-of-the-art results hold on nearly all tested image, video, and mixed real-world benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The work implies that many current detectors are overly sensitive to modality-specific pipelines rather than learning core generative traces.
  • The same natural-augmentation idea could be tested on other shifting domains such as social-media compressed images or audio clips.
  • A unified detector reduces the engineering overhead of maintaining separate systems for mixed image-and-video environments.

Load-bearing premise

The premise is that the cross-modal gap stems mainly from synthesis-agnostic processing shifts and model-specific video fingerprints, and that using video frames as augmentations together with contrastive alignment will close the gap without creating new failure modes.

What would settle it

Apply the trained VINA model to frames from a previously unseen video generator that introduces new compression artifacts or fingerprints, then measure whether the accuracy gap between image and video inputs reappears or grows larger than the gap observed for baseline methods.
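
The test described above reduces to comparing two accuracy gaps. The following is a minimal sketch of that bookkeeping, assuming binary predictions (0 = real, 1 = fake) are already available; the function names and the `tolerance` parameter are hypothetical and not part of the paper.

```python
def accuracy(preds, labels):
    """Fraction of predictions that match the ground-truth labels."""
    if len(preds) != len(labels) or not labels:
        raise ValueError("preds and labels must be non-empty and equal length")
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def cross_modal_gap(image_preds, image_labels, video_preds, video_labels):
    """Image accuracy minus video-frame accuracy; positive means videos are harder."""
    return accuracy(image_preds, image_labels) - accuracy(video_preds, video_labels)


def gap_reappears(candidate_gap, baseline_gap, tolerance=0.0):
    """True when the candidate's gap exceeds the baseline's gap by more than tolerance."""
    return candidate_gap > baseline_gap + tolerance
```

A detector whose gap on the unseen generator exceeds the baseline gap would count as evidence against the load-bearing premise.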

Figures

Figures reproduced from arXiv: 2605.21977 by Chenyang Jiang, Jingyong Su, Liangxu Su, Ming Tao, Shiyang Zhou, Tong Shao, Zhengcen Li.

Figure 1: Motivation and benchmark performance of unified AIGC detection. Left: Asymmetric cross-modal failures motivate VINA to learn from both image and video data. Right: Average accuracy across image, video, and in-the-wild AIGC benchmarks shows that image-based detectors degrade sharply on videos, while joint training improves cross-modal generalization.
Figure 2: DCT AC coefficient distributions. H.264 and recompressed video frames exhibit sharper near-zero peaks than original JPEG images. Compression artifacts: quantization noise in videos differs fundamentally from that in still images.
Figure 3: RAPSD analysis of video and image datasets. Real videos exhibit significant high-frequency decay, reflecting compression and motion blur. Adjacent histogram panel: x-axis Pixel Value (0-255), y-axis Normalized Frequency; series Image (GenImage Real JPEG), Video (Kinetics), Video Limited Range (Kinetics).
Figure 5: Robustness Analysis Across Various Perturbations on Established Image Benchmarks. Degradations include JPEG compression, HEIF compression, scaling, and cropping. Our method consistently outperforms other detectors, demonstrating superior robustness.
Figure 6: 2D power spectra of reconstructed noise. Spectra are averaged over 1,000 samples. Videos exhibit a distinct axial cross pattern, in contrast to the broader spectral spread of images. Generated images show periodic grid artifacts, while generated videos display a different configuration of spectral anomalies. This highlights a fundamental divergence in the frequency-space fingerprints of video and image gen…
Figure 7: Average Fourier power spectra across different generative models and datasets. The visualizations compare the frequency-domain characteristics of real data (leftmost column) against various image and video generation frameworks. Note the unique structural artifacts (e.g., grid patterns) present in the generated samples across different model families.
Figure 8: Ablation Study on SupCon Loss Weight. We vary the supervised contrastive loss weight λ and report AVG ACC, computed as the average accuracy over all 14 benchmarks. CM-SupCon defines positives across modalities with the same real/fake label, whereas vanilla SupCon uses all same-label samples as positives. CM-SupCon achieves the best result at λ = 0.05.
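
The near-zero peaks noted in the Figure 2 caption can be illustrated with a toy experiment. The sketch below assumes a JPEG-style orthonormal 8x8 DCT with a single uniform quantization step as a stand-in; H.264 uses a different integer transform and quantizer, and the toy block, step sizes, and function names are all hypothetical. It shows only that coarser quantization sends more AC coefficients to zero.

```python
import math

N = 8  # block size used by JPEG-style transforms


def dct2_block(block):
    """Orthonormal 2D DCT-II of an N x N block given as a list of rows."""
    def alpha(k):
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            s = 0.0
            for x in range(N):
                for y in range(N):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                          * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[u][v] = alpha(u) * alpha(v) * s
    return out


def near_zero_ac_fraction(block, step):
    """Fraction of AC coefficients that quantize to zero with a uniform step."""
    coeffs = dct2_block(block)
    ac = [coeffs[u][v] for u in range(N) for v in range(N) if (u, v) != (0, 0)]
    zeros = sum(1 for c in ac if round(c / step) == 0)
    return zeros / len(ac)


# Toy block: a smooth gradient plus a small deterministic texture.
block = [[10 * x + 5 * y + 3 * ((x * y) % 3) for y in range(N)] for x in range(N)]
fine = near_zero_ac_fraction(block, step=2.0)     # light quantization
coarse = near_zero_ac_fraction(block, step=40.0)  # heavy quantization
```

Here `coarse` is never smaller than `fine`, because any coefficient that rounds to zero at a small step also rounds to zero at a larger one.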
Original abstract

AI-generated content (AIGC) is rapidly improving, creating an urgent need for detectors that generalize across data sources, deployment pipelines, and visual modalities. A strongly generalizable detector should remain robust under distributional variations. However, we identify a consistent failure mode: SOTA AI-generated image detectors often collapse when applied to frames extracted from videos. Through systematic analysis, we show that this cross-modal gap arises from both entangled synthesis-agnostic video processing shifts, including color conversion, codec compression, resizing, and blur, and model-specific fingerprints introduced by modern video generators. Motivated by these findings, we propose VINA (Video as Natural Augmentation), a unified AIGC detection framework that jointly trains on image and video data. VINA uses video frames as physically grounded natural augmentations and further introduces a cross-modal supervised contrastive objective to align image and video representations under a shared real/fake decision boundary. Extensive experiments on 14 image, video, and in-the-wild benchmarks show that VINA delivers bidirectional gains, improves robustness and transferability, and achieves state-of-the-art performance across nearly all evaluated settings without complex augmentation or dataset-specific tuning.
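
One of the processing shifts named in the abstract, color conversion, can be made concrete with a small example. The sketch below assumes the common 8-bit limited-range convention for video luma (16 to 235) and maps full-range values into it; the function names are hypothetical, and the paper's own conversion pipeline is not described on this page.

```python
def full_to_limited(value):
    """Map an 8-bit full-range luma value (0-255) to limited range (16-235)."""
    if not 0 <= value <= 255:
        raise ValueError("expected an 8-bit value")
    return 16 + round(219 * value / 255)


def histogram(values, bins=256):
    """Count how many values fall on each integer level."""
    counts = [0] * bins
    for v in values:
        counts[v] += 1
    return counts


full = list(range(256))                       # one pixel at every full-range level
limited = [full_to_limited(v) for v in full]  # same pixels after range compression
hist = histogram(limited)
empty_levels = sum(1 for c in hist if c == 0)  # levels no pixel can occupy
```

After conversion the histogram has empty bins at both ends and doubled counts where two inputs collapse to one output, a purely synthesis-agnostic difference that a detector trained only on full-range images has never seen.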

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript identifies a consistent cross-modal failure mode where SOTA AI-generated image detectors collapse on frames extracted from AI-generated videos. It attributes this gap to synthesis-agnostic processing shifts (color conversion, codec compression, resizing, blur) together with model-specific video-generator fingerprints. To close the gap, the authors propose VINA, a joint image-video training framework that treats video frames as physically grounded natural augmentations and adds a cross-modal supervised contrastive objective to enforce a shared real/fake decision boundary. Experiments on 14 image, video, and in-the-wild benchmarks are reported to yield bidirectional gains, improved robustness/transferability, and SOTA performance without complex augmentations or dataset-specific tuning.

Significance. If the reported gains prove reproducible, the work would be a useful practical contribution to AIGC detection by offering a lightweight, modality-bridging recipe that exploits naturally occurring video variations rather than hand-crafted augmentations. The bidirectional improvement and emphasis on in-the-wild settings address a real deployment need.

major comments (3)
  1. [Experiments section, results on 14 benchmarks] The central claim of consistent SOTA performance and bidirectional gains rests on the reported numbers, yet no error bars, standard deviations across runs, exact train/val/test splits, or confirmation that post-hoc model selection was avoided are supplied. This information is load-bearing for assessing whether the gains are reliable and generalizable.
  2. [§3.3, cross-modal supervised contrastive objective] The design assumes that explicit alignment under a shared boundary will close the identified gap without suppressing modality-specific cues (e.g., generator fingerprints that remain useful on certain video synthesizers or on in-the-wild data whose shift distribution differs from the training videos). No ablation or analysis tests this assumption or quantifies whether over-alignment introduces new failure modes.
  3. [§2, gap analysis] The systematic attribution of the cross-modal collapse to the listed processing shifts plus fingerprints is plausible, but the paper provides no controlled decomposition (e.g., isolating each shift's contribution via synthetic pipelines) that would justify why video frames as natural augmentations plus contrastive alignment is the targeted remedy rather than other interventions.
minor comments (2)
  1. Figure captions describing the processing shifts or the contrastive loss could be expanded with explicit annotations or step-by-step illustrations to improve clarity for readers unfamiliar with the video pipeline.
  2. Notation for the supervised contrastive loss (positive/negative pair construction across modalities) would benefit from a small diagram or explicit pseudocode to avoid ambiguity.
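
The pair construction requested in the second minor comment could look roughly like the following. This is a sketch only, assuming the standard SupCon loss form with positives restricted to same-label samples from the other modality (per the Figure 8 caption); the function names, the temperature value, and the averaging over anchors are assumptions, not the authors' implementation.

```python
import math


def cm_positive_mask(labels, modalities):
    """mask[i][j] is True when j is a cross-modal positive for anchor i."""
    n = len(labels)
    return [[i != j and labels[i] == labels[j] and modalities[i] != modalities[j]
             for j in range(n)] for i in range(n)]


def normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cm_supcon_loss(embeddings, labels, modalities, temperature=0.1):
    """SupCon-style loss averaged over anchors that have a cross-modal positive."""
    z = [normalize(e) for e in embeddings]
    mask = cm_positive_mask(labels, modalities)
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if mask[i][j]]
        if not positives:
            continue  # anchor has no cross-modal positive in this batch
        logits = {j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
                  for j in range(n) if j != i}
        m = max(logits.values())  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0
```

Every other sample in the batch stays in the denominator, so same-modality, same-label samples are neither pulled together nor excluded under this reading; whether the paper treats them that way is not stated on this page.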

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments, which help strengthen the paper. We address each major comment below and indicate the revisions we will make.

Point-by-point responses
  1. Referee: [Experiments section, results on 14 benchmarks] The central claim of consistent SOTA performance and bidirectional gains rests on the reported numbers, yet no error bars, standard deviations across runs, exact train/val/test splits, or confirmation that post-hoc model selection was avoided are supplied. This information is load-bearing for assessing whether the gains are reliable and generalizable.

    Authors: We agree that statistical reliability measures are essential. In the revised manuscript we will add error bars and standard deviations computed over multiple independent runs (different random seeds) for all reported results on the 14 benchmarks. We will also document the exact train/val/test splits and confirm that hyperparameter and model selection were performed exclusively on validation data with test sets strictly held out.

    revision: yes

  2. Referee: [§3.3, cross-modal supervised contrastive objective] The design assumes that explicit alignment under a shared boundary will close the identified gap without suppressing modality-specific cues (e.g., generator fingerprints that remain useful on certain video synthesizers or on in-the-wild data whose shift distribution differs from the training videos). No ablation or analysis tests this assumption or quantifies whether over-alignment introduces new failure modes.

    Authors: We acknowledge the need to verify that the contrastive objective does not erase useful modality-specific signals. We will add an ablation that removes the cross-modal contrastive loss and directly compares performance on video-generator-specific and in-the-wild subsets. We will also include representation analysis (e.g., t-SNE or fingerprint retention metrics) to quantify whether generator-specific cues are preserved and to check for any new failure modes introduced by over-alignment.

    revision: yes

  3. Referee: [§2, gap analysis] The systematic attribution of the cross-modal collapse to the listed processing shifts plus fingerprints is plausible, but the paper provides no controlled decomposition (e.g., isolating each shift's contribution via synthetic pipelines) that would justify why video frames as natural augmentations plus contrastive alignment is the targeted remedy rather than other interventions.

    Authors: We agree a more granular decomposition would be informative. While constructing full synthetic pipelines for every individual shift is beyond the scope of the current work, we will expand Section 2 with quantitative measurements of detector degradation under each observed processing shift (color conversion, compression, resizing, blur) drawn from our existing data. This will provide stronger empirical grounding for why video frames serve as effective natural augmentations and why the contrastive alignment is a suitable remedy.

    revision: partial
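
The per-shift measurement promised in the third response amounts to applying one shift at a time and recording the accuracy drop. The sketch below shows that loop under heavy simplification: signals are 1D lists, the detector is a toy threshold on neighbor differences, and both shifts are crude stand-ins. Every name here is hypothetical and nothing reflects the authors' pipeline.

```python
def accuracy(detector, samples):
    """samples is a list of (signal, label); detector returns 0 (real) or 1 (fake)."""
    return sum(detector(x) == y for x, y in samples) / len(samples)


def per_shift_degradation(detector, samples, shifts):
    """Accuracy drop caused by each shift applied alone to every sample."""
    clean = accuracy(detector, samples)
    return {name: clean - accuracy(detector, [(fn(x), y) for x, y in samples])
            for name, fn in shifts.items()}


# Toy stand-ins (all hypothetical): the detector calls a sample fake when its
# mean absolute neighbor difference exceeds a threshold.
def toy_detector(signal, threshold=1.0):
    diffs = [abs(a - b) for a, b in zip(signal, signal[1:])]
    return int(sum(diffs) / len(diffs) > threshold)


def box_blur(signal):
    """3-tap moving average with edge replication."""
    padded = [signal[0]] + signal + [signal[-1]]
    return [(padded[i] + padded[i + 1] + padded[i + 2]) / 3 for i in range(len(signal))]


def limited_range(signal):
    """Compress full-range values toward the 16-235 interval."""
    return [16 + 219 * v / 255 for v in signal]


samples = [([0, 4, 0, 4, 0, 4], 1), ([0, 0.5, 1, 1.5, 2, 2.5], 0)]
report = per_shift_degradation(toy_detector, samples,
                               {"blur": box_blur, "limited_range": limited_range})
```

In this toy setup blur removes the high-frequency cue the detector depends on, while range compression leaves its decisions unchanged; a real decomposition would substitute actual detectors and codec-level transforms.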

Circularity Check

0 steps flagged

No significant circularity; derivation remains self-contained

Full rationale

The paper identifies a cross-modal performance gap through systematic analysis of processing shifts and generator fingerprints, then proposes VINA as a training framework that treats video frames as natural augmentations and adds a supervised contrastive loss to align representations. No load-bearing step reduces the claimed bidirectional gains or SOTA results to a fitted parameter defined inside the paper, a self-citation chain, or a re-expression of the evaluation metric. The contrastive objective is introduced as an independent alignment signal rather than a tautological restatement of the real/fake boundary, and the method is presented without invoking uniqueness theorems or ansatzes from prior author work that would collapse the argument. The overall chain is therefore independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework assumes that video processing artifacts and generator fingerprints are the dominant sources of the observed cross-modal gap and that a shared contrastive boundary will transfer without new domain-specific failure modes. No new physical entities are introduced.

axioms (1)
  • Domain assumption: Video frames extracted from AI-generated videos can be treated as physically grounded natural augmentations of the corresponding image content.
    Invoked in the motivation and method description to justify joint training.




Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · 12 internal anchors

  1. [1]

    Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis, 2024. arXiv:2403.03206

  2. [2]

    Video generation models as world simulators, 2024

    Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators, 2024. URL https://openai.com/index/sora/

  3. [3]

    Security and privacy on generative data in aigc: A survey

    Tao Wang, Yushu Zhang, Shuren Qi, Ruoyu Zhao, Zhihua Xia, and Jian Weng. Security and privacy on generative data in aigc: A survey. ACM Computing Surveys, 57(4):1–34, 2024

  4. [4]

    Rethinking the up-sampling operations in cnn-based generative network for generalizable deepfake detection

    Chuangchuang Tan, Yao Zhao, Shikui Wei, Guanghua Gu, Ping Liu, and Yunchao Wei. Rethinking the up-sampling operations in cnn-based generative network for generalizable deepfake detection. In CVPR, pages 28130–28139, 2024

  5. [5]

    Improving synthetic image detection towards generalization: An image transformation perspective

    Ouxiang Li, Jiayin Cai, Yanbin Hao, Xiaolong Jiang, Yao Hu, and Fuli Feng. Improving synthetic image detection towards generalization: An image transformation perspective. In KDD, 2025

  6. [6]

    Co-spy: Combining semantic and pixel features to detect synthetic images by ai

    Siyuan Cheng, Lingjuan Lyu, Zhenting Wang, Xiangyu Zhang, and Vikash Sehwag. Co-spy: Combining semantic and pixel features to detect synthetic images by ai. In CVPR, 2025

  7. [7]

    Towards universal fake image detectors that generalize across generative models

    Utkarsh Ojha, Yuheng Li, and Yong Jae Lee. Towards universal fake image detectors that generalize across generative models. In CVPR, pages 24480–24489, 2023

  8. [8]

    Orthogonal subspace decomposition for generalizable ai-generated image detection

    Zhiyuan Yan, Jiangming Wang, Peng Jin, Ke-Yue Zhang, Chengchun Liu, Shen Chen, Taiping Yao, Shouhong Ding, Baoyuan Wu, and Li Yuan. Orthogonal subspace decomposition for generalizable ai-generated image detection. In ICML, 2025

  9. [9]

    Dual data alignment makes ai-generated image detector easier generalizable

    Ruoxin Chen, Junwei Xi, Zhiyuan Yan, Ke-Yue Zhang, Shuang Wu, Jingyi Xie, Xu Chen, Lei Xu, Isabel Guan, Taiping Yao, and Shouhong Ding. Dual data alignment makes ai-generated image detector easier generalizable. In NeurIPS, 2025

  10. [10]

    A bias-free training paradigm for more general ai-generated image detection

    Fabrizio Guillaro, Giada Zingarini, Ben Usman, Avneesh Sud, Davide Cozzolino, and Luisa Verdoliva. A bias-free training paradigm for more general ai-generated image detection. In CVPR, 2025

  11. [11]

    Seeing what matters: Generalizable ai-generated video detection with forensic-oriented augmentation

    Riccardo Corvi, Davide Cozzolino, Ekta Prashnani, Shalini De Mello, Koki Nagano, and Luisa Verdoliva. Seeing what matters: Generalizable ai-generated video detection with forensic-oriented augmentation. In NeurIPS, 2025

  12. [12]

    Zero-shot detection of ai-generated images

    Davide Cozzolino, Giovanni Poggi, Matthias Nießner, and Luisa Verdoliva. Zero-shot detection of ai-generated images. In ECCV, 2024

  13. [13]

    A sanity check for ai-generated image detection

    Shilin Yan, Ouxiang Li, Jiayin Cai, Yanbin Hao, Xiaolong Jiang, Yao Hu, and Weidi Xie. A sanity check for ai-generated image detection. In ICLR, 2025

  14. [14]

    Bridging the gap between ideal and real-world evaluation: Benchmarking ai-generated image detection in challenging scenarios

    Chunxiao Li, Xiaoxiao Wang, Meiling Li, Boming Miao, Peng Sun, Yunjian Zhang, Xiangyang Ji, and Yao Zhu. Bridging the gap between ideal and real-world evaluation: Benchmarking ai-generated image detection in challenging scenarios. In ICCV, 2025

  15. [15]

    Demamba: Ai-generated video detection on million-scale genvideo benchmark, 2024

    Haoxing Chen, Yan Hong, Zizheng Huang, Zhuoer Xu, Zhangxuan Gu, Yaohui Li, Jun Lan, Huijia Zhu, Jianfu Zhang, Weiqiang Wang, and Huaxiong Li. Demamba: Ai-generated video detection on million-scale genvideo benchmark, 2024. arXiv:2405.19707

  16. [16]

    Physics-driven spatiotemporal modeling for ai-generated video detection

    Shuhai Zhang, ZiHao Lian, Jiahao Yang, Daiyuan Li, Guoxuan Pang, Feng Liu, Bo Han, Shutao Li, and Mingkui Tan. Physics-driven spatiotemporal modeling for ai-generated video detection. In NeurIPS, 2025

  17. [17]

    The all-seeing project: Towards panoptic visual recognition and understanding of the open world, 2023

    Weiyun Wang, Min Shi, Qingyun Li, Wenhai Wang, Zhenhang Huang, Linjie Xing, Zhe Chen, Hao Li, Xizhou Zhu, Zhiguo Cao, Yushi Chen, Tong Lu, Jifeng Dai, and Yu Qiao. The all-seeing project: Towards panoptic visual recognition and understanding of the open world, 2023. arXiv:2308.01907

  18. [18]

    Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew Owens, and Alexei A. Efros. Cnn-generated images are surprisingly easy to spot... for now. In CVPR, pages 8695–8704, 2020

  19. [19]

    Genimage: A million-scale benchmark for detecting ai-generated image

    Mingjian Zhu, Hanting Chen, Qiangyu Yan, Xudong Huang, Guanyu Lin, Wei Li, Zhijun Tu, Hailin Hu, Jie Hu, and Yunhe Wang. Genimage: A million-scale benchmark for detecting ai-generated image. In NeurIPS, 2023

  20. [20]

    Preserving forgery artifacts: AI-generated video detection at native scale

    Zhengcen Li, Chenyang Jiang, Hang Zhao, Shiyang Zhou, Yunyang Mo, Feng Gao, Fan Yang, Qiben Shan, Shaocong Wu, and Jingyong Su. Preserving forgery artifacts: AI-generated video detection at native scale. In ICLR, 2026. URL https://openreview.net/forum?id=XD43lfRCg6

  21. [21]

    Generative adversarial networks

    Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. In NeurIPS, 2014

  22. [22]

    Auto-Encoding Variational Bayes

    Diederik P. Kingma and Max Welling. Auto-encoding variational bayes, 2022. arXiv:1312.6114

  23. [23]

    Neural Discrete Representation Learning

    Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning, 2018. arXiv:1711.00937

  24. [24]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020

  25. [25]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, pages 10684–10695, 2022

  26. [26]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In ICCV, 2023

  27. [27]

    Visual autoregressive modeling: Scalable image generation via next-scale prediction

    Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction. In NeurIPS, 2024

  28. [28]

    Emerging Properties in Unified Multimodal Pretraining

    Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, and Haoqi Fan. Emerging properties in unified multimodal pretraining, 2025. arXiv:2505.14683

  29. [29]

    Make-A-Video: Text-to-Video Generation without Text-Video Data

    Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, and Yaniv Taigman. Make-a-video: Text-to-video generation without text-video data, 2022. arXiv:2209.14792

  30. [30]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, Varun Jampani, and Robin Rombach. Stable video diffusion: Scaling latent video diffusion models to large datasets, 2023. arXiv:2311.15127

  31. [31]

    Wan: Open and Advanced Large-Scale Video Generative Models

    WanTeam, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, Ti...

  32. [32]

    https://klingai.kuaishou.com, 2024

    Kuaishou. https://klingai.kuaishou.com, 2024. URL https://klingai.kuaishou.com

  33. [33]

    Seedance 1.0: Exploring the boundaries of video generation models, 2025

    Yu Gao, Haoyuan Guo, Tuyen Hoang, Weilin Huang, Lu Jiang, Fangyuan Kong, Huixia Li, Jiashi Li, Liang Li, Xiaojie Li, Xunsong Li, Yifu Li, Shanchuan Lin, Zhijie Lin, Jiawei Liu, Shu Liu, Xiaonan Nie, Zhiwu Qing, Yuxi Ren, Li Sun, Zhi Tian, Rui Wang, Sen Wang, Guoqiang Wei, Guohong Wu, Jie Wu, Ruiqi Xia, Fei Xiao, Xuefeng Xiao, Jiangqiao Yan, Ceyuan Yang, J...

  34. [34]

    Dire for diffusion-generated image detection

    Zhendong Wang, Jianmin Bao, Wengang Zhou, Weilun Wang, Hezhen Hu, Hong Chen, and Houqiang Li. Dire for diffusion-generated image detection. In ICCV, pages 22445–22455, 2023

  35. [35]

    Drct: Diffusion reconstruction contrastive training towards universal detection of diffusion generated images

    Baoying Chen, Jishen Zeng, Jianquan Yang, and Rui Yang. Drct: Diffusion reconstruction contrastive training towards universal detection of diffusion generated images. In ICML, 2024

  36. [36]

    $\bf{d^3}$qe: Learning discrete distribution discrepancy-aware quantization error for autoregressive-generated image detection

    Yanran Zhang, Bingyao Yu, Yu Zheng, Wenzhao Zheng, Yueqi Duan, Lei Chen, Jie Zhou, and Jiwen Lu. $\bf{d^3}$qe: Learning discrete distribution discrepancy-aware quantization error for autoregressive-generated image detection. In ICCV, 2025

  37. [37]

    Forgery-aware adaptive transformer for generalizable synthetic image detection

    Huan Liu, Zichang Tan, Chuangchuang Tan, Yunchao Wei, Jingdong Wang, and Yao Zhao. Forgery-aware adaptive transformer for generalizable synthetic image detection. In CVPR, pages 10770–10780, 2024

  38. [38]

    Leveraging representations from intermediate encoder-blocks for synthetic image detection

    Christos Koutlis and Symeon Papadopoulos. Leveraging representations from intermediate encoder-blocks for synthetic image detection. In ECCV, 2024

  39. [39]

    C2p-clip: Injecting category common prompt in clip to enhance generalization in deepfake detection

    Chuangchuang Tan, Renshuai Tao, Huan Liu, Guanghua Gu, Baoyuan Wu, Yao Zhao, and Yunchao Wei. C2p-clip: Injecting category common prompt in clip to enhance generalization in deepfake detection. In AAAI, volume 39, pages 7184–7192, 2025

  40. [40]

    Progressive growing of gans for improved quality, stability, and variation

    Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. In ICLR, 2018

  41. [41]

    Towards a universal synthetic video detector: From face or background manipulations to fully ai-generated content

    Rohit Kundu, Hao Xiong, Vishal Mohanty, Athula Balachandran, and Amit K. Roy-Chowdhury. Towards a universal synthetic video detector: From face or background manipulations to fully ai-generated content. In CVPR, 2025

  42. [42]

    Genvidbench: A challenging benchmark for detecting ai-generated video, 2024

    Zhen-Liang Ni, Qiangyu Yan, Tianning Yuan, Mouxiao Huang, Hailin Hu, Xinghao Chen, and Yunhe Wang. Genvidbench: A challenging benchmark for detecting ai-generated video, 2024

  43. [43]

    D3: Training-free ai-generated video detection using second-order features

    Chende Zheng, Ruiqi Suo, Chenhao Lin, Zhengyu Zhao, Le Yang, Shuai Liu, Minghui Yang, Cong Wang, and Chao Shen. D3: Training-free ai-generated video detection using second-order features. In ICCV, 2025

  44. [44]

    Vidguard-r1: Ai-generated video detection and explanation via reasoning mllms and rl, 2025

    Kyoungjun Park, Yifan Yang, Juheon Yi, Shicheng Zheng, Yifei Shen, Dongqi Han, Caihua Shan, Muhammad Muaz, and Lili Qiu. Vidguard-r1: Ai-generated video detection and explanation via reasoning mllms and rl, 2025. arXiv:2510.02282

  45. [45]

    Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning

    Yifei Li, Wenzhao Zheng, Yanran Zhang, Runze Sun, Yu Zheng, Lei Chen, Jie Zhou, and Jiwen Lu. Skyra: Ai-generated video detection via grounded artifact reasoning, 2025. arXiv:2512.15693

  46. [46]

    Videoveritas: Ai-generated video detection via perception pretext reinforcement learning, 2026

    Hao Tan, Jun Lan, Senyuan Shi, Zichang Tan, Zijian Yu, Huijia Zhu, Weiqiang Wang, Jun Wan, and Zhen Lei. Videoveritas: Ai-generated video detection via perception pretext reinforcement learning, 2026. arXiv:2602.08828

  47. [47]

    Busterx++: Towards unified cross-modal ai-generated content detection and explanation with mllm, 2025

    Haiquan Wen, Tianxiao Li, Zhenglin Huang, Yiwei He, and Guangliang Cheng. Busterx++: Towards unified cross-modal ai-generated content detection and explanation with mllm, 2025. arXiv:2507.14632

  48. [48]

    Loki: A comprehensive synthetic data detection benchmark using large multimodal models, 2025

    Junyan Ye, Baichuan Zhou, Zilong Huang, Junan Zhang, Tianyi Bai, Hengrui Kang, Jun He, Honglin Lin, Zihao Wang, Tong Wu, Zhizheng Wu, Yiping Chen, Dahua Lin, Conghui He, and Weijia Li. Loki: A comprehensive synthetic data detection benchmark using large multimodal models, 2025. arXiv:2410.09732

  49. [49]

    Ivy-fake: A unified explainable framework and benchmark for image and video aigc detection,

    Wayne Zhang, Changjiang Jiang, Zhonghao Zhang, Chenyang Si, Fengchang Yu, and Wei Peng. Ivy-fake: A unified explainable framework and benchmark for image and video aigc detection,

  50. [50]

    Aligngemini: Generalizable ai-generated image detection through task-model alignment, 2026

    Ruoxin Chen, Jiahui Gao, Kaiqing Lin, Keyue Zhang, Yandan Zhao, Isabel Guan, Taiping Yao, and Shouhong Ding. Aligngemini: Generalizable ai-generated image detection through task-model alignment, 2026

  51. [51]

    Aligned datasets improve detection of latent diffusion-generated images

    Anirudh Sundara Rajan, Utkarsh Ojha, Jedidiah Schloesser, and Yong Jae Lee. Aligned datasets improve detection of latent diffusion-generated images. In ICLR, 2025

  52. [52]

    Beyond artifacts: Real-centric envelope modeling for reliable ai-generated image detection, 2025

    Ruiqi Liu, Yi Han, Zhengbo Zhang, Liwei Yao, Zhiyuan Yan, Jialiang Shen, ZhiJin Chen, Boyi Sun, Lubin Weng, Jing Dong, Yan Wang, and Shu Wu. Beyond artifacts: Real-centric envelope modeling for reliable ai-generated image detection, 2025. arXiv:2512.20937

  53. [53]

    The Kinetics Human Action Video Dataset

    Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and Andrew Zisserman. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, pages 1–22, 2017

  54. [54]

    Vbench: Comprehensive benchmark suite for video generative models, 2023

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. Vbench: Comprehensive benchmark suite for video generative models, 2023. arXiv:2311.17982

  55. [55]

    Watch your up-convolution: Cnn based generative deep neural networks are failing to reproduce spectral distributions

    Ricard Durall, Margret Keuper, and Janis Keuper. Watch your up-convolution: Cnn based generative deep neural networks are failing to reproduce spectral distributions. In CVPR, 2020

  56. [56]

    Secret lies in color: Enhancing ai-generated images detection with color distribution analysis

    Zexi Jia, Chuanwei Huang, Yeshuang Zhu, Hongyan Fei, Xiaoyue Duan, Zhiqiang Yuan, Ying Deng, Jiapei Zhang, Jinchao Zhang, and Jie Zhou. Secret lies in color: Enhancing ai-generated images detection with color distribution analysis. In CVPR, pages 13445–13454, 2025

  57. [57]

    SAGA: Source Attribution of Generative AI Videos

    Rohit Kundu, Vishal Mohanty, Hao Xiong, Shan Jia, Athula Balachandran, and Amit K. Roy-Chowdhury. Saga: Source attribution of generative ai videos, 2025. arXiv:2511.12834

  58. [58]

    Learning human-perceived fakeness in ai-generated videos via multimodal llms, 2025

    Xingyu Fu, Siyi Liu, Yinuo Xu, Pan Lu, Guangqiuse Hu, Tianbo Yang, Taran Anantasagar, Christopher Shen, Yikai Mao, Yuanzhe Liu, Keyush Shah, Chung Un Lee, Yejin Choi, James Zou, Dan Roth, and Chris Callison-Burch. Learning human-perceived fakeness in ai-generated videos via multimodal llms, 2025. arXiv:2509.22646

  59. [59]

    Supervised contrastive learning

    Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. In NeurIPS, volume 33, pages 18661–18673, 2020

  60. [60]

    https://pika.art/, 2023

    Pika Labs. https://pika.art/, 2023. URL https://pika.art/

  61. [61]

    Raising the bar of ai-generated image detection with clip

    Davide Cozzolino, Giovanni Poggi, Riccardo Corvi, Matthias Nießner, and Luisa Verdoliva. Raising the bar of ai-generated image detection with clip. In CVPRW, 2024

  62. [62]

    Real-time deepfake detection in the real-world, 2024

    Bar Cavia, Eliahu Horwitz, Tal Reiss, and Yedid Hoshen. Real-time deepfake detection in the real-world, 2024. arXiv:2406.09398

  63. [63]

    Is artificial intelligence generated image detection a solved problem? In NeurIPS, 2025

    Ziqiang Li, Jiazhen Yan, Ziwen He, Kai Zeng, Weiwei Jiang, Lizhi Xiong, and Zhangjie Fu. Is artificial intelligence generated image detection a solved problem? In NeurIPS, 2025

  64. [64]

    Mirror: Manifold ideal reference reconstructor for generalizable ai-generated image detection, 2026

    Ruiqi Liu, Manni Cui, Ziheng Qin, Zhiyuan Yan, Ruoxin Chen, Yi Han, Zhiheng Li, Junkai Chen, ZhiJin Chen, Kaiqing Lin, Jialiang Shen, Lubin Weng, Jing Dong, Yan Wang, and Shu Wu. Mirror: Manifold ideal reference reconstructor for generalizable ai-generated image detection, 2026. arXiv:2602.02222

  65. [65]

    Learning Transferable Visual Models From Natural Language Supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021. arXiv:2103.00020

  66. [66]

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Jegou, Julien Mairal, Patrick L...

  67. [67]

    Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, Julie...

  68. [68]

    This design prevents catastrophic forgetting of pre-trained semantic knowledge while efficiently learning forgery patterns

    Effort [8] is a parameter-efficient method that leverages SVD to construct orthogonal subspaces. This design prevents catastrophic forgetting of pre-trained semantic knowledge while efficiently learning forgery patterns. Our implementation uses their released weights trained on SD1.4 with identical preprocessing

  69. [69]

    It implements this approach by training a ResNet-50 model for identification

    NPR [4] distinguishes synthetic images by analyzing low-level neighboring pixel relationships, which are characteristic of upsampling patterns in AI-generated content. It implements this approach by training a ResNet-50 model for identification. We use their official checkpoint for evaluation

  70. [70]

    RINE [38] improves AI-generated image detection by mapping intermediate CLIP features to a forgery-aware vector space. We utilize the author-released weights trained on the

    Table 9: Overview of the evaluation benchmarks. Our evaluation covers 14 benchmarks (4 video benchmarks, 5 image benchmarks, and 5 in-the-wild benchmarks), spanning a diverse range of...

  71. [71]

    The method incorporates content augmentation via inpainting and fine-tunes a DINOv2+reg ViT using large crops to preserve forensic signals

    B-Free [10] employs a self-conditioned diffusion reconstruction paradigm to enforce semantic alignment between real and synthetic images, thereby isolating differences to generation artifacts. The method incorporates content augmentation via inpainting and fine-tunes a DINOv2+reg ViT using large crops to preserve forensic signals. We utilize their provi...

  72. [72]

    Following the official code, we evaluate the two author-released implementations trained on SD1.4 and ProGAN

    CO-SPY [6] enhances semantic and artifact features for robust synthetic image detection. Following the official code, we evaluate the two author-released implementations trained on SD1.4 and ProGAN

  73. [73]

    DDA [9] aligns synthetic and real images in both pixel and frequency domains to mitigate spurious correlations and improve detector generalization. For evaluation, we utilize the author-released weights and identical settings, with the sole exception that we do not apply the same JPEG Q=96 compression to the test sets, to ensure a fair comparison with ot...