Video as Natural Augmentation: Towards Unified AI-Generated Image and Video Detection

Chenyang Jiang; Jingyong Su; Liangxu Su; Ming Tao; Shiyang Zhou; Tong Shao; Zhengcen Li

arxiv: 2605.21977 · v1 · pith:AYRHEXY6new · submitted 2026-05-21 · 💻 cs.CV · cs.AI

Pith reviewed 2026-05-22 07:56 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords AI-generated content detection · unified image video detection · natural augmentation · cross-modal contrastive learning · deepfake detection · generalization in detectors · robustness to processing shifts

The pith

Treating video frames as natural augmentations plus a cross-modal contrastive loss lets one detector handle both AI-generated images and videos with better robustness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies that top image detectors collapse on video frames due to everyday processing changes like compression, resizing, and color shifts, along with extra artifacts from video generators. It responds by training jointly on images and videos, using extracted frames as built-in physical augmentations and adding a contrastive objective that forces image and video features to share the same real-versus-fake boundary. This produces gains in both directions, stronger transfer to new sources, and top results on a wide range of benchmarks. A reader cares because the approach removes the need for separate image-only and video-only detectors while avoiding heavy hand-crafted augmentations or per-dataset retuning.

Core claim

VINA jointly trains on image and video data by using video frames as physically grounded natural augmentations and introduces a cross-modal supervised contrastive objective to align image and video representations under a shared real/fake decision boundary, delivering bidirectional gains, improved robustness and transferability, and state-of-the-art performance across 14 image, video, and in-the-wild benchmarks without complex augmentation or dataset-specific tuning.

What carries the argument

The VINA framework carries the argument: it treats video frames as physically grounded natural augmentations and adds a cross-modal supervised contrastive objective to align image and video representations.
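A minimal sketch of how a joint image and video batch could be assembled, assuming an even split between modalities and one randomly chosen frame per clip. The `Sample` record, `sample_joint_batch`, the 50/50 ratio, and the frame-picking rule are hypothetical illustrations; the source states only that images and video frames are trained on jointly and that decoded frames act as natural augmentations.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    source_id: str         # hypothetical identifier of an image file or video clip
    modality: str          # "image" or "video"
    label: int             # 0 = real, 1 = AI-generated
    frame_index: int = -1  # -1 for still images


def sample_joint_batch(images, videos, batch_size, rng):
    """Draw a mixed batch.

    images: list of (source_id, label)
    videos: list of (source_id, label, num_frames)
    The even split is an assumption, not a detail taken from the paper.
    """
    half = batch_size // 2
    batch = []
    for img_id, label in rng.sample(images, half):
        batch.append(Sample(img_id, "image", label))
    for vid_id, label, n_frames in rng.sample(videos, batch_size - half):
        # A decoded frame already carries codec, resize, and color-conversion
        # effects, so no synthetic augmentation is applied here.
        batch.append(Sample(vid_id, "video", label, rng.randrange(n_frames)))
    rng.shuffle(batch)
    return batch
```

Each sample keeps its modality tag so that a cross-modal loss can later tell image and video samples apart within the same batch.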

If this is right

A single model achieves improved detection accuracy on both image and video content.
Robustness increases against common video processing variations such as compression and resizing.
Transfer performance rises across different generation sources and pipelines without extra tuning.
State-of-the-art results hold on nearly all tested image, video, and mixed real-world benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The work implies that many current detectors are overly sensitive to modality-specific pipelines rather than learning core generative traces.
The same natural-augmentation idea could be tested on other shifting domains such as social-media compressed images or audio clips.
A unified detector reduces the engineering overhead of maintaining separate systems for mixed image-and-video environments.

Load-bearing premise

The cross-modal gap stems mainly from synthesis-agnostic processing shifts and model-specific video fingerprints, and video frames used as augmentations, together with contrastive alignment, will close that gap without creating new failure modes.

What would settle it

Apply the trained VINA model to frames from a previously unseen video generator that introduces new compression artifacts or fingerprints, and measure whether the accuracy gap between image and video inputs reappears or grows larger than it does for baseline methods.

Figures

Figures reproduced from arXiv: 2605.21977 by Chenyang Jiang, Jingyong Su, Liangxu Su, Ming Tao, Shiyang Zhou, Tong Shao, Zhengcen Li.

**Figure 1.** Motivation and benchmark performance of unified AIGC detection. Left: Asymmetric cross-modal failures motivate VINA to learn from both image and video data. Right: Average accuracy across image, video, and in-the-wild AIGC benchmarks shows that image-based detectors degrade sharply on videos, while joint training improves cross-modal generalization.

**Figure 2.** DCT AC coefficient distributions. H.264 and recompressed video frames exhibit sharper near-zero peaks than original JPEG images.
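The statistic behind Figure 2 can be illustrated with a small sketch. It assumes JPEG-style 8x8 blocks and an orthonormal DCT-II; the helper names, the `eps` threshold, and the pixel-domain-free `quantize` step are illustrative choices, not the authors' analysis code. Coarser quantization pushes more AC coefficients to zero, which is one plausible reading of the sharper near-zero peak.

```python
import math

N = 8  # JPEG-style block size (assumption)


def dct2_block(block):
    """Orthonormal 2D DCT-II of an 8x8 block given as a list of 8 rows."""
    def c(k):
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            s = 0.0
            for x in range(N):
                for y in range(N):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                          * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[u][v] = c(u) * c(v) * s
    return out


def ac_coefficients(coeffs):
    """All coefficients except the DC term at (0, 0)."""
    return [coeffs[u][v] for u in range(N) for v in range(N) if (u, v) != (0, 0)]


def quantize(values, step):
    """Uniform quantization of coefficients, a crude stand-in for a codec."""
    return [round(v / step) * step for v in values]


def near_zero_fraction(values, eps=0.5):
    """Share of coefficients whose magnitude is below eps."""
    return sum(1 for v in values if abs(v) < eps) / len(values)
```

Comparing `near_zero_fraction` before and after `quantize` on the same block gives a rough feel for how heavier compression concentrates the AC distribution around zero.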

**Figure 3.** RAPSD analysis of video and image datasets. Real videos exhibit significant high-frequency decay, reflecting compression and motion blur.

Adjacent plot residue (pixel-value histogram): x-axis Pixel Value (0-255), y-axis Normalized Frequency; series: Image (GenImage Real JPEG), Video (Kinetics), Video Limited Range (Kinetics).
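A sketch of the kind of profile Figure 3 plots, assuming RAPSD here denotes the radially averaged power spectral density. The naive DFT, the integer radius binning, and the function names are illustrative choices for tiny inputs; a real analysis would use an FFT library and the authors' exact binning is not given in the source.

```python
import cmath
import math


def power_spectrum(img):
    """Naive 2D DFT power spectrum; O(n^4), only suitable for tiny demo inputs."""
    h, w = len(img), len(img[0])
    ps = [[0.0] * w for _ in range(h)]
    for u in range(h):
        for v in range(w):
            s = 0j
            for x in range(h):
                for y in range(w):
                    s += img[x][y] * cmath.exp(-2j * math.pi * (u * x / h + v * y / w))
            ps[u][v] = abs(s) ** 2
    return ps


def rapsd(img):
    """Average power over rings of equal (rounded) radial frequency."""
    h, w = len(img), len(img[0])
    ps = power_spectrum(img)
    n_bins = min(h, w) // 2 + 1
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for u in range(h):
        for v in range(w):
            fu = min(u, h - u)  # fold index into a frequency magnitude
            fv = min(v, w - v)
            r = int(round(math.hypot(fu, fv)))
            if r < n_bins:
                sums[r] += ps[u][v]
                counts[r] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]
```

A profile that falls off quickly toward the last bins corresponds to the high-frequency decay the caption attributes to compression and motion blur.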

**Figure 5.** Robustness Analysis Across Various Perturbations on Established Image Benchmarks. Degradations include JPEG compression, HEIF compression, scaling, and cropping. Our method consistently outperforms other detectors, demonstrating superior robustness.

**Figure 6.** 2D power spectra of reconstructed noise. Spectra are averaged over 1,000 samples. Videos exhibit a distinct axial cross pattern, in contrast to the broader spectral spread of images. Generated images show periodic grid artifacts, while generated videos display a different configuration of spectral anomalies. This highlights a fundamental divergence in the frequency-space fingerprints of video and image gen…

**Figure 7.** Average Fourier power spectra across different generative models and datasets. The visualizations compare the frequency-domain characteristics of real data (leftmost column) against various image and video generation frameworks. Note the unique structural artifacts (e.g., grid patterns) present in the generated samples across different model families.

**Figure 8.** Ablation Study on SupCon Loss Weight. We vary the supervised contrastive loss weight λ and report AVG ACC, computed as the average accuracy over all 14 benchmarks. CM-SupCon defines positives across modalities with the same real/fake label, whereas vanilla SupCon uses all same-label samples as positives. CM-SupCon achieves the best result at λ = 0.05.

Original abstract

AI-generated content (AIGC) is rapidly improving, creating an urgent need for detectors that generalize across data sources, deployment pipelines, and visual modalities. A strongly generalizable detector should remain robust under distributional variations. However, we identify a consistent failure mode: SOTA AI-generated image detectors often collapse when applied to frames extracted from videos. Through systematic analysis, we show that this cross-modal gap arises from both entangled synthesis-agnostic video processing shifts, including color conversion, codec compression, resizing, and blur, and model-specific fingerprints introduced by modern video generators. Motivated by these findings, we propose VINA (Video as Natural Augmentation), a unified AIGC detection framework that jointly trains on image and video data. VINA uses video frames as physically grounded natural augmentations and further introduces a cross-modal supervised contrastive objective to align image and video representations under a shared real/fake decision boundary. Extensive experiments on 14 image, video, and in-the-wild benchmarks show that VINA delivers bidirectional gains, improves robustness and transferability, and achieves state-of-the-art performance across nearly all evaluated settings without complex augmentation or dataset-specific tuning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

VINA gives a workable way to train one detector on both images and videos by treating frames as natural augmentations and adding cross-modal contrastive alignment, but the reported gains rest on thin experimental details.

The main thing to know is that this paper spots a clear failure mode where strong image AIGC detectors degrade when fed video frames, then fixes it by joint training that uses real video processing as augmentation and a supervised contrastive term to keep the real/fake boundary consistent across modalities. They report SOTA numbers on 14 benchmarks with gains in both directions and better robustness, all without heavy tuning or synthetic augmentations. That combination of analysis plus the specific training recipe looks like the actual new piece here rather than a full paradigm shift.

The analysis of synthesis-agnostic shifts like codec and resize plus generator fingerprints is straightforward and useful for motivating the method. Treating video frames as grounded augmentations is a clean idea that avoids inventing new transforms, and the contrastive objective is a direct way to share the decision boundary. If the numbers check out with proper controls, the approach could cut down on maintaining separate models for platforms that see both images and video.

The soft spots sit mostly on the experimental side. The abstract claims broad wins but gives no error bars, exact splits, or checks for post-hoc selection, so the central performance story still needs verification from the full results and ablations. The stress-test concern about the contrastive term possibly washing out modality-specific cues is worth watching; if alignment over-smooths, it could hurt on generators with strong video-only fingerprints or on in-the-wild data whose shifts differ from the training set. Minor implementation details, such as how they balance the losses or sample frames, would also help judge reproducibility.

This paper is for people building or evaluating detectors for generated media who deal with mixed image-video pipelines in practice. A reader focused on generalization fixes and simple training tricks would get concrete value from the method and the benchmark spread.

It deserves a serious referee because the problem is timely, the motivation is grounded in observed failures, and the proposed fix is testable even if the current evidence is preliminary. I would send it out for review to get the experiments properly stress-tested.

Referee Report

3 major / 2 minor

Summary. The manuscript identifies a consistent cross-modal failure mode where SOTA AI-generated image detectors collapse on frames extracted from AI-generated videos. It attributes this gap to synthesis-agnostic processing shifts (color conversion, codec compression, resizing, blur) together with model-specific video-generator fingerprints. To close the gap, the authors propose VINA, a joint image-video training framework that treats video frames as physically grounded natural augmentations and adds a cross-modal supervised contrastive objective to enforce a shared real/fake decision boundary. Experiments on 14 image, video, and in-the-wild benchmarks are reported to yield bidirectional gains, improved robustness/transferability, and SOTA performance without complex augmentations or dataset-specific tuning.

Significance. If the reported gains prove reproducible, the work would be a useful practical contribution to AIGC detection by offering a lightweight, modality-bridging recipe that exploits naturally occurring video variations rather than hand-crafted augmentations. The bidirectional improvement and emphasis on in-the-wild settings address a real deployment need.

major comments (3)

Experiments section (results on 14 benchmarks): the central claim of consistent SOTA performance and bidirectional gains rests on the reported numbers, yet no error bars, standard deviations across runs, exact train/val/test splits, or confirmation that post-hoc model selection was avoided are supplied. This information is load-bearing for assessing whether the gains are reliable and generalizable.
§3.3 (cross-modal supervised contrastive objective): the design assumes that explicit alignment under a shared boundary will close the identified gap without suppressing modality-specific cues (e.g., generator fingerprints that remain useful on certain video synthesizers or on in-the-wild data whose shift distribution differs from the training videos). No ablation or analysis tests this assumption or quantifies whether over-alignment introduces new failure modes.
§2 (gap analysis): the systematic attribution of the cross-modal collapse to the listed processing shifts plus fingerprints is plausible, but the paper provides no controlled decomposition (e.g., isolating each shift's contribution via synthetic pipelines) that would justify why video frames as natural augmentations plus contrastive alignment is the targeted remedy rather than other interventions.
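The first major comment asks for run-to-run variability. A minimal sketch of the reporting it implies, assuming accuracies are collected per benchmark over several random seeds; `summarize_runs` and the input layout are hypothetical, and the sample standard deviation (n - 1 denominator) is one reasonable choice among several.

```python
import statistics


def summarize_runs(runs):
    """Map each benchmark name to (mean, sample standard deviation).

    runs: dict of benchmark name -> list of accuracies, one per seed.
    At least two seeds are required for a sample standard deviation.
    """
    summary = {}
    for name, accs in runs.items():
        if len(accs) < 2:
            raise ValueError(f"{name}: need at least two runs, got {len(accs)}")
        summary[name] = (statistics.mean(accs), statistics.stdev(accs))
    return summary


def format_summary(summary):
    """Render one 'name: mean ± std' line per benchmark, in percent."""
    return [f"{name}: {100 * m:.2f} ± {100 * s:.2f}"
            for name, (m, s) in sorted(summary.items())]
```

Reporting the spread next to each mean makes it possible to judge whether a gap between two detectors exceeds seed noise.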

minor comments (2)

Figure captions describing the processing shifts or the contrastive loss could be expanded with explicit annotations or step-by-step illustrations to improve clarity for readers unfamiliar with the video pipeline.
Notation for the supervised contrastive loss (positive/negative pair construction across modalities) would benefit from a small diagram or explicit pseudocode to avoid ambiguity.
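The pair construction the second minor comment asks about can be sketched as follows. The positive-set rule follows the Figure 8 caption (same real/fake label, different modality for CM-SupCon; all same-label samples for vanilla SupCon). Everything else is an assumption borrowed from the standard supervised contrastive loss: L2-normalized embeddings, a temperature of 0.1, all other samples in the denominator, and skipping anchors without positives. It is not the authors' confirmed formulation.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _normalize(v):
    n = math.sqrt(_dot(v, v))
    return [x / n for x in v]


def positive_sets(labels, modalities, cross_modal=True):
    """Indices of positives for each anchor.

    cross_modal=True  -> same label AND different modality (CM-SupCon)
    cross_modal=False -> same label, any modality (vanilla SupCon)
    """
    n = len(labels)
    return [
        [p for p in range(n)
         if p != i and labels[p] == labels[i]
         and (not cross_modal or modalities[p] != modalities[i])]
        for i in range(n)
    ]


def supcon_loss(embeddings, labels, modalities, temperature=0.1, cross_modal=True):
    """Mean over anchors of -1/|P(i)| * sum_p log softmax_i(p)."""
    z = [_normalize(e) for e in embeddings]
    pos = positive_sets(labels, modalities, cross_modal)
    total, anchors = 0.0, 0
    for i, p_i in enumerate(pos):
        if not p_i:
            continue  # anchors without positives contribute nothing
        logits = {a: _dot(z[i], z[a]) / temperature
                  for a in range(len(z)) if a != i}
        m = max(logits.values())  # log-sum-exp shift for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits.values()))
        total += -sum(logits[p] - log_denom for p in p_i) / len(p_i)
        anchors += 1
    return total / anchors if anchors else 0.0
```

Under this sketch the training objective would be a classification loss plus λ times `supcon_loss`, with λ = 0.05 being the best weight reported in Figure 8.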

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments, which help strengthen the paper. We address each major comment below and indicate the revisions we will make.

Referee: Experiments section (results on 14 benchmarks): the central claim of consistent SOTA performance and bidirectional gains rests on the reported numbers, yet no error bars, standard deviations across runs, exact train/val/test splits, or confirmation that post-hoc model selection was avoided are supplied. This information is load-bearing for assessing whether the gains are reliable and generalizable.

Authors: We agree that statistical reliability measures are essential. In the revised manuscript we will add error bars and standard deviations computed over multiple independent runs (different random seeds) for all reported results on the 14 benchmarks. We will also document the exact train/val/test splits and confirm that hyperparameter and model selection were performed exclusively on validation data with test sets strictly held out.
Revision: yes

Referee: §3.3 (cross-modal supervised contrastive objective): the design assumes that explicit alignment under a shared boundary will close the identified gap without suppressing modality-specific cues (e.g., generator fingerprints that remain useful on certain video synthesizers or on in-the-wild data whose shift distribution differs from the training videos). No ablation or analysis tests this assumption or quantifies whether over-alignment introduces new failure modes.

Authors: We acknowledge the need to verify that the contrastive objective does not erase useful modality-specific signals. We will add an ablation that removes the cross-modal contrastive loss and directly compares performance on video-generator-specific and in-the-wild subsets. We will also include representation analysis (e.g., t-SNE or fingerprint retention metrics) to quantify whether generator-specific cues are preserved and to check for any new failure modes introduced by over-alignment.
Revision: yes

Referee: §2 (gap analysis): the systematic attribution of the cross-modal collapse to the listed processing shifts plus fingerprints is plausible, but the paper provides no controlled decomposition (e.g., isolating each shift's contribution via synthetic pipelines) that would justify why video frames as natural augmentations plus contrastive alignment is the targeted remedy rather than other interventions.

Authors: We agree a more granular decomposition would be informative. While constructing full synthetic pipelines for every individual shift is beyond the scope of the current work, we will expand Section 2 with quantitative measurements of detector degradation under each observed processing shift (color conversion, compression, resizing, blur) drawn from our existing data. This will provide stronger empirical grounding for why video frames serve as effective natural augmentations and why the contrastive alignment is a suitable remedy.
Revision: partial
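The per-shift measurement promised in the last response could take roughly the following shape. This is a hedged sketch: `detector` is any callable returning 0 (real) or 1 (fake), `shifts` maps a name to an image transform, and both are placeholders for whatever detector and processing operations the authors actually use.

```python
def accuracy(detector, images, labels):
    """Fraction of images whose predicted label matches the ground truth."""
    if not labels or len(images) != len(labels):
        raise ValueError("images and labels must be non-empty and equal in length")
    correct = sum(int(detector(img) == y) for img, y in zip(images, labels))
    return correct / len(labels)


def degradation_by_shift(detector, images, labels, shifts):
    """Accuracy drop caused by each shift when applied in isolation.

    Returns a dict of shift name -> (clean accuracy - shifted accuracy).
    A positive value means the shift hurts the detector.
    """
    baseline = accuracy(detector, images, labels)
    report = {"none": 0.0}
    for name, shift in shifts.items():
        shifted = [shift(img) for img in images]
        report[name] = baseline - accuracy(detector, shifted, labels)
    return report
```

Applying one shift at a time gives the isolated contributions the referee asked for; it does not capture interactions between shifts, which the abstract describes as entangled.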

Circularity Check

0 steps flagged

No significant circularity; derivation remains self-contained

full rationale

The paper identifies a cross-modal performance gap through systematic analysis of processing shifts and generator fingerprints, then proposes VINA as a training framework that treats video frames as natural augmentations and adds a supervised contrastive loss to align representations. No load-bearing step reduces the claimed bidirectional gains or SOTA results to a fitted parameter defined inside the paper, a self-citation chain, or a re-expression of the evaluation metric. The contrastive objective is introduced as an independent alignment signal rather than a tautological restatement of the real/fake boundary, and the method is presented without invoking uniqueness theorems or ansatzes from prior author work that would collapse the argument. The overall chain is therefore independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework assumes that video processing artifacts and generator fingerprints are the dominant sources of the observed cross-modal gap and that a shared contrastive boundary will transfer without new domain-specific failure modes. No new physical entities are introduced.

axioms (1)

Domain assumption: Video frames extracted from AI-generated videos can be treated as physically grounded natural augmentations of the corresponding image content.
Invoked in the motivation and method description to justify joint training.

pith-pipeline@v0.9.0 · 5753 in / 1242 out tokens · 33850 ms · 2026-05-22T07:56:48.118827+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: VINA uses video frames as physically grounded natural augmentations and further introduces a cross-modal supervised contrastive objective to align image and video representations under a shared real/fake decision boundary.

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_strictMono_of_one_lt · unclear
Paper passage: We model video frames as image signals subjected to a complex, non-differentiable degradation function T(·).
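To make the degradation function T(·) in the passage above concrete, here is a toy composition on a grayscale image stored as a list of rows. The four stages mirror the shifts named in the abstract (color conversion, blur, resizing, codec compression), but their order, the parameters, and the pixel-domain `quantize` stand-in are all assumptions; real codecs such as H.264 quantize transform coefficients, not pixels. The rounding steps are what make the composite non-differentiable.

```python
def to_limited_range(img):
    """Map 8-bit full-range luma 0..255 to limited (studio) range 16..235."""
    return [[16 + round(v * 219 / 255) for v in row] for row in img]


def box_blur(img):
    """3x3 mean filter with the window clipped at the borders."""
    h, w = len(img), len(img[0])
    out = []
    for x in range(h):
        row = []
        for y in range(w):
            nb = [img[i][j]
                  for i in range(max(0, x - 1), min(h, x + 2))
                  for j in range(max(0, y - 1), min(w, y + 2))]
            row.append(sum(nb) / len(nb))
        out.append(row)
    return out


def resize_down_up(img, factor=2):
    """Nearest-neighbor downsample by `factor`, then upsample back."""
    h, w = len(img), len(img[0])
    return [[img[(x // factor) * factor][(y // factor) * factor]
             for y in range(w)] for x in range(h)]


def quantize(img, step=8):
    """Uniform pixel quantization, a crude stand-in for codec loss."""
    return [[round(v / step) * step for v in row] for row in img]


def degrade(img):
    """Toy T(.): color range conversion, blur, resize, then quantization."""
    return quantize(resize_down_up(box_blur(to_limited_range(img))))
```

Because a decoded video frame has already passed through a real pipeline of this kind, using frames directly avoids having to hand-tune a synthetic chain like the one above.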

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · 12 internal anchors

[1] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis, 2024. arXiv:2403.03206
[2] Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators, 2024. URL https://openai.com/index/sora/
[3] Tao Wang, Yushu Zhang, Shuren Qi, Ruoyu Zhao, Zhihua Xia, and Jian Weng. Security and privacy on generative data in aigc: A survey. ACM Computing Surveys, 57(4):1–34, 2024
[4] Chuangchuang Tan, Yao Zhao, Shikui Wei, Guanghua Gu, Ping Liu, and Yunchao Wei. Rethinking the up-sampling operations in cnn-based generative network for generalizable deepfake detection. In CVPR, pages 28130–28139, 2024
[5] Ouxiang Li, Jiayin Cai, Yanbin Hao, Xiaolong Jiang, Yao Hu, and Fuli Feng. Improving synthetic image detection towards generalization: An image transformation perspective. In KDD, 2025
[6] Siyuan Cheng, Lingjuan Lyu, Zhenting Wang, Xiangyu Zhang, and Vikash Sehwag. Co-spy: Combining semantic and pixel features to detect synthetic images by ai. In CVPR, 2025
[7] Utkarsh Ojha, Yuheng Li, and Yong Jae Lee. Towards universal fake image detectors that generalize across generative models. In CVPR, pages 24480–24489, 2023
[8] Zhiyuan Yan, Jiangming Wang, Peng Jin, Ke-Yue Zhang, Chengchun Liu, Shen Chen, Taiping Yao, Shouhong Ding, Baoyuan Wu, and Li Yuan. Orthogonal subspace decomposition for generalizable ai-generated image detection. In ICML, 2025
[9] Ruoxin Chen, Junwei Xi, Zhiyuan Yan, Ke-Yue Zhang, Shuang Wu, Jingyi Xie, Xu Chen, Lei Xu, Isabel Guan, Taiping Yao, and Shouhong Ding. Dual data alignment makes ai-generated image detector easier generalizable. In NeurIPS, 2025
[10] Fabrizio Guillaro, Giada Zingarini, Ben Usman, Avneesh Sud, Davide Cozzolino, and Luisa Verdoliva. A bias-free training paradigm for more general ai-generated image detection. In CVPR, 2025
[11] Riccardo Corvi, Davide Cozzolino, Ekta Prashnani, Shalini De Mello, Koki Nagano, and Luisa Verdoliva. Seeing what matters: Generalizable ai-generated video detection with forensic-oriented augmentation. In NeurIPS, 2025
[12] Davide Cozzolino, Giovanni Poggi, Matthias Nießner, and Luisa Verdoliva. Zero-shot detection of ai-generated images. In ECCV, 2024
[13] Shilin Yan, Ouxiang Li, Jiayin Cai, Yanbin Hao, Xiaolong Jiang, Yao Hu, and Weidi Xie. A sanity check for ai-generated image detection. In ICLR, 2025
[14] Chunxiao Li, Xiaoxiao Wang, Meiling Li, Boming Miao, Peng Sun, Yunjian Zhang, Xiangyang Ji, and Yao Zhu. Bridging the gap between ideal and real-world evaluation: Benchmarking ai-generated image detection in challenging scenarios. In ICCV, 2025
[15] Haoxing Chen, Yan Hong, Zizheng Huang, Zhuoer Xu, Zhangxuan Gu, Yaohui Li, Jun Lan, Huijia Zhu, Jianfu Zhang, Weiqiang Wang, and Huaxiong Li. Demamba: Ai-generated video detection on million-scale genvideo benchmark, 2024. arXiv:2405.19707
[16] Shuhai Zhang, ZiHao Lian, Jiahao Yang, Daiyuan Li, Guoxuan Pang, Feng Liu, Bo Han, Shutao Li, and Mingkui Tan. Physics-driven spatiotemporal modeling for ai-generated video detection. In NeurIPS, 2025
[17] Weiyun Wang, Min Shi, Qingyun Li, Wenhai Wang, Zhenhang Huang, Linjie Xing, Zhe Chen, Hao Li, Xizhou Zhu, Zhiguo Cao, Yushi Chen, Tong Lu, Jifeng Dai, and Yu Qiao. The all-seeing project: Towards panoptic visual recognition and understanding of the open world, 2023. arXiv:2308.01907
[18] Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew Owens, and Alexei A. Efros. Cnn-generated images are surprisingly easy to spot... for now. In CVPR, pages 8695–8704, 2020
[19] Mingjian Zhu, Hanting Chen, Qiangyu Yan, Xudong Huang, Guanyu Lin, Wei Li, Zhijun Tu, Hailin Hu, Jie Hu, and Yunhe Wang. Genimage: A million-scale benchmark for detecting ai-generated image. In NeurIPS, 2023
[20] Zhengcen Li, Chenyang Jiang, Hang Zhao, Shiyang Zhou, Yunyang Mo, Feng Gao, Fan Yang, Qiben Shan, Shaocong Wu, and Jingyong Su. Preserving forgery artifacts: AI-generated video detection at native scale. In ICLR, 2026. URL https://openreview.net/forum?id=XD43lfRCg6
[21] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. In NeurIPS, 2014
[22] Diederik P. Kingma and Max Welling. Auto-encoding variational bayes, 2022. arXiv:1312.6114
[23] Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning, 2018. arXiv:1711.00937
[24] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020
[25] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, pages 10684–10695, 2022
[26] William Peebles and Saining Xie. Scalable diffusion models with transformers. In ICCV, 2023
[27] Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction. In NeurIPS, 2024
[28] Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, and Haoqi Fan. Emerging properties in unified multimodal pretraining, 2025. arXiv:2505.14683
[29] Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, and Yaniv Taigman. Make-a-video: Text-to-video generation without text-video data, 2022. arXiv:2209.14792
[30] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, Varun Jampani, and Robin Rombach. Stable video diffusion: Scaling latent video diffusion models to large datasets, 2023. arXiv:2311.15127
[31]

Wan: Open and Advanced Large-Scale Video Generative Models

WanTeam, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, Ti...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[32]

https://klingai.kuaishou.com, 2024

Kuaishou. https://klingai.kuaishou.com, 2024. URLhttps://klingai.kuaishou.com

work page 2024
[33]

Seedance 1.0: Exploring the boundaries of video generation models, 2025

Yu Gao, Haoyuan Guo, Tuyen Hoang, Weilin Huang, Lu Jiang, Fangyuan Kong, Huixia Li, Jiashi Li, Liang Li, Xiaojie Li, Xunsong Li, Yifu Li, Shanchuan Lin, Zhijie Lin, Jiawei Liu, Shu Liu, Xiaonan Nie, Zhiwu Qing, Yuxi Ren, Li Sun, Zhi Tian, Rui Wang, Sen Wang, Guoqiang Wei, Guohong Wu, Jie Wu, Ruiqi Xia, Fei Xiao, Xuefeng Xiao, Jiangqiao Yan, Ceyuan Yang, J...

work page 2025
[34]

Dire for diffusion-generated image detection

Zhendong Wang, Jianmin Bao, Wengang Zhou, Weilun Wang, Hezhen Hu, Hong Chen, and Houqiang Li. Dire for diffusion-generated image detection. InICCV, pages 22445–22455, 2023

work page 2023
[35]

Drct: Diffusion reconstruction contrastive training towards universal detection of diffusion generated images

Baoying Chen, Jishen Zeng, Jianquan Yang, and Rui Yang. Drct: Diffusion reconstruction contrastive training towards universal detection of diffusion generated images. InICML, 2024

work page 2024
[36]

$\bf{d^3}$qe: Learning discrete distribution discrepancy-aware quantization error for autoregressive-generated image detection

Yanran Zhang, Bingyao Yu, Yu Zheng, Wenzhao Zheng, Yueqi Duan, Lei Chen, Jie Zhou, and Jiwen Lu. $\bf{d^3}$qe: Learning discrete distribution discrepancy-aware quantization error for autoregressive-generated image detection. InICCV, 2025

work page 2025
[37]

Forgery-aware adaptive transformer for generalizable synthetic image detection

Huan Liu, Zichang Tan, Chuangchuang Tan, Yunchao Wei, Jingdong Wang, and Yao Zhao. Forgery-aware adaptive transformer for generalizable synthetic image detection. InCVPR, pages 10770–10780, 2024

work page 2024
[38] Christos Koutlis and Symeon Papadopoulos. Leveraging representations from intermediate encoder-blocks for synthetic image detection. In ECCV, 2024.
[39] Chuangchuang Tan, Renshuai Tao, Huan Liu, Guanghua Gu, Baoyuan Wu, Yao Zhao, and Yunchao Wei. C2p-clip: Injecting category common prompt in clip to enhance generalization in deepfake detection. In AAAI, volume 39, pages 7184–7192, 2025.
[40] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. In ICLR, 2018.
[41] Rohit Kundu, Hao Xiong, Vishal Mohanty, Athula Balachandran, and Amit K. Roy-Chowdhury. Towards a universal synthetic video detector: From face or background manipulations to fully ai-generated content. In CVPR, 2025.
[42] Zhen-Liang Ni, Qiangyu Yan, Tianning Yuan, Mouxiao Huang, Hailin Hu, Xinghao Chen, and Yunhe Wang. Genvidbench: A challenging benchmark for detecting ai-generated video, 2024.
[43] Chende Zheng, Ruiqi Suo, Chenhao Lin, Zhengyu Zhao, Le Yang, Shuai Liu, Minghui Yang, Cong Wang, and Chao Shen. D3: Training-free ai-generated video detection using second-order features. In ICCV, 2025.
[44] Kyoungjun Park, Yifan Yang, Juheon Yi, Shicheng Zheng, Yifei Shen, Dongqi Han, Caihua Shan, Muhammad Muaz, and Lili Qiu. Vidguard-r1: Ai-generated video detection and explanation via reasoning mllms and rl, 2025. arXiv:2510.02282.
[45] Yifei Li, Wenzhao Zheng, Yanran Zhang, Runze Sun, Yu Zheng, Lei Chen, Jie Zhou, and Jiwen Lu. Skyra: Ai-generated video detection via grounded artifact reasoning, 2025. arXiv:2512.15693.
[46] Hao Tan, Jun Lan, Senyuan Shi, Zichang Tan, Zijian Yu, Huijia Zhu, Weiqiang Wang, Jun Wan, and Zhen Lei. Videoveritas: Ai-generated video detection via perception pretext reinforcement learning, 2026. arXiv:2602.08828.
[47] Haiquan Wen, Tianxiao Li, Zhenglin Huang, Yiwei He, and Guangliang Cheng. Busterx++: Towards unified cross-modal ai-generated content detection and explanation with mllm, 2025. arXiv:2507.14632.
[48] Junyan Ye, Baichuan Zhou, Zilong Huang, Junan Zhang, Tianyi Bai, Hengrui Kang, Jun He, Honglin Lin, Zihao Wang, Tong Wu, Zhizheng Wu, Yiping Chen, Dahua Lin, Conghui He, and Weijia Li. Loki: A comprehensive synthetic data detection benchmark using large multimodal models, 2025. arXiv:2410.09732.
[49] Wayne Zhang, Changjiang Jiang, Zhonghao Zhang, Chenyang Si, Fengchang Yu, and Wei Peng. Ivy-fake: A unified explainable framework and benchmark for image and video aigc detection,
[50] Ruoxin Chen, Jiahui Gao, Kaiqing Lin, Keyue Zhang, Yandan Zhao, Isabel Guan, Taiping Yao, and Shouhong Ding. Aligngemini: Generalizable ai-generated image detection through task-model alignment, 2026.
[51] Anirudh Sundara Rajan, Utkarsh Ojha, Jedidiah Schloesser, and Yong Jae Lee. Aligned datasets improve detection of latent diffusion-generated images. In ICLR, 2025.
[52] Ruiqi Liu, Yi Han, Zhengbo Zhang, Liwei Yao, Zhiyuan Yan, Jialiang Shen, ZhiJin Chen, Boyi Sun, Lubin Weng, Jing Dong, Yan Wang, and Shu Wu. Beyond artifacts: Real-centric envelope modeling for reliable ai-generated image detection, 2025. arXiv:2512.20937.
[53] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and Andrew Zisserman. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, pages 1–22, 2017.
[54] Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. Vbench: Comprehensive benchmark suite for video generative models, 2023. arXiv:2311.17982.
[55] Ricard Durall, Margret Keuper, and Janis Keuper. Watch your up-convolution: Cnn based generative deep neural networks are failing to reproduce spectral distributions. In CVPR, 2020.
[56] Zexi Jia, Chuanwei Huang, Yeshuang Zhu, Hongyan Fei, Xiaoyue Duan, Zhiqiang Yuan, Ying Deng, Jiapei Zhang, Jinchao Zhang, and Jie Zhou. Secret lies in color: Enhancing ai-generated images detection with color distribution analysis. In CVPR, pages 13445–13454, 2025.
[57] Rohit Kundu, Vishal Mohanty, Hao Xiong, Shan Jia, Athula Balachandran, and Amit K. Roy-Chowdhury. Saga: Source attribution of generative ai videos, 2025. arXiv:2511.12834.
[58] Xingyu Fu, Siyi Liu, Yinuo Xu, Pan Lu, Guangqiuse Hu, Tianbo Yang, Taran Anantasagar, Christopher Shen, Yikai Mao, Yuanzhe Liu, Keyush Shah, Chung Un Lee, Yejin Choi, James Zou, Dan Roth, and Chris Callison-Burch. Learning human-perceived fakeness in ai-generated videos via multimodal llms, 2025. arXiv:2509.22646.
[59] Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. In NeurIPS, volume 33, pages 18661–18673, 2020.
[60] Pika Labs. https://pika.art/, 2023. URL https://pika.art/
[61] Davide Cozzolino, Giovanni Poggi, Riccardo Corvi, Matthias Nießner, and Luisa Verdoliva. Raising the bar of ai-generated image detection with clip. In CVPRW, 2024.
[62] Bar Cavia, Eliahu Horwitz, Tal Reiss, and Yedid Hoshen. Real-time deepfake detection in the real-world, 2024. arXiv:2406.09398.
[63] Ziqiang Li, Jiazhen Yan, Ziwen He, Kai Zeng, Weiwei Jiang, Lizhi Xiong, and Zhangjie Fu. Is artificial intelligence generated image detection a solved problem? In NeurIPS, 2025.
[64] Ruiqi Liu, Manni Cui, Ziheng Qin, Zhiyuan Yan, Ruoxin Chen, Yi Han, Zhiheng Li, Junkai Chen, ZhiJin Chen, Kaiqing Lin, Jialiang Shen, Lubin Weng, Jing Dong, Yan Wang, and Shu Wu. Mirror: Manifold ideal reference reconstructor for generalizable ai-generated image detection, 2026. arXiv:2602.02222.
[65] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021. arXiv:2103.00020.
[66] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Jegou, Julien Mairal, Patrick L..., 2023
[67] Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, Julie..., 2025
Effort [8] is a parameter-efficient method that leverages SVD to construct orthogonal subspaces. This design prevents catastrophic forgetting of pre-trained semantic knowledge while efficiently learning forgery patterns. Our implementation uses their released weights trained on SD1.4 with identical preprocessing.

NPR [4] distinguishes synthetic images by analyzing low-level neighboring pixel relationships, which are characteristic of upsampling patterns in AI-generated content. It implements this approach by training a ResNet-50 model for identification. We use their official checkpoint for evaluation.

RINE [38] improves AI-generated image detection by mapping intermediate CLIP features to a forgery-aware vector space. We utilize the author-released weights trained on the

Table 9: Overview of the evaluation benchmarks. Our evaluation covers 14 benchmarks (4 video benchmarks, 5 image benchmarks, and 5 in-the-wild benchmarks), spanning a diverse range of...

B-Free [10] employs a self-conditioned diffusion reconstruction paradigm to enforce semantic alignment between real and synthetic images, thereby isolating differences to generation artifacts. The method incorporates content augmentation via inpainting and fine-tunes a DINOv2+reg ViT using large crops to preserve forensic signals. We utilize their provi...

CO-SPY [6] enhances semantic and artifact features for robust synthetic image detection. Following the official code, we evaluate the two author-released implementations trained on SD1.4 and ProGAN.

DDA [9] aligns synthetic and real images in both pixel and frequency domains to mitigate spurious correlations and improve detector generalization. For evaluation, we utilize the author-released weights and identical settings, with the sole exception that we forgo applying the same JPEG Q=96 compression on the test sets to ensure a fair comparison with ot...

[1] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis, 2024. arXiv:2403.03206.

[2] Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators, 2024. URL https://openai.com/index/sora/

[3] Tao Wang, Yushu Zhang, Shuren Qi, Ruoyu Zhao, Zhihua Xia, and Jian Weng. Security and privacy on generative data in aigc: A survey. ACM Computing Surveys, 57(4):1–34, 2024.

[4] Chuangchuang Tan, Yao Zhao, Shikui Wei, Guanghua Gu, Ping Liu, and Yunchao Wei. Rethinking the up-sampling operations in cnn-based generative network for generalizable deepfake detection. In CVPR, pages 28130–28139, 2024.

[5] Ouxiang Li, Jiayin Cai, Yanbin Hao, Xiaolong Jiang, Yao Hu, and Fuli Feng. Improving synthetic image detection towards generalization: An image transformation perspective. In KDD, 2025.

[6] Siyuan Cheng, Lingjuan Lyu, Zhenting Wang, Xiangyu Zhang, and Vikash Sehwag. Co-spy: Combining semantic and pixel features to detect synthetic images by ai. In CVPR, 2025.

[7] Utkarsh Ojha, Yuheng Li, and Yong Jae Lee. Towards universal fake image detectors that generalize across generative models. In CVPR, pages 24480–24489, 2023.

[8] Zhiyuan Yan, Jiangming Wang, Peng Jin, Ke-Yue Zhang, Chengchun Liu, Shen Chen, Taiping Yao, Shouhong Ding, Baoyuan Wu, and Li Yuan. Orthogonal subspace decomposition for generalizable ai-generated image detection. In ICML, 2025.

[9] Ruoxin Chen, Junwei Xi, Zhiyuan Yan, Ke-Yue Zhang, Shuang Wu, Jingyi Xie, Xu Chen, Lei Xu, Isabel Guan, Taiping Yao, and Shouhong Ding. Dual data alignment makes ai-generated image detector easier generalizable. In NeurIPS, 2025.

[10] Fabrizio Guillaro, Giada Zingarini, Ben Usman, Avneesh Sud, Davide Cozzolino, and Luisa Verdoliva. A bias-free training paradigm for more general ai-generated image detection. In CVPR, 2025.

[11] Riccardo Corvi, Davide Cozzolino, Ekta Prashnani, Shalini De Mello, Koki Nagano, and Luisa Verdoliva. Seeing what matters: Generalizable ai-generated video detection with forensic-oriented augmentation. In NeurIPS, 2025.

[12] Davide Cozzolino, Giovanni Poggi, Matthias Nießner, and Luisa Verdoliva. Zero-shot detection of ai-generated images. In ECCV, 2024.

[13] Shilin Yan, Ouxiang Li, Jiayin Cai, Yanbin Hao, Xiaolong Jiang, Yao Hu, and Weidi Xie. A sanity check for ai-generated image detection. In ICLR, 2025.

[14] Chunxiao Li, Xiaoxiao Wang, Meiling Li, Boming Miao, Peng Sun, Yunjian Zhang, Xiangyang Ji, and Yao Zhu. Bridging the gap between ideal and real-world evaluation: Benchmarking ai-generated image detection in challenging scenarios. In ICCV, 2025.

[15] Haoxing Chen, Yan Hong, Zizheng Huang, Zhuoer Xu, Zhangxuan Gu, Yaohui Li, Jun Lan, Huijia Zhu, Jianfu Zhang, Weiqiang Wang, and Huaxiong Li. Demamba: Ai-generated video detection on million-scale genvideo benchmark, 2024. arXiv:2405.19707.

[16] Shuhai Zhang, ZiHao Lian, Jiahao Yang, Daiyuan Li, Guoxuan Pang, Feng Liu, Bo Han, Shutao Li, and Mingkui Tan. Physics-driven spatiotemporal modeling for ai-generated video detection. In NeurIPS, 2025.

[17] Weiyun Wang, Min Shi, Qingyun Li, Wenhai Wang, Zhenhang Huang, Linjie Xing, Zhe Chen, Hao Li, Xizhou Zhu, Zhiguo Cao, Yushi Chen, Tong Lu, Jifeng Dai, and Yu Qiao. The all-seeing project: Towards panoptic visual recognition and understanding of the open world, 2023. arXiv:2308.01907.

[18] Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew Owens, and Alexei A. Efros. Cnn-generated images are surprisingly easy to spot... for now. In CVPR, pages 8695–8704, 2020.

[19] Mingjian Zhu, Hanting Chen, Qiangyu Yan, Xudong Huang, Guanyu Lin, Wei Li, Zhijun Tu, Hailin Hu, Jie Hu, and Yunhe Wang. Genimage: A million-scale benchmark for detecting ai-generated image. In NeurIPS, 2023.

[20] Zhengcen Li, Chenyang Jiang, Hang Zhao, Shiyang Zhou, Yunyang Mo, Feng Gao, Fan Yang, Qiben Shan, Shaocong Wu, and Jingyong Su. Preserving forgery artifacts: AI-generated video detection at native scale. In ICLR, 2026. URL https://openreview.net/forum?id=XD43lfRCg6

[21] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. In NeurIPS, 2014.

[22] Diederik P. Kingma and Max Welling. Auto-encoding variational bayes, 2022. arXiv:1312.6114.

[23] Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning, 2018. arXiv:1711.00937.

[24] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020.

[25] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, pages 10684–10695, 2022.

[26] William Peebles and Saining Xie. Scalable diffusion models with transformers. In ICCV, 2023.

[27] Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction. In NeurIPS, 2024.

[28] Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, and Haoqi Fan. Emerging properties in unified multimodal pretraining, 2025. arXiv:2505.14683.

[29] Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, and Yaniv Taigman. Make-a-video: Text-to-video generation without text-video data, 2022. arXiv:2209.14792.

[30] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, Varun Jampani, and Robin Rombach. Stable video diffusion: Scaling latent video diffusion models to large datasets, 2023. arXiv:2311.15127.

[31] WanTeam, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, Ti... Wan: Open and Advanced Large-Scale Video Generative Models, 2025.

[32] [32]

https://klingai.kuaishou.com, 2024

Kuaishou. https://klingai.kuaishou.com, 2024. URLhttps://klingai.kuaishou.com

work page 2024

[33] [33]

Seedance 1.0: Exploring the boundaries of video generation models, 2025

Yu Gao, Haoyuan Guo, Tuyen Hoang, Weilin Huang, Lu Jiang, Fangyuan Kong, Huixia Li, Jiashi Li, Liang Li, Xiaojie Li, Xunsong Li, Yifu Li, Shanchuan Lin, Zhijie Lin, Jiawei Liu, Shu Liu, Xiaonan Nie, Zhiwu Qing, Yuxi Ren, Li Sun, Zhi Tian, Rui Wang, Sen Wang, Guoqiang Wei, Guohong Wu, Jie Wu, Ruiqi Xia, Fei Xiao, Xuefeng Xiao, Jiangqiao Yan, Ceyuan Yang, J...

work page 2025

[34] [34]

Dire for diffusion-generated image detection

Zhendong Wang, Jianmin Bao, Wengang Zhou, Weilun Wang, Hezhen Hu, Hong Chen, and Houqiang Li. Dire for diffusion-generated image detection. InICCV, pages 22445–22455, 2023

work page 2023

[35] [35]

Drct: Diffusion reconstruction contrastive training towards universal detection of diffusion generated images

Baoying Chen, Jishen Zeng, Jianquan Yang, and Rui Yang. Drct: Diffusion reconstruction contrastive training towards universal detection of diffusion generated images. InICML, 2024

work page 2024

[36] [36]

$\bf{d^3}$qe: Learning discrete distribution discrepancy-aware quantization error for autoregressive-generated image detection

Yanran Zhang, Bingyao Yu, Yu Zheng, Wenzhao Zheng, Yueqi Duan, Lei Chen, Jie Zhou, and Jiwen Lu. $\bf{d^3}$qe: Learning discrete distribution discrepancy-aware quantization error for autoregressive-generated image detection. InICCV, 2025

work page 2025

[37] [37]

Forgery-aware adaptive transformer for generalizable synthetic image detection

Huan Liu, Zichang Tan, Chuangchuang Tan, Yunchao Wei, Jingdong Wang, and Yao Zhao. Forgery-aware adaptive transformer for generalizable synthetic image detection. InCVPR, pages 10770–10780, 2024

work page 2024

[38] [38]

Leveraging representations from intermediate encoder-blocks for synthetic image detection

Christos Koutlis and Symeon Papadopoulos. Leveraging representations from intermediate encoder-blocks for synthetic image detection. InECCV, 2024

work page 2024

[39] [39]

C2p-clip: Injecting category common prompt in clip to enhance generalization in deepfake detection

Chuangchuang Tan, Renshuai Tao, Huan Liu, Guanghua Gu, Baoyuan Wu, Yao Zhao, and Yunchao Wei. C2p-clip: Injecting category common prompt in clip to enhance generalization in deepfake detection. InAAAI, volume 39, pages 7184–7192, 2025

work page 2025

[40] [40]

Progressive growing of gans for improved quality, stability, and variation

Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. InICLR, 2018

work page 2018

[41] [41]

Roy-Chowdhury

Rohit Kundu, Hao Xiong, Vishal Mohanty, Athula Balachandran, and Amit K. Roy-Chowdhury. Towards a universal synthetic video detector: From face or background manipulations to fully ai-generated content. InCVPR, 2025

work page 2025

[42] [42]

Genvidbench: A challenging benchmark for detecting ai-generated video, 2024

Zhen-Liang Ni, Qiangyu Yan, Tianning Yuan, Mouxiao Huang, Hailin Hu, Xinghao Chen, and Yunhe Wang. Genvidbench: A challenging benchmark for detecting ai-generated video, 2024

work page 2024

[43] [43]

D3: Training-free ai-generated video detection using second-order features

Chende Zheng, Ruiqi suo, Chenhao Lin, Zhengyu Zhao, Le Yang, Shuai Liu, Minghui Yang, Cong Wang, and Chao Shen. D3: Training-free ai-generated video detection using second-order features. InICCV, 2025

work page 2025

[44] [44]

Vidguard-r1: Ai-generated video detection and explanation via reasoning mllms and rl, 2025

Kyoungjun Park, Yifan Yang, Juheon Yi, Shicheng Zheng, Yifei Shen, Dongqi Han, Caihua Shan, Muhammad Muaz, and Lili Qiu. Vidguard-r1: Ai-generated video detection and explanation via reasoning mllms and rl, 2025. arXiv:2510.02282

work page arXiv 2025

[45] [45]

Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning

Yifei Li, Wenzhao Zheng, Yanran Zhang, Runze Sun, Yu Zheng, Lei Chen, Jie Zhou, and Jiwen Lu. Skyra: Ai-generated video detection via grounded artifact reasoning, 2025. arXiv:2512.15693

work page internal anchor Pith review Pith/arXiv arXiv 2025

[46] [46]

Videoveritas: Ai-generated video detection via perception pretext reinforcement learning, 2026

Hao Tan, Jun Lan, Senyuan Shi, Zichang Tan, Zijian Yu, Huijia Zhu, Weiqiang Wang, Jun Wan, and Zhen Lei. Videoveritas: Ai-generated video detection via perception pretext reinforcement learning, 2026. arXiv:2602.08828. 12

work page arXiv 2026

[47] [47]

Busterx++: Towards unified cross-modal ai-generated content detection and explanation with mllm, 2025

Haiquan Wen, Tianxiao Li, Zhenglin Huang, Yiwei He, and Guangliang Cheng. Busterx++: Towards unified cross-modal ai-generated content detection and explanation with mllm, 2025. arXiv:2507.14632

work page arXiv 2025

[48] [48]

Loki: A comprehensive synthetic data detection benchmark using large multimodal models, 2025

Junyan Ye, Baichuan Zhou, Zilong Huang, Junan Zhang, Tianyi Bai, Hengrui Kang, Jun He, Honglin Lin, Zihao Wang, Tong Wu, Zhizheng Wu, Yiping Chen, Dahua Lin, Conghui He, and Weijia Li. Loki: A comprehensive synthetic data detection benchmark using large multimodal models, 2025. arXiv:2410.09732

work page arXiv 2025

[49] [49]

Ivy-fake: A unified explainable framework and benchmark for image and video aigc detection,

Wayne Zhang, Changjiang Jiang, Zhonghao Zhang, Chenyang Si, Fengchang Yu, and Wei Peng. Ivy-fake: A unified explainable framework and benchmark for image and video aigc detection,

work page

[50] [50]

Aligngemini: Generalizable ai-generated image detection through task-model alignment, 2026

Ruoxin Chen, Jiahui Gao, Kaiqing Lin, Keyue Zhang, Yandan Zhao, Isabel Guan, Taiping Yao, and Shouhong Ding. Aligngemini: Generalizable ai-generated image detection through task-model alignment, 2026

work page 2026

[51] [51]

Aligned datasets improve detection of latent diffusion-generated images

Anirudh Sundara Rajan, Utkarsh Ojha, Jedidiah Schloesser, and Yong Jae Lee. Aligned datasets improve detection of latent diffusion-generated images. InICLR, 2025

work page 2025

[52] [52]

Beyond artifacts: Real-centric envelope modeling for reliable ai-generated image detection, 2025

Ruiqi Liu, Yi Han, Zhengbo Zhang, Liwei Yao, Zhiyuan Yan, Jialiang Shen, ZhiJin Chen, Boyi Sun, Lubin Weng, Jing Dong, Yan Wang, and Shu Wu. Beyond artifacts: Real-centric envelope modeling for reliable ai-generated image detection, 2025. arXiv:2512.20937

work page arXiv 2025

[53] [53]

The Kinetics Human Action Video Dataset

Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijaya- narasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and Andrew Zisserman. The kinetics human action video dataset.arXiv preprint arXiv:1705.06950, pages 1–22, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[54] [54]

Vbench: Comprehensive benchmark suite for video generative models, 2023

Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. Vbench: Comprehensive benchmark suite for video generative models, 2023. arXiv:2311.17982

work page arXiv 2023

[55] [55]

Watch your up-convolution: Cnn based generative deep neural networks are failing to reproduce spectral distributions

Ricard Durall, Margret Keuper, and Janis Keuper. Watch your up-convolution: Cnn based generative deep neural networks are failing to reproduce spectral distributions. InCVPR, 2020

work page 2020

[56] [56]

Secret lies in color: Enhancing ai-generated images detection with color distribution analysis

Zexi Jia, Chuanwei Huang, Yeshuang Zhu, Hongyan Fei, Xiaoyue Duan, Zhiqiang Yuan, Ying Deng, Jiapei Zhang, Jinchao Zhang, and Jie Zhou. Secret lies in color: Enhancing ai-generated images detection with color distribution analysis. InCVPR, pages 13445–13454, 2025

work page 2025

[57] [57]

SAGA: Source Attribution of Generative AI Videos

Rohit Kundu, Vishal Mohanty, Hao Xiong, Shan Jia, Athula Balachandran, and Amit K. Roy- Chowdhury. Saga: Source attribution of generative ai videos, 2025. arXiv:2511.12834

work page internal anchor Pith review Pith/arXiv arXiv 2025

[58] [58]

Learning human-perceived fakeness in ai-generated videos via multimodal llms, 2025

Xingyu Fu, Siyi Liu, Yinuo Xu, Pan Lu, Guangqiuse Hu, Tianbo Yang, Taran Anantasagar, Christopher Shen, Yikai Mao, Yuanzhe Liu, Keyush Shah, Chung Un Lee, Yejin Choi, James Zou, Dan Roth, and Chris Callison-Burch. Learning human-perceived fakeness in ai-generated videos via multimodal llms, 2025. arXiv:2509.22646

work page arXiv 2025

[59] [59]

Supervised contrastive learning

Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. InNeurIPS, volume 33, pages 18661–18673, 2020

work page 2020

[60] [60]

https://pika.art/, 2023

Pika Labs. https://pika.art/, 2023. URLhttps://pika.art/

work page 2023

[61] [61]

Raising the bar of ai-generated image detection with clip

Davide Cozzolino, Giovanni Poggi, Riccardo Corvi, Matthias Nießner, and Luisa Verdoliva. Raising the bar of ai-generated image detection with clip. InCVPRW, 2024

work page 2024

[62] [62]

Real-time deepfake detection in the real-world, 2024

Bar Cavia, Eliahu Horwitz, Tal Reiss, and Yedid Hoshen. Real-time deepfake detection in the real-world, 2024. arXiv:2406.09398

work page arXiv 2024

[63] [63]

Is artificial intelligence generated image detection a solved problem? InNeurIPS, 2025

Ziqiang Li, Jiazhen Yan, Ziwen He, Kai Zeng, Weiwei Jiang, Lizhi Xiong, and Zhangjie Fu. Is artificial intelligence generated image detection a solved problem? InNeurIPS, 2025. 13

work page 2025

[64] Ruiqi Liu, Manni Cui, Ziheng Qin, Zhiyuan Yan, Ruoxin Chen, Yi Han, Zhiheng Li, Junkai Chen, ZhiJin Chen, Kaiqing Lin, Jialiang Shen, Lubin Weng, Jing Dong, Yan Wang, and Shu Wu. Mirror: Manifold ideal reference reconstructor for generalizable AI-generated image detection, 2026. arXiv:2602.02222

[65] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021. arXiv:2103.00020

[66] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Jegou, Julien Mairal, Patrick L... 2023

[67] Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, Julie... arXiv, 2025

Effort [8] is a parameter-efficient method that leverages SVD to construct orthogonal subspaces. This design prevents catastrophic forgetting of pre-trained semantic knowledge while efficiently learning forgery patterns. Our implementation uses their released weights trained on SD1.4 with identical preprocessing.
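As a rough illustration of the SVD-based idea in the Effort description, and not Effort's actual formulation, a pre-trained weight matrix W can be split by a rank-r truncated SVD into a principal part and a residual part that live in mutually orthogonal subspaces. The rank r, the symbol names, and the choice to freeze the principal part while training only the residual are assumptions made for this sketch.

```latex
W = U \Sigma V^{\top}
  = \underbrace{U_{r} \Sigma_{r} V_{r}^{\top}}_{W_{\text{principal}}\ (\text{assumed frozen})}
  + \underbrace{U_{\perp} \Sigma_{\perp} V_{\perp}^{\top}}_{W_{\text{residual}}\ (\text{assumed trainable})},
\qquad
U_{r}^{\top} U_{\perp} = 0, \quad V_{r}^{\top} V_{\perp} = 0 .
```

Because the singular vectors are orthonormal, updates confined to the residual term cannot overwrite the directions spanned by the principal term, which is one way to read the claim about avoiding catastrophic forgetting.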

NPR [4] distinguishes synthetic images by analyzing low-level neighboring pixel relationships, which are characteristic of upsampling patterns in AI-generated content. It implements this approach by training a ResNet-50 model for identification. We use their official checkpoint for evaluation.
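A minimal sketch of one way to expose neighboring pixel relationships left by upsampling: subtract from each pixel the top-left pixel of its 2x2 block, which is equivalent to subtracting a nearest-neighbour "downsample by 2, then upsample by 2" copy of the image. The function name `npr_residual`, the 2x2 block size, and the single-channel list-of-lists input are assumptions for illustration; this is not the official NPR code, and the actual detector feeds such a representation to a ResNet-50.

```python
def npr_residual(img):
    """Return img minus its nearest-neighbour down-then-up reconstruction.

    img: 2D list of numbers (one channel) with even height and width.
    Hypothetical helper for illustration only.
    """
    h, w = len(img), len(img[0])
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            # Anchor is the top-left pixel of the 2x2 block containing (y, x).
            anchor = img[y - y % 2][x - x % 2]
            row.append(img[y][x] - anchor)
        out.append(row)
    return out


if __name__ == "__main__":
    # A flat patch has no local structure, so the residual is all zeros.
    print(npr_residual([[5, 5], [5, 5]]))
    # A patch with local variation keeps only within-block differences.
    print(npr_residual([[1, 2], [3, 4]]))
```

The residual discards image content shared within each block and keeps only within-block differences, which is where periodic upsampling traces would be expected to appear.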

RINE [38] improves AI-generated image detection by mapping intermediate CLIP features to a forgery-aware vector space. We utilize the author-released weights trained on the ...

Table 9: Overview of the evaluation benchmarks. Our evaluation covers 14 benchmarks (4 video benchmarks, 5 image benchmarks, and 5 in-the-wild benchmarks), spanning a diverse range of...

B-Free [10] employs a self-conditioned diffusion reconstruction paradigm to enforce semantic alignment between real and synthetic images, thereby isolating differences to generation artifacts. The method incorporates content augmentation via inpainting and fine-tunes a DINOv2+reg ViT using large crops to preserve forensic signals. We utilize their provi...

CO-SPY [6] enhances semantic and artifact features for robust synthetic image detection. Following the official code, we evaluate the two author-released implementations trained on SD1.4 and ProGAN.

DDA [9] aligns synthetic and real images in both pixel and frequency domains to mitigate spurious correlations and improve detector generalization. For evaluation, we utilize the author-released weights and identical settings, with the sole exception that we forgo applying the same JPEG Q=96 compression on the test sets to ensure a fair comparison with ot...