pith. sign in

arxiv: 2606.08634 · v1 · pith:HLQHOYV5new · submitted 2026-06-07 · 💻 cs.CV

SSAFE: Simple and Strong AI-Generated Image Detection via Frozen Vision Encoders

Pith reviewed 2026-06-27 18:44 UTC · model grok-4.3

classification 💻 cs.CV
keywords AI-generated image detectiondeepfake detectionfrozen encodersmultimodal representationslinear classifierdata curationrobustness to unseen generatorsRealWorldBench
0
0 comments X

The pith

Frozen multimodal encoders already separate real images from AI-generated ones in embedding space, so a linear classifier detects deepfakes without fine-tuning the encoder.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether modern multimodal vision representations already contain signals that distinguish real photographs from synthetic images. It reports that frozen encoders produce embeddings in which real and generated images form distinct clusters. A linear classifier trained on those embeddings reaches strong detection accuracy across benchmarks. The work further shows that a representation-aware curation step can reduce the required training set to 10K images drawn from a small number of representative generators. This compact set yields better robustness to unseen generators and distribution shifts than training sets that are orders of magnitude larger.

Core claim

Frozen multimodal encoders naturally separate real and synthetic images in their embedding space, enabling a simple linear classifier to achieve strong performance without task-specific fine-tuning. A representation-aware curation strategy selects a compact set of 10K images from representative generators; the resulting detector improves robustness to unseen generators and distribution shifts compared with models trained on 288K or 4M images.

What carries the argument

Embedding-space separation produced by frozen multimodal vision encoders, together with representation-aware data curation that selects generators for training.

If this is right

  • A linear probe on frozen multimodal features suffices for competitive detection accuracy.
  • Training data volume can be reduced from millions to 10K images while increasing robustness to new generators.
  • The same frozen encoder plus curated set performs well on a new benchmark of camera photos, stock images, and recent commercial outputs.
  • Distribution shifts between training and test generators become less harmful when the training generators are chosen by representation similarity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the separation holds for encoders trained on different multimodal objectives, authenticity cues may be an emergent property of large-scale image-text pretraining.
  • The curation method could be applied to other detection tasks such as video or audio generation if comparable frozen encoders exist.
  • Detection pipelines might shift from periodic full retraining to periodic re-curation of a small generator set.

Load-bearing premise

The observed separation in embedding space is driven by image authenticity rather than by correlated low-level statistics or generator-specific artifacts.

What would settle it

Measure accuracy of the linear classifier on images produced by a generator architecture absent from the curation set; a substantial drop below reported performance would falsify the claim of generalization.

Figures

Figures reproduced from arXiv: 2606.08634 by Byoungkwon Kim, Jaehyun Nam, Jinwoo Shin, Kyungmin Lee, Seunghyun Lee.

Figure 1
Figure 1. Figure 1: t-SNE visualization of embeddings from two vision [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Distributional differences across generators in Re￾alWorldBench. MMD distances computed in the PE-Core em￾bedding space reveal substantial variation among recent T2I gen￾erators, highlighting the need for generator-diverse curation and evaluation. 3.1. Problem Setup Our goal is to train a classifier that discriminates the real and fake images. Formally, let h : X → [0, 1] be a clas￾sifier that assigns each… view at source ↗
Figure 3
Figure 3. Figure 3: Examples from RealWorldBench. We leverage the T2I human-preference dataset to collect images generated by various commercial models. Additionally, we synthesize more samples using open-source generative models with the same captions as in the preference dataset to preserve the overall image distribution [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Generator-level similarity analysis using PE-Core embeddings. (a) MMD distance matrix between real and synthetic gen￾erators, (b) hierarchical clustering revealing hypercluster structure, and (c) 2D projection of generator centroids. The analysis shows that synthetic generators form three major clusters, from which we select four representative domains (highlighted in blue). For SD3.5, we additionally incl… view at source ↗
Figure 5
Figure 5. Figure 5: Supplemental real-image examples. Representative samples collected from Pexels, Pixabay, and Unsplash. These images provide high-resolution, high-quality object-centric, scene-centric, and portrait-centric real photographs that expand the real-image distri￾bution used during training [PITH_FULL_IMAGE:figures/full_fig_p014_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: RealWorldBench real-image examples. Representative high-resolution real images from (a) modern smartphone cameras and (b) curated high-quality web sources. These images illustrate the distribution shift relative to commonly used datasets such as ImageNet or LAION. This balanced composition ensures that the curated training set captures (1) high intra-real diversity across real-world domains, (2) hyperclust… view at source ↗
Figure 7
Figure 7. Figure 7: MMD structures of multimodally pretrained encoders. PE-Core-G14-448 and SigLIP2-G16-384 exhibit clear separation between real and fake images, with fake samples forming distinct generator-specific clusters. The effect is most pronounced in PE-Core [PITH_FULL_IMAGE:figures/full_fig_p019_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: MMD structures of self-supervised encoders. Unlike multimodally pretrained models, DINOv3-L/16 and DINOv2-Giant do not show meaningful separation between real and fake images, indicating that self-supervised representations do not inherently encode generative-model artifacts [PITH_FULL_IMAGE:figures/full_fig_p020_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Example images from our newly constructed RealWorldBench. To minimize semantic variation across models, all synthetic images are generated using the same prompt (“A beige pastry sitting in a white bowl next to a spoon”). The figure includes real photographs and outputs from 28 commercial and open-source text-to-image models, illustrating the wide stylistic and textural differences that persist even under i… view at source ↗
Figure 10
Figure 10. Figure 10: Example images from our newly constructed RealWorldBench. To minimize semantic variation across models, all syn￾thetic images are generated using the same prompt (“hyperrealism, man and woman, together, made of clay.”). The figure includes real photographs and outputs from 28 commercial and open-source text-to-image models, illustrating the wide stylistic and textural differences that persist even under i… view at source ↗
read the original abstract

The rapid advancement of generative models has blurred the boundary between synthetic and real imagery, creating an urgent need for reliable deepfake detection. Yet most existing approaches rely on massive real--fake datasets, which are increasingly difficult to maintain as new generators continue to emerge. In this work, we investigate how much information about image authenticity is already encoded in modern multimodal vision representations. We find that frozen multimodal encoders naturally separate real and synthetic images in their embedding space, enabling a simple linear classifier to achieve strong performance without task-specific fine-tuning. Motivated by this observation, we develop a representation-aware data curation strategy that selects a compact set of representative generators for training. The resulting training set contains only 10K images, compared to 288K in AIGIBench and 4M in OpenFake, while improving robustness to unseen generators and distribution shifts. We additionally introduce RealWorldBench, a benchmark consisting of modern camera photographs, contemporary stock images, and outputs from recent commercial generators. Experiments across multiple benchmarks show that combining frozen multimodal representations with carefully curated training data provides a simple and effective approach to AI-generated image detection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that frozen multimodal vision encoders naturally separate real and synthetic images in embedding space, allowing a simple linear classifier (trained on a representation-aware curated set of only 10K images) to achieve strong AI-generated image detection performance and robustness to unseen generators/distribution shifts without task-specific fine-tuning. It introduces RealWorldBench (modern camera photos, stock images, recent commercial generators) and reports superior results versus methods trained on AIGIBench (288K) or OpenFake (4M).

Significance. If the central empirical claim holds and the separation is shown to reflect authenticity rather than generator artifacts, the result would be significant: it demonstrates that existing multimodal representations already encode useful authenticity signals, enabling a parameter-light, data-efficient detector that generalizes better than large-scale supervised approaches. The small curated training set and new benchmark are practical contributions.

major comments (3)
  1. [§4, §3.2] §4 (Experiments) and §3.2 (representation-aware curation): the central claim that embeddings separate images 'because of authenticity' rather than correlated low-level statistics (frequency spectra, compression, upsampling) is load-bearing but unsupported by any ablation that controls for these factors (e.g., no tests on real images with synthetic-like artifacts or vice versa). Without such controls, performance on RealWorldBench and unseen generators could be explained by generator-specific cues rather than the claimed natural separation.
  2. [Table 1, §4.3] Table 1 / §4.3 (comparison to AIGIBench/OpenFake): the reported robustness gains are presented without error bars, multiple random seeds, or statistical tests; given that the linear probe has free parameters, it is unclear whether the 10K-set advantage is reliable or could be matched by larger sets with proper regularization.
  3. [§3.3] §3.3 (generator selection): the curation procedure that yields the 10K set is described at a high level; the manuscript must specify the exact similarity metric, clustering method, and how 'representative' is quantified so that the generalization claim can be reproduced or falsified.
minor comments (2)
  1. [Abstract, §1] Abstract and §1: several quantitative claims ('strong performance', 'improving robustness') appear before any numbers or tables; move at least one key metric (e.g., AUC on RealWorldBench) into the abstract for immediate verifiability.
  2. Notation: the linear classifier is referred to interchangeably as 'probe' and 'classifier'; standardize terminology and explicitly state its input dimensionality and training objective (e.g., logistic regression or SVM).

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment point-by-point below, acknowledging where the manuscript can be strengthened through revisions while defending the core empirical claims on the basis of the presented evidence.

read point-by-point responses
  1. Referee: [§4, §3.2] the central claim that embeddings separate images 'because of authenticity' rather than correlated low-level statistics (frequency spectra, compression, upsampling) is load-bearing but unsupported by any ablation that controls for these factors. Without such controls, performance could be explained by generator-specific cues rather than natural separation.

    Authors: We agree that explicit controls isolating authenticity signals from low-level artifacts would strengthen the interpretation. The manuscript relies on the fact that the encoders are frozen multimodal models trained on broad real-world data, and the representation-aware curation operates in embedding space rather than pixel space. However, we acknowledge the absence of targeted ablations (e.g., artifact injection on real images). In revision we will add a controlled experiment applying frequency and compression perturbations to real images and measuring classifier degradation to test whether performance tracks authenticity or low-level cues. revision: yes

  2. Referee: [Table 1, §4.3] the reported robustness gains are presented without error bars, multiple random seeds, or statistical tests; it is unclear whether the 10K-set advantage is reliable or could be matched by larger sets with proper regularization.

    Authors: We accept that variance estimates are needed for rigorous comparison. The linear probe training protocol was deterministic in the reported runs, but we will re-execute all Table 1 experiments across five random seeds, report mean and standard deviation, and add a brief statistical comparison (paired t-test) between the 10K curated set and the larger baselines. This will clarify whether the observed gains are stable. revision: yes

  3. Referee: [§3.3] the curation procedure that yields the 10K set is described at a high level; the manuscript must specify the exact similarity metric, clustering method, and how 'representative' is quantified so that the generalization claim can be reproduced or falsified.

    Authors: We agree the description is insufficiently precise. In the revised §3.3 we will state: cosine similarity on the frozen encoder embeddings, k-means clustering with k equal to the number of generator families, and selection of the sample nearest each cluster centroid as the representative. These details will be accompanied by pseudocode to enable exact reproduction. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical observation of embedding separation followed by linear probe training

full rationale

The paper presents an empirical finding that frozen multimodal encoders separate real and synthetic images, followed by training a linear classifier on a curated 10K-image set. No equations, derivations, or first-principles claims are present in the abstract or described approach. The central claim rests on observed performance across benchmarks rather than any self-definitional mapping, fitted parameter renamed as prediction, or load-bearing self-citation chain. The representation-aware curation and RealWorldBench are presented as practical choices, not reductions to prior fitted results. This is a standard empirical pipeline with no load-bearing step that reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that multimodal embeddings encode authenticity signals independently of training data specifics; no free parameters or invented entities are named in the abstract.

free parameters (1)
  • linear classifier parameters
    Weights of the linear classifier are fitted to the curated 10K images.
axioms (1)
  • domain assumption Frozen multimodal encoders encode information about image authenticity in their embedding space
    Stated as the key investigation finding that motivates the linear classifier approach.

pith-pipeline@v0.9.1-grok · 5740 in / 1174 out tokens · 28852 ms · 2026-06-27T18:44:24.394088+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

58 extracted references · 11 linked inside Pith

  1. [1]

    Qwen-vl: A versatile vision-language model for un- derstanding, localization, text reading, and beyond, 2023

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for un- derstanding, localization, text reading, and beyond, 2023. 2

  2. [2]

    FLUX.1-dev.https : / / huggingface

    Black Forest Labs. FLUX.1-dev.https : / / huggingface . co / black - forest - labs / FLUX . 1-dev, 2024. Accessed: 2024-12-01. 1, 2

  3. [3]

    Perception encoder: The best visual embeddings are not at the output of the net- work.arXiv preprint arXiv:2504.13181, 2025

    Daniel Bolya, Po-Yao Huang, Peize Sun, Jang Hyun Cho, Andrea Madotto, Chen Wei, Tengyu Ma, Jiale Zhi, Jathushan Rajasegaran, Hanoona Rasheed, et al. Perception encoder: The best visual embeddings are not at the output of the net- work.arXiv preprint arXiv:2504.13181, 2025. 2, 1

  4. [4]

    Large scale gan training for high fidelity natural image synthesis

    Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018. 1, 2

  5. [5]

    Emerg- ing properties in self-supervised vision transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Herv ´e J´egou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerg- ing properties in self-supervised vision transformers. InPro- ceedings of the IEEE/CVF international conference on com- puter vision, pages 9650–9660, 2021. 2

  6. [6]

    Real-time deepfake detection in the real-world.arXiv preprint arXiv:2406.09398, 2024

    Bar Cavia, Eliahu Horwitz, Tal Reiss, and Yedid Hoshen. Real-time deepfake detection in the real-world.arXiv preprint arXiv:2406.09398, 2024. 5

  7. [7]

    Drct: Diffusion reconstruction contrastive training towards universal detection of diffusion generated images

    Baoying Chen, Jishen Zeng, Jianquan Yang, and Rui Yang. Drct: Diffusion reconstruction contrastive training towards universal detection of diffusion generated images. InForty- first International Conference on Machine Learning, 2024. 1

  8. [8]

    Gemini 2.5 flash-image.https:// aistudio.google.com/models/gemini- 2- 5- flash-image, 2025

    Google DeepMind. Gemini 2.5 flash-image.https:// aistudio.google.com/models/gemini- 2- 5- flash-image, 2025. Accessed: 2025-11-14. 1

  9. [9]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009. 1

  10. [10]

    Diffusion models beat gans on image synthesis.Advances in neural informa- tion processing systems, 34:8780–8794, 2021

    Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis.Advances in neural informa- tion processing systems, 34:8780–8794, 2021. 2

  11. [11]

    Scaling recti- fied flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas M ¨uller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling recti- fied flow transformers for high-resolution image synthesis. InForty-first international conference on machine learning,

  12. [12]

    Humanrefiner: Benchmarking abnormal human generation and refining with coarse-to-fine pose-reversible guidance

    Guian Fang, Wenbiao Yan, Yuanfan Guo, Jianhua Han, Zu- tao Jiang, Hang Xu, Shengcai Liao, and Xiaodan Liang. Humanrefiner: Benchmarking abnormal human generation and refining with coarse-to-fine pose-reversible guidance. In European Conference on Computer Vision, pages 201–217. Springer, 2024. 2

  13. [13]

    Imagen 3.https://deepmind

    Google DeepMind. Imagen 3.https://deepmind. google/technologies/imagen-3, 2024. Accessed: 2024-12-01. 1

  14. [14]

    Vec- tor quantized diffusion model for text-to-image synthesis

    Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vec- tor quantized diffusion model for text-to-image synthesis. In Proceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 10696–10706, 2022. 2

  15. [15]

    Progressive growing of GANs for improved quality, stabil- ity, and variation

    Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stabil- ity, and variation. InInternational Conference on Learning Representations, 2018. 1, 2

  16. [16]

    Improving synthetic image detection to- wards generalization: An image transformation perspective

    Ouxiang Li, Jiayin Cai, Yanbin Hao, Xiaolong Jiang, Yao Hu, and Fuli Feng. Improving synthetic image detection to- wards generalization: An image transformation perspective. arXiv preprint arXiv:2408.06741, 2024. 1, 5

  17. [17]

    Photomaker: Customizing re- alistic human photos via stacked id embedding

    Zhen Li, Mingdeng Cao, Xintao Wang, Zhongang Qi, Ming- Ming Cheng, and Ying Shan. Photomaker: Customizing re- alistic human photos via stacked id embedding. InProceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8640–8650, 2024. 1

  18. [18]

    Is artificial intelligence gen- erated image detection a solved problem?arXiv preprint arXiv:2505.12335, 2025

    Ziqiang Li, Jiazhen Yan, Ziwen He, Kai Zeng, Weiwei Jiang, Lizhi Xiong, and Zhangjie Fu. Is artificial intelligence gen- erated image detection a solved problem?arXiv preprint arXiv:2505.12335, 2025. 1, 2, 4

  19. [19]

    Ferretnet: Efficient synthetic image detection via local pixel dependencies.arXiv preprint arXiv:2509.20890, 2025

    Shuqiao Liang, Jian Liu, Renzhang Chen, and Quanlong Guan. Ferretnet: Efficient synthetic image detection via local pixel dependencies.arXiv preprint arXiv:2509.20890, 2025. 1, 5

  20. [20]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll´ar, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014. 1

  21. [21]

    Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023. 2

  22. [22]

    Global tex- ture enhancement for fake face detection in the wild

    Zhengzhe Liu, Xiaojuan Qi, and Philip HS Torr. Global tex- ture enhancement for fake face detection in the wild. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8060–8069, 2020. 5

  23. [23]

    Openfake: An open dataset and platform toward real-world deepfake detec- tion.arXiv preprint arXiv:2509.09495, 2025

    Victor Livernoche, Akshatha Arodi, Andreea Musulan, Zachary Yang, Adam Salvail, Ga´etan Marceau Caron, Jean- Franc ¸ois Godbout, and Reihaneh Rabbany. Openfake: An open dataset and platform toward real-world deepfake detec- tion.arXiv preprint arXiv:2509.09495, 2025. 1, 2, 4, 6

  24. [24]

    Seeing is not always believing: Benchmarking human and model perception of ai-generated images.Advances in neural information processing systems, 36:25435–25447, 2023

    Zeyu Lu, Di Huang, Lei Bai, Jingjing Qu, Chengyue Wu, Xi- hui Liu, and Wanli Ouyang. Seeing is not always believing: Benchmarking human and model perception of ai-generated images.Advances in neural information processing systems, 36:25435–25447, 2023. 1

  25. [25]

    Lcm-lora: A universal stable-diffusion ac- celeration module.arXiv preprint arXiv:2311.05556, 2023

    Simian Luo, Yiqin Tan, Suraj Patil, Daniel Gu, Patrick V on Platen, Apolin˜AArio Passos, Longbo Huang, Jian Li, and Hang Zhao. Lcm-lora: A universal stable-diffusion ac- celeration module.arXiv preprint arXiv:2311.05556, 2023. 2

  26. [26]

    Midjourney V6.1.https://www

    Midjourney Team. Midjourney V6.1.https://www. midjourney.com/, 2024. Accessed: 2024-12-01. 1

  27. [27]

    Glide: Towards photorealistic image generation and editing with text-guided diffusion models.arXiv preprint arXiv:2112.10741, 2021

    Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models.arXiv preprint arXiv:2112.10741, 2021. 1, 2

  28. [28]

    Towards uni- versal fake image detectors that generalize across genera- tive models

    Utkarsh Ojha, Yuheng Li, and Yong Jae Lee. Towards uni- versal fake image detectors that generalize across genera- tive models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24480– 24489, 2023. 1, 2, 5

  29. [29]

    Dall–e 3.https://dalle3.ai/, 2024

    OpenAI Team. Dall–e 3.https://dalle3.ai/, 2024. Accessed: 2024-12-01. 1

  30. [30]

    Dinov2: Learning robust visual features without supervision

    Maxime Oquab, Timoth ´ee Darcet, Th ´eo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023. 2

  31. [31]

    Sdxl: Improving latent diffusion mod- els for high-resolution image synthesis.arXiv preprint arXiv:2307.01952, 2023

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas M ¨uller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion mod- els for high-resolution image synthesis.arXiv preprint arXiv:2307.01952, 2023. 2

  32. [32]

    Learning transferable visual models from natural language supervi- sion

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 2

  33. [33]

    Recraft-v3-24-7-25 text-to-image human pref- erence dataset.https : / / huggingface

    Rapidata. Recraft-v3-24-7-25 text-to-image human pref- erence dataset.https : / / huggingface . co / datasets / Rapidata / Recraft - v3 - 24 - 7 - 25 _ t2i_human_preference, 2025. 3, 4

  34. [34]

    Kandinsky: an improved text-to-image syn- thesis with image prior and latent diffusion.arXiv preprint arXiv:2310.03502, 2023

    Anton Razzhigaev, Arseniy Shakhmatov, Anastasia Malt- seva, Vladimir Arkhipkin, Igor Pavlov, Ilya Ryabov, An- gelina Kuts, Alexander Panchenko, Andrey Kuznetsov, and Denis Dimitrov. Kandinsky: an improved text-to-image syn- thesis with image prior and latent diffusion.arXiv preprint arXiv:2310.03502, 2023. 2

  35. [35]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 1, 2

  36. [36]

    Faceforen- sics++: Learning to detect manipulated facial images

    Andreas Rossler, Davide Cozzolino, Luisa Verdoliva, Chris- tian Riess, Justus Thies, and Matthias Nießner. Faceforen- sics++: Learning to detect manipulated facial images. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1–11, 2019. 1

  37. [37]

    Dinov3.arXiv preprint arXiv:2508.10104, 2025

    Oriane Sim ´eoni, Huy V V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Micha ¨el Ramamonjisoa, et al. Dinov3.arXiv preprint arXiv:2508.10104, 2025. 2, 1

  38. [38]

    Learning on gradients: Generalized arti- facts representation for gan-generated images detection

    Chuangchuang Tan, Yao Zhao, Shikui Wei, Guanghua Gu, and Yunchao Wei. Learning on gradients: Generalized arti- facts representation for gan-generated images detection. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 12105–12114, 2023. 2, 5

  39. [39]

    Frequency-aware deepfake de- tection: Improving generalizability through frequency space domain learning

    Chuangchuang Tan, Yao Zhao, Shikui Wei, Guanghua Gu, Ping Liu, and Yunchao Wei. Frequency-aware deepfake de- tection: Improving generalizability through frequency space domain learning. InProceedings of the AAAI Conference on Artificial Intelligence, pages 5052–5060, 2024. 2, 5

  40. [40]

    Rethinking the up-sampling op- erations in cnn-based generative network for generalizable deepfake detection

    Chuangchuang Tan, Yao Zhao, Shikui Wei, Guanghua Gu, Ping Liu, and Yunchao Wei. Rethinking the up-sampling op- erations in cnn-based generative network for generalizable deepfake detection. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 28130–28139, 2024. 5

  41. [41]

    C2p-clip: Inject- ing category common prompt in clip to enhance generaliza- tion in deepfake detection

    Chuangchuang Tan, Renshuai Tao, Huan Liu, Guanghua Gu, Baoyuan Wu, Yao Zhao, and Yunchao Wei. C2p-clip: Inject- ing category common prompt in clip to enhance generaliza- tion in deepfake detection. InProceedings of the AAAI Con- ference on Artificial Intelligence, pages 7184–7192, 2025. 1, 2, 5

  42. [42]

    Renshuai Tao, Chuangchuang Tan, Huan Liu, Jiakai Wang, Haotong Qin, Yakun Chang, Wei Wang, Rongrong Ni, and Yao Zhao. Sagnet: Decoupling semantic-agnostic artifacts from limited training data for robust generalization in deep- fake detection.IEEE Transactions on Information Forensics and Security, 2025. 2

  43. [43]

    Siglip 2: Multilingual vision-language en- coders with improved semantic understanding, localization, and dense features.arXiv preprint arXiv:2502.14786, 2025

    Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muham- mad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language en- coders with improved semantic understanding, localization, and dense features.arXiv preprint arXiv:2502.14786, 2025. 2, 1

  44. [44]

    Lota: Bit-planes guided ai-generated image de- tection

    Hongsong Wang, Renxi Cheng, Yang Zhang, Chaolei Han, and Jie Gui. Lota: Bit-planes guided ai-generated image de- tection. InProceedings of the IEEE/CVF International Con- ference on Computer Vision, pages 17246–17255, 2025. 1, 5

  45. [45]

    Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024. 2

  46. [46]

    Instantid: Zero-shot identity-preserving generation in seconds.arXiv preprint arXiv:2401.07519, 2024

    Qixun Wang, Xu Bai, Haofan Wang, Zekui Qin, Anthony Chen, Huaxia Li, Xu Tang, and Yao Hu. Instantid: Zero-shot identity-preserving generation in seconds.arXiv preprint arXiv:2401.07519, 2024. 1

  47. [47]

    Cnn-generated images are surprisingly easy to spot

    Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew Owens, and Alexei A Efros. Cnn-generated images are surprisingly easy to spot... for now. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8695–8704, 2020. 1, 2, 5

  48. [48]

    Wukong — model zoo, mindspore xihe plat- form.https://xihe.mindspore.cn/modelzoo/ wukong, 2025

    Wukong. Wukong — model zoo, mindspore xihe plat- form.https://xihe.mindspore.cn/modelzoo/ wukong, 2025. Accessed: 2025-11-10. 2

  49. [49]

    Gener- alizable deepfake detection via effective local-global feature extraction.arXiv preprint arXiv:2501.15253, 2025

    Jiazhen Yan, Ziqiang Li, Ziwen He, and Zhangjie Fu. Gener- alizable deepfake detection via effective local-global feature extraction.arXiv preprint arXiv:2501.15253, 2025. 5

  50. [50]

    A sanity check for ai-generated image detection.arXiv preprint arXiv:2406.19435, 2024

    Shilin Yan, Ouxiang Li, Jiayin Cai, Yanbin Hao, Xi- aolong Jiang, Yao Hu, and Weidi Xie. A sanity check for ai-generated image detection.arXiv preprint arXiv:2406.19435, 2024. 1, 2, 5, 6

  51. [51]

    Orthogonal subspace decompo- sition for generalizable ai-generated image detection.arXiv preprint arXiv:2411.15633, 2024

    Zhiyuan Yan, Jiangming Wang, Peng Jin, Ke-Yue Zhang, Chengchun Liu, Shen Chen, Taiping Yao, Shouhong Ding, Baoyuan Wu, and Li Yuan. Orthogonal subspace decompo- sition for generalizable ai-generated image detection.arXiv preprint arXiv:2411.15633, 2024. 1, 5

  52. [52]

    Dˆ 3: Scaling up deepfake detection by learning from discrepancy

    Yongqi Yang, Zhihao Qian, Ye Zhu, Olga Russakovsky, and Yu Wu. Dˆ 3: Scaling up deepfake detection by learning from discrepancy. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 23850–23859,

  53. [53]

    Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop.arXiv preprint arXiv:1506.03365, 2015

    Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop.arXiv preprint arXiv:1506.03365, 2015. 1

  54. [54]

    Sigmoid loss for language image pre-training

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. InProceedings of the IEEE/CVF international conference on computer vision, pages 11975–11986, 2023. 2

  55. [55]

    Towards universal ai-generated image detec- tion by variational information bottleneck network

    Haifeng Zhang, Qinghui He, Xiuli Bi, Weisheng Li, Bo Liu, and Bin Xiao. Towards universal ai-generated image detec- tion by variational information bottleneck network. InPro- ceedings of the Computer Vision and Pattern Recognition Conference, pages 23828–23837, 2025. 5

  56. [56]

    Adding conditional control to text-to-image diffusion models

    Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF international conference on computer vision, pages 3836–3847, 2023. 2

  57. [57]

    Aigi-holmes: Towards explainable and gener- alizable ai-generated image detection via multimodal large language models.arXiv preprint arXiv:2507.02664, 2025

    Ziyin Zhou, Yunpeng Luo, Yuanchen Wu, Ke Sun, Jiayi Ji, Ke Yan, Shouhong Ding, Xiaoshuai Sun, Yunsheng Wu, and Rongrong Ji. Aigi-holmes: Towards explainable and gener- alizable ai-generated image detection via multimodal large language models.arXiv preprint arXiv:2507.02664, 2025. 1, 2, 4, 5

  58. [58]

    A beige pastry sitting in a white bowl next to a spoon

    Mingjian Zhu, Hanting Chen, Qiangyu Yan, Xudong Huang, Guanyu Lin, Wei Li, Zhijun Tu, Hailin Hu, Jie Hu, and Yunhe Wang. Genimage: A million-scale benchmark for de- tecting ai-generated image.Advances in Neural Information Processing Systems, 36:77771–77782, 2023. 1 SSAFE: Simple and Strong AI-Generated Image Detection via Frozen Vision Encoders Supplemen...