
arxiv: 2605.26421 · v1 · pith:GHLSUVJ5new · submitted 2026-05-26 · 💻 cs.CV

HydraPrompt: An Adaptive and Asymmetric Framework of Vision-Language Models for Synthetic Image Detection

Pith reviewed 2026-06-29 18:49 UTC · model grok-4.3

classification 💻 cs.CV
keywords synthetic image detection · vision-language models · adaptive prompting · asymmetric framework · contrastive learning · forgery detection · prompt adaptation

The pith

HydraPrompt detects synthetic images by anchoring real content with fixed prompts while using sample-specific adaptive prompts for fakes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes HydraPrompt to overcome the limits of static prompts in vision-language models for synthetic image detection. Static boundaries fail when forgery types vary at inference time, so the method creates an asymmetric setup: one shared prompt set anchors all authentic images as a consistent reference, while each fake image gets its own tailored prompts drawn from its visual details. A conditional supervised contrastive objective then pulls real representations together and spreads out the fine-grained forgery signals. The result is a framework that adjusts category centers dynamically without needing advance knowledge of the forgery method. If the approach holds, detectors could maintain accuracy as new generators appear.

Core claim

HydraPrompt is an asymmetric prompting framework that dynamically adjusts category centers by aligning with fine-grained image cues. An Asymmetric Prompt Adapter fixes one set of prompts for the authentic category as a unified anchor and builds sample-adaptive prompts for the fake category to capture diverse forgery variations. A Conditional Supervised Contrastive objective compacts authentic representations while preserving fine-grained forgery clues, producing state-of-the-art results on standard SID benchmarks.

What carries the argument

Asymmetric Prompt Adapter (APA) that applies one fixed prompt set to real images and per-sample adaptive prompts to fake images, together with the Conditional Supervised Contrastive (CSC) objective.
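
As a rough illustration of the fixed-versus-adaptive split, the sketch below works directly in embedding space. It is not the paper's architecture: the text available here does not specify how the adapter conditions on the image, so the linear `adapter` matrix, the `fake_base` vector, the `scale` value, and all function names are assumptions made for illustration. A real system would presumably pass learnable prompt tokens through the VLM text encoder rather than add vectors.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a non-zero vector to unit length."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, v) for row in m]


def asymmetric_logits(image_feat, real_anchor, fake_base, adapter, scale=10.0):
    """Return (real_logit, fake_logit) for one image feature.

    real_anchor is shared by every image (the fixed side).
    The fake embedding is fake_base plus an image-dependent shift
    (the adaptive side); this additive form is an assumption.
    """
    img = normalize(image_feat)
    real = normalize(real_anchor)
    shift = matvec(adapter, img)  # hypothetical per-sample conditioning
    fake = normalize([b + s for b, s in zip(fake_base, shift)])
    return scale * dot(img, real), scale * dot(img, fake)


def predict_fake_probability(image_feat, real_anchor, fake_base, adapter):
    """Two-way softmax over the real and fake logits."""
    real_logit, fake_logit = asymmetric_logits(
        image_feat, real_anchor, fake_base, adapter
    )
    m = max(real_logit, fake_logit)  # subtract the max for numerical stability
    e_real = math.exp(real_logit - m)
    e_fake = math.exp(fake_logit - m)
    return e_fake / (e_real + e_fake)
```

In this toy form the real logit depends only on the image and the shared anchor, whereas the fake logit changes whenever `adapter` changes, which is the asymmetry the section describes. No forgery-type label enters the computation.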

If this is right

  • Detection remains effective across changing forgery methods because prompts adjust to each sample's cues rather than relying on a single fixed boundary.
  • Real-image representations stay compact while forgery variations receive explicit modeling, increasing separation within the fake class.
  • No prior forgery-type labels are required at test time, allowing deployment on unseen generators.
  • State-of-the-art accuracy is reported on multiple popular synthetic image detection benchmarks.
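
The second consequence in the list above can be made concrete with a toy loss. The sketch below is one possible reading of a "conditional" supervised contrastive objective: only real samples act as anchors and are pulled toward other real samples, while fake samples appear only as negatives and are never pulled toward each other. The paper's actual CSC formulation is not given in the text here, so the conditioning rule, the `temperature` value, and the function names are assumptions.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a non-zero vector to unit length."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]


def conditional_supcon_loss(features, labels, temperature=0.1):
    """Supervised contrastive loss restricted to real anchors.

    labels: 0 = real, 1 = fake.
    Assumed condition: a sample contributes as an anchor only if it is
    real, and its positives are the other real samples in the batch.
    The result is averaged over all (real anchor, real positive) pairs.
    """
    feats = [normalize(f) for f in features]
    n = len(feats)
    total, count = 0.0, 0
    for i in range(n):
        if labels[i] != 0:
            continue  # fake samples are never pulled together
        positives = [j for j in range(n) if j != i and labels[j] == 0]
        if not positives:
            continue
        sims = [dot(feats[i], feats[j]) / temperature for j in range(n) if j != i]
        m = max(sims)
        # log of the softmax denominator over every other sample in the batch
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims))
        for j in positives:
            total += log_denom - dot(feats[i], feats[j]) / temperature
            count += 1
    return total / count if count else 0.0
```

Under this reading the loss is small when real features cluster away from the fakes and large when a fake sits closer to a real anchor than the other reals do. A batch with no real samples contributes zero.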

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same fixed-versus-adaptive split could be tested on other vision-language classification tasks where one class is stable and the other contains high internal diversity.
  • If the adaptive prompts prove robust, training pipelines might shift toward lighter supervision focused only on the real anchor rather than exhaustive fake-type labels.
  • The approach suggests examining whether similar asymmetry helps in related domains such as deepfake video detection or adversarial example identification.

Load-bearing premise

Sample-adaptive prompts for the fake category can be built and aligned to image cues at inference time without knowing the forgery type or using extra supervision.

What would settle it

A performance drop on a benchmark of forgery types absent from training, such that the adaptive prompts separate real from fake no better than a static-prompt baseline.
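
A minimal sketch of such a check, assuming per-subset predictions are already available from both an adaptive-prompt detector and a static-prompt baseline. The subset names, the prediction lists, and the `margin` parameter are hypothetical placeholders, not results from the paper.

```python
def accuracy(preds, labels):
    """Fraction of predictions that match the labels."""
    return sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)


def adaptive_advantage(subsets):
    """Map each subset name to (adaptive accuracy - static accuracy).

    subsets: name -> (labels, adaptive_preds, static_preds)
    """
    return {
        name: accuracy(adaptive, labels) - accuracy(static, labels)
        for name, (labels, adaptive, static) in subsets.items()
    }


def subsets_without_gain(subsets, margin=0.0):
    """Names of unseen subsets where the adaptive detector gains at most `margin`."""
    gaps = adaptive_advantage(subsets)
    return sorted(name for name, gap in gaps.items() if gap <= margin)


# Made-up toy data: 0 = real, 1 = fake.
toy = {
    "unseen_a": ([0, 1, 1, 0], [0, 1, 1, 0], [0, 1, 0, 0]),
    "unseen_b": ([1, 1, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0]),
}
print(subsets_without_gain(toy))
```

Any subset returned by `subsets_without_gain` would count against the load-bearing premise; an empty list across genuinely unseen generators would support it.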

Figures

Figures reproduced from arXiv: 2605.26421 by Ajian Liu, Hao Tan, Jun Wan, Senyuan Shi, Sergio Escalera, Shuhan Feng, Zichang Tan.

Figure 1: (a) t-SNE [45] visualizations of real and fake images on UniversalFakeDetect [49] benchmark. We selected seven subsets from the benchmark. The features are extracted from shallow layers of CLIP ViT-L/14. Real images share a unified distribution while fake images exhibit scattered patterns. (b) Comparison of previous VLM-based strategy and our HydraPrompt. Existing approaches adopt symmetric prompts that pro…
Figure 2: Analyses on the proposed APA and CSC. (a) Effectiveness of APA: …
Figure 3: Overview of the proposed HydraPrompt framework. During training phase, we design sample-adaptive prompts for …
Figure 4: t-SNE of HydraPrompt on unseen subsets from Uni…
Figure 6: (a) Ablation (%) on training loss design, including L_cls, L_sc, L_align. (b) Analysis of hyperparameter.
Original abstract

The rapid evolution of generative models has precipitated a proliferation of fabricated content, posing significant challenges to existing Synthetic Image Detection (SID) methods. Capitalizing on advancements in vision-language models (e.g., CLIP), recent attempts have leveraged learnable textual prompts to identify synthetic images. However, they still leverage static prompt as a fixed boundary for real and fake images, failing to adapt to the varying types of forgery that emerge during inference. To overcome this issue, we propose **HydraPrompt**, an asymmetric prompting framework that dynamically adjusts the category centers by aligning with fine-grained image cues. Specifically, we propose an Asymmetric Prompt Adapter (**APA**): (1) for authentic category, we introduce a single set of prompts to capture the consistent representative patterns, which serves as a unified anchor for real content. While (2) for fake category, we construct sample-adaptive prompts that specialize in capturing diverse cues from different samples, enabling adaptive modeling of forgery image variations. To increase pronounced discriminability within different synthetic images, we further introduce a Conditional Supervised Contrastive (**CSC**) objective, which compacts the authentic representations while capturing fine-grained forgery clues. Extensive experiments on popular SID benchmarks demonstrate the state-of-the-art performance of our framework.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes HydraPrompt, an asymmetric prompting framework for synthetic image detection (SID) with vision-language models. It introduces an Asymmetric Prompt Adapter (APA) that maintains a single fixed set of prompts for the authentic category as a unified anchor while constructing sample-adaptive prompts for the fake category to capture diverse forgery cues. A Conditional Supervised Contrastive (CSC) objective is added to compact authentic representations and enhance discriminability among synthetic images. The framework claims state-of-the-art performance on popular SID benchmarks through dynamic adjustment of category centers aligned with fine-grained image cues.

Significance. If the adaptive mechanism for fake prompts operates at inference without forgery-type labels or extra supervision, the asymmetric design could meaningfully extend static-prompt VLM approaches by handling forgery variation, representing a targeted advance in SID. The explicit separation of fixed real anchors from per-sample fake adaptation, combined with the CSC loss, offers a clear architectural hypothesis worth testing if supported by reproducible results.

major comments (2)
  1. [Abstract] The central claim that APA constructs sample-adaptive prompts for the fake category at inference time to align with fine-grained cues without forgery-type knowledge or additional supervision is load-bearing for the asymmetry and SOTA assertion, yet the abstract supplies no equations, architecture diagram, algorithm, or conditioning mechanism showing how the adapter produces these per-sample prompts from the input image and VLM alone.
  2. [Abstract] The assertion of state-of-the-art performance on popular SID benchmarks is presented without any quantitative results, ablation studies, error analysis, or baseline comparisons, rendering the empirical contribution unverifiable from the provided text and undermining assessment of whether the APA+CSC design delivers the claimed gains.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We address each major comment below and will revise the abstract to better support the central claims while maintaining its concise nature.

Point-by-point responses
  1. Referee: [Abstract] The central claim that APA constructs sample-adaptive prompts for the fake category at inference time to align with fine-grained cues without forgery-type knowledge or additional supervision is load-bearing for the asymmetry and SOTA assertion, yet the abstract supplies no equations, architecture diagram, algorithm, or conditioning mechanism showing how the adapter produces these per-sample prompts from the input image and VLM alone.

    Authors: We agree that the abstract would benefit from a clearer high-level indication of the APA's conditioning mechanism. The full details, including how the adapter derives per-sample fake prompts from image features extracted by the VLM without forgery labels, are provided in Section 3 with accompanying equations and Figure 3. To address the concern directly in the abstract, we will revise it to concisely describe the asymmetric adaptation process and its inference-time operation based solely on the input image and VLM. revision: yes

  2. Referee: [Abstract] The assertion of state-of-the-art performance on popular SID benchmarks is presented without any quantitative results, ablation studies, error analysis, or baseline comparisons, rendering the empirical contribution unverifiable from the provided text and undermining assessment of whether the APA+CSC design delivers the claimed gains.

    Authors: We acknowledge that the current abstract does not include numerical results. While space constraints limit full ablations or error analysis, we will revise the abstract to include key quantitative highlights (e.g., performance margins over strong baselines on primary benchmarks) to better substantiate the SOTA claim. Complete experimental results, ablations, and analyses remain in Sections 4 and 5. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical framework with no derivations or self-referential reductions

Full rationale

The paper presents HydraPrompt as an empirical architecture (APA for asymmetric prompts and CSC objective) whose performance is asserted via benchmark experiments. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The central claims rest on the design of sample-adaptive prompts and contrastive loss rather than any reduction to inputs by construction, satisfying the default expectation of non-circularity for an applied method paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are detailed beyond standard VLM assumptions.

pith-pipeline@v0.9.1-grok · 5767 in / 1018 out tokens · 26631 ms · 2026-06-29T18:49:28.860981+00:00 · methodology


Reference graph

Works this paper leans on

95 extracted references · 21 canonical work pages · 5 internal anchors

  1. [1]

    Andrew Brock, Jeff Donahue, and Karen Simonyan. 2018. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096 (2018)

  2. [2]

    Bar Cavia, Eliahu Horwitz, Tal Reiss, and Yedid Hoshen. 2024. Real-time deepfake detection in the real-world. arXiv preprint arXiv:2406.09398 (2024)

  3. [3]

    George Cazenavette, Avneesh Sud, Thomas Leung, and Ben Usman. 2024. Fakeinversion: Learning to detect images from unseen text-to-image models by inverting stable diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Re...

  4. [4]

    Lucy Chai, David Bau, Ser-Nam Lim, and Phillip Isola. 2020. What makes fake images detectable? understanding properties that generalize. In European conference on computer vision. Springer, 103–120

  5. [5]

    Baoying Chen, Jishen Zeng, Jianquan Yang, and Rui Yang. 2024. Drct: Diffusion reconstruction contrastive training towards universal detection of diffusion generated images. In Forty-first International Conference on Machine Learning

  6. [6]

    JiaXin Chen, Miao Hu, DengYong Zhang, Yun Song, and Xin Liao. 2025. LDR-Net: A Novel Framework for AI-generated Image Detection via Localized Discrepancy Representation. arXiv preprint arXiv:2501.13475 (2025)

  7. [7]

    Ruoxin Chen, Junwei Xi, Zhiyuan Yan, Ke-Yue Zhang, Shuang Wu, Jingyi Xie, Xu Chen, Lei Xu, Isabel Guan, Taiping Yao, et al. 2025. Dual Data Alignment Makes AI-Generated Image Detector Easier Generalizable. arXiv preprint arXiv:2505.14359 (2025)

  8. [8]

    Yize Chen, Zhiyuan Yan, Guangliang Cheng, Kangran Zhao, Siwei Lyu, and Baoyuan Wu. 2024. X2-dfd: A framework for explainable and extendable deepfake detection. arXiv preprint arXiv:2410.06126 (2024)

  9. [9]

    Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. 2018. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE conference on computer vision and pattern recognition. 8789–8797

  10. [10]

    Beilin Chu, Xuan Xu, Xin Wang, Yufei Zhang, Weike You, and Linna Zhou. 2025. Fire: Robust detection of diffusion-generated images via frequency-guided reconstruction error. In Proceedings of the Computer Vision and Pattern Recognition Conference. 12830–12839

  11. [11]

    Casey Chu, Andrey Zhmoginov, and Mark Sandler. 2017. Cyclegan, a master of steganography. arXiv preprint arXiv:1712.02950 (2017)

  12. [12]

    Riccardo Corvi, Davide Cozzolino, Ekta Prashnani, Shalini De Mello, Koki Nagano, and Luisa Verdoliva. 2025. Seeing What Matters: Generalizable AI-generated Video Detection with Forensic-Oriented Augmentation. arXiv preprint arXiv:2506.16802 (2025)

  13. [13]

    Xinjie Cui, Yuezun Li, Ao Luo, Jiaran Zhou, and Junyu Dong. 2025. Forensics adapter: Adapting clip for generalizable face forgery detection. In Proceedings of the Computer Vision and Pattern Recognition Conference. 19207–19217

  14. [14]

    Prafulla Dhariwal and Alexander Nichol. 2021. Diffusion models beat gans on image synthesis. Advances in neural information processing systems 34 (2021), 8780–8794

  15. [15]

    Ricard Durall, Margret Keuper, and Janis Keuper. 2020. Watch your up-convolution: Cnn based generative deep neural networks are failing to reproduce spectral distributions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 7890–7899

  16. [16]

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. 2024. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first international conference on machine learning

  17. [17]

    Joel Frank, Thorsten Eisenhofer, Lea Schönherr, Asja Fischer, Dorothea Kolossa, and Thorsten Holz. 2020. Leveraging frequency analysis for deep fake image recognition. In International conference on machine learning. PMLR, 3247–3258

  18. [18]

    Xinghe Fu, Zhiyuan Yan, Taiping Yao, Shen Chen, and Xi Li. 2025. Exploring unbiased deepfake detection via token-level shuffling and mixing. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 3040–3048

  19. [19]

    Hongchang Gao, Jian Pei, and Heng Huang. 2019. Progan: Network embedding via proximity generative adversarial network. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining. 1308–1316

  20. [20]

    Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. 2024. Clip-adapter: Better vision-language models with feature adapters. International Journal of Computer Vision 132, 2 (2024), 581–595

  21. [21]

    Harshayu Girase, Haiming Gang, Srikanth Malla, Jiachen Li, Akira Kanehara, Karttikeya Mangalam, and Chiho Choi. 2021. Loki: Long term and key intentions for trajectory prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 9803–9812

  22. [22]

    Zhihao Gu, Yang Chen, Taiping Yao, Shouhong Ding, Jilin Li, Feiyue Huang, and Lizhuang Ma. 2021. Spatiotemporal inconsistency learning for deepfake video detection. In Proceedings of the 29th ACM international conference on multimedia. 3473–3481

  23. [23]

    Zhihao Gu, Yang Chen, Taiping Yao, Shouhong Ding, Jilin Li, and Lizhuang Ma. 2022. Delving into the local: Dynamic inconsistency learning for deepfake video detection. In Proceedings of the AAAI conference on artificial intelligence, Vol. 36. 744–752

  24. [24]

    Zhihao Gu, Taiping Yao, Yang Chen, Shouhong Ding, and Lizhuang Ma. 2022. Hierarchical contrastive inconsistency learning for deepfake video detection. In European conference on computer vision. Springer, 596–613

  25. [25]

    Fabrizio Guillaro, Giada Zingarini, Ben Usman, Avneesh Sud, Davide Cozzolino, and Luisa Verdoliva. 2025. A bias-free training paradigm for more general ai-generated image detection. In Proceedings of the Computer Vision and Pattern Recognition Conference. 18685–18694

  26. [26]

    Xiao Guo, Xiufeng Song, Yue Zhang, Xiaohong Liu, and Xiaoming Liu. 2025. Rethinking Vision-Language Model in Face Forensics: Multi-Modal Interpretable Forged Face Detector. In Proceedings of the Computer Vision and Pattern Recognition Conference. 105–116

  27. [27]

    Zhiyuan He, Pin-Yu Chen, and Tsung-Yi Ho. 2024. Rigid: A training-free and model-agnostic framework for robust ai-generated image detection. arXiv preprint arXiv:2405.20112 (2024)

  28. [28]

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. Advances in neural information processing systems 33 (2020), 6840–6851

  29. [29]

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. 2022. Lora: Low-rank adaptation of large language models. ICLR 1, 2 (2022), 3

  30. [30]

    Zhenglin Huang, Jinwei Hu, Xiangtai Li, Yiwei He, Xingyu Zhao, Bei Peng, Baoyuan Wu, Xiaowei Huang, and Guangliang Cheng. 2025. Sida: Social media image deepfake detection, localization and explanation with large multimodal model. In Proceedings of the Computer Vision and Pattern Recognition Conference. 28831–28841

  31. [31]

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. 2021. Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning. PMLR, 4904–4916

  32. [32]

    Yan Ju, Shan Jia, Lipeng Ke, Hongfei Xue, Koki Nagano, and Siwei Lyu. 2022. Fusing global and local features for generalized ai-synthesized image detection. In 2022 IEEE International Conference on Image Processing (ICIP). IEEE, 3465–3469

  33. [33]

    Tero Karras, Samuli Laine, and Timo Aila. 2019. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 4401–4410

  34. [34]

    Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. 2020. Analyzing and Improving the Image Quality of StyleGAN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

  35. [35]

    Hossein Kashiani, Niloufar Alipour Talemi, and Fatemeh Afghah. 2025. FreqDebias: Towards Generalizable Deepfake Detection via Consistency-Driven Frequency Debiasing. In Proceedings of the Computer Vision and Pattern Recognition Conference. 8775–8785

  36. [36]

    Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. 2023. Maple: Multi-modal prompt learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 19113–19122

  37. [37]

    Christos Koutlis and Symeon Papadopoulos. 2024. Leveraging representations from intermediate encoder-blocks for synthetic image detection. In European Conference on Computer Vision. Springer, 394–411

  38. [38]

    Kaiqing Lin, Yuzhen Lin, Weixiang Li, Taiping Yao, and Bin Li. 2025. Standing on the shoulders of giants: Reprogramming visual-language model for general deepfake detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 5262–5270

  39. [39]

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In European conference on computer vision. Springer, 740–755

  40. [40]

    Bo Liu, Fan Yang, Xiuli Bi, Bin Xiao, Weisheng Li, and Xinbo Gao. 2022. Detecting generated images by real images. In European Conference on Computer Vision. Springer, 95–110

  41. [41]

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual instruction tuning. Advances in neural information processing systems 36 (2023), 34892–34916

  42. [42]

    Huan Liu, Zichang Tan, Chuangchuang Tan, Yunchao Wei, Jingdong Wang, and Yao Zhao. 2024. Forgery-aware adaptive transformer for generalizable synthetic image detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10770–10780

  43. [43]

    Zhengzhe Liu, Xiaojuan Qi, and Philip HS Torr. 2020. Global texture enhancement for fake face detection in the wild. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 8060–8069

  44. [44]

    Zhihe Lu, Jiawang Bai, Xin Li, Zeyu Xiao, and Xinchao Wang. 2023. Beyond sole strength: Customized ensembles for generalized vision-language models. arXiv preprint arXiv:2311.17091 (2023)

  45. [45]

    Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of machine learning research 9, Nov (2008), 2579–2605

  46. [46]

    Anant Mehta, Bryant McArthur, Nagarjuna Kolloju, and Zhengzhong Tu. 2025. HFMF: Hierarchical Fusion Meets Multi-Stream Models for Deepfake Detection. In Proceedings of the Winter Conference on Applications of Computer Vision. 724–733

  47. [47]

    Dat Nguyen, Nesryne Mejri, Inder Pal Singh, Polina Kuleshova, Marcella Astrid, Anis Kacem, Enjie Ghorbel, and Djamila Aouada. 2024. Laa-net: Localized artifact attention network for quality-agnostic and generalizable deepfake detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 17395–17405

  48. [48]

    Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. 2021. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741 (2021)

  49. [49]

    Utkarsh Ojha, Yuheng Li, and Yong Jae Lee. 2023. Towards universal fake image detectors that generalize across generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 24480–24489

  50. [50]

    Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. 2019. Gaugan: semantic image synthesis with spatially adaptive normalization. In ACM SIGGRAPH 2019 Real-Time Live! 1–1

  51. [51]

    Yuyang Qian, Guojun Yin, Lu Sheng, Zixuan Chen, and Jing Shao. 2020. Thinking in frequency: Face forgery detection by mining frequency-aware clues. In European conference on computer vision. Springer, 86–103

  52. [52]

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning. PMLR, 8748–8763

  53. [53]

    Anirudh Sundara Rajan, Utkarsh Ojha, Jedidiah Schloesser, and Yong Jae Lee. 2024. Aligned datasets improve detection of latent diffusion-generated images. arXiv preprint arXiv:2410.11835 (2024)

  54. [54]

    Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. 2021. Zero-shot text-to-image generation. In International conference on machine learning. PMLR, 8821–8831

  55. [55]

    Jonas Ricker, Denis Lukovnikov, and Asja Fischer. 2024. Aeroblade: Training-free detection of latent diffusion images using autoencoder reconstruction error. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 9130–9140

  56. [56]

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-Resolution Image Synthesis With Latent Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 10684–10695

  57. [57]

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 10684–10695

  58. [58]

    Andreas Rossler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Nießner. 2019. Faceforensics++: Learning to detect manipulated facial images. In Proceedings of the IEEE/CVF international conference on computer vision. 1–11

  59. [59]

    Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. 2017. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision. 618–626

  60. [60]

    Ke Sun, Shen Chen, Taiping Yao, Ziyin Zhou, Jiayi Ji, Xiaoshuai Sun, Chia-Wen Lin, and Rongrong Ji. 2025. Towards general visual-linguistic face forgery detection. In Proceedings of the Computer Vision and Pattern Recognition Conference. 19576–19586

  61. [61]

    Chuangchuang Tan, Renshuai Tao, Huan Liu, Guanghua Gu, Baoyuan Wu, Yao Zhao, and Yunchao Wei. 2025. C2p-clip: Injecting category common prompt in clip to enhance generalization in deepfake detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 7184–7192

  62. [62]

    Chuangchuang Tan, Yao Zhao, Shikui Wei, Guanghua Gu, Ping Liu, and Yunchao Wei. 2024. Frequency-aware deepfake detection: Improving generalizability through frequency space domain learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 5052–5060

  63. [63]

    Chuangchuang Tan, Yao Zhao, Shikui Wei, Guanghua Gu, Ping Liu, and Yunchao Wei. 2024. Rethinking the up-sampling operations in cnn-based generative network for generalizable deepfake detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 28130–28139

  64. [64]

    Chuangchuang Tan, Yao Zhao, Shikui Wei, Guanghua Gu, and Yunchao Wei. 2023. Learning on gradients: Generalized artifacts representation for gan-generated images detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12105–12114

  65. [65]

    Hao Tan, Jun Lan, Senyuan Shi, Zichang Tan, Zijian Yu, Huijia Zhu, Weiqiang Wang, Jun Wan, and Zhen Lei. 2026. Videoveritas: Ai-generated video detection via perception pretext reinforcement learning. arXiv preprint arXiv:2602.08828 (2026)

  66. [66]

    Hao Tan, Jun Lan, Zichang Tan, Ajian Liu, Chuanbiao Song, Senyuan Shi, Huijia Zhu, Weiqiang Wang, Jun Wan, and Zhen Lei. 2025. Veritas: Generalizable deepfake detection via pattern-aware reasoning. arXiv preprint arXiv:2508.21048 (2025)

  67. [67]

    Hao Tan, Zichang Tan, Jun Li, Ajian Liu, Jun Wan, and Zhen Lei. 2025. Recover and Match: Open-Vocabulary Multi-Label Recognition through Knowledge-Constrained Optimal Transport. In Proceedings of the Computer Vision and Pattern Recognition Conference. 4650–4660

  68. [68]

    Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. 2024. Visual autoregressive modeling: Scalable image generation via next-scale prediction. Advances in neural information processing systems 37 (2024), 84839–84865

  69. [69]

    Chung-Ting Tsai, Ching-Yun Ko, I Chung, Yu-Chiang Frank Wang, Pin-Yu Chen, et al. 2024. Understanding and improving training-free ai-generated image detections with vision foundation models. arXiv preprint arXiv:2411.19117 (2024)

  70. [70]

    Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew Owens, and Alexei A Efros. 2020. CNN-generated images are surprisingly easy to spot... for now. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 8695–8704

  71. [71]

    Tongzhou Wang and Phillip Isola. 2020. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In International conference on machine learning. PMLR, 9929–9939

  72. [72]

    Zhendong Wang, Jianmin Bao, Wengang Zhou, Weilun Wang, Hezhen Hu, Hong Chen, and Houqiang Li. 2023. Dire for diffusion-generated image detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 22445–22455

  73. [73]

    Monika Wysoczańska, Oriane Siméoni, Michaël Ramamonjisoa, Andrei Bursuc, Tomasz Trzciński, and Patrick Pérez. 2024. CLIP-DINOiser: Teaching CLIP a few DINO tricks for open-vocabulary semantic segmentation. In European Conference on Computer Vision. Springer, 320–337

  74. [74]

    Zhipei Xu, Xuanyu Zhang, Runyi Li, Zecheng Tang, Qing Huang, and Jian Zhang. 2024. Fakeshield: Explainable image forgery detection and localization via multi-modal large language models. arXiv preprint arXiv:2410.02761 (2024)

  75. [75]

    Shilin Yan, Ouxiang Li, Jiayin Cai, Yanbin Hao, Xiaolong Jiang, Yao Hu, and Weidi Xie. 2024. A sanity check for ai-generated image detection. arXiv preprint arXiv:2406.19435 (2024)

Reference list truncated.