
arxiv: 2503.21210 · v4 · submitted 2025-03-27 · 💻 cs.CV

Toward Generalizable Forgery Detection and Reasoning

Pith reviewed 2026-05-22 22:23 UTC · model grok-4.3

classification 💻 cs.CV
keywords forgery detection · AI-generated images · multi-modal large language models · forgery reasoning · generalization · image forensics · deepfake detection

The pith

FakeReasoning guides MLLMs to detect AI image forgeries and reason about their attributes by fusing CLIP semantics with DINO artifact maps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper frames detection and explanation of AI-generated images as one unified task that multi-modal large language models can solve through reliable reasoning over forgery attributes. It releases a dataset of 120K images from ten generators paired with 378K attribute annotations to support training and evaluation. The method adds a dual-branch encoder, a fusion module that routes DINO attention to forgery clues, and a mapper that ties language output to detection scores. Experiments indicate stronger cross-generator performance than prior detectors on both accuracy and explanation quality. This setup matters because every pixel in these images is synthesized, so standard saliency maps fail to isolate the relevant signals.

Core claim

A forgery-aware feature fusion module that injects DINO attention maps into an MLLM via cross-attention, together with a dual CLIP-DINO visual encoder and a classification probability mapper, allows the model to produce both accurate forgery detections and attribute-level reasoning that generalizes across generative models.

What carries the argument

Forgery-Aware Feature Fusion Module, which uses DINO attention maps and cross-attention to direct the MLLM to synthesis artifacts.
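
The summary above names the mechanism but gives no equations, so the following is only a minimal sketch of one way a DINO attention map could steer cross-attention, not the authors' implementation. It assumes CLIP tokens act as queries, DINO patch tokens act as keys and values, and the DINO attention map enters as an additive log-bias on the attention logits. The function name `attention_guided_fusion`, the residual connection, and the absence of learned projections are all illustrative assumptions.

```python
import math


def _softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_guided_fusion(clip_tokens, dino_tokens, dino_attention, eps=1e-8):
    """Hypothetical fusion step (assumption, not the paper's formulation).

    clip_tokens:    list of d-dimensional query vectors (semantic branch)
    dino_tokens:    list of d-dimensional key/value vectors (artifact branch)
    dino_attention: one non-negative saliency weight per DINO token
    """
    dim = len(clip_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for query in clip_tokens:
        # Scaled dot-product logits, biased toward patches DINO attends to.
        logits = [
            scale * _dot(query, key) + math.log(weight + eps)
            for key, weight in zip(dino_tokens, dino_attention)
        ]
        weights = _softmax(logits)
        context = [
            sum(w * value[i] for w, value in zip(weights, dino_tokens))
            for i in range(dim)
        ]
        # Residual connection keeps the original semantic token.
        fused.append([q + c for q, c in zip(query, context)])
    return fused
```

With a uniform attention map the log-bias is constant and the sketch reduces to plain cross-attention; a map concentrated on one patch makes that patch dominate the context vector.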

If this is right

  • The unified FDR-Task formulation lets one model handle both binary detection and attribute-level explanation without separate saliency heads.
  • Coupling language modeling with a classification probability mapper improves detection scores beyond language-only prompting.
  • Training on the MMFR-Dataset yields measurable gains over prior methods on both in-distribution and cross-model test sets.
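
The second point can be illustrated with a small sketch of how a language-model output might be turned into a detection score. This is an assumption for illustration only: the page does not describe the internals of the Classification Probability Mapper, and the token names `"fake"` and `"real"`, the function names, and the 0.5 threshold are hypothetical.

```python
import math


def fake_probability(answer_logits, fake_token="fake", real_token="real"):
    """Two-way softmax over the logits of the two answer tokens.

    answer_logits: dict mapping token string to its logit at the answer
    position (hypothetical interface, not the paper's mapper).
    """
    fake_logit = answer_logits[fake_token]
    real_logit = answer_logits[real_token]
    peak = max(fake_logit, real_logit)  # subtract the max for stability
    fake_mass = math.exp(fake_logit - peak)
    real_mass = math.exp(real_logit - peak)
    return fake_mass / (fake_mass + real_mass)


def detect(answer_logits, threshold=0.5):
    """Binary decision derived from the mapped probability."""
    return fake_probability(answer_logits) >= threshold
```

A continuous score of this kind can be thresholded or ranked, which a free-text answer alone does not allow.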

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same attention-guided fusion idea could be tried on video or audio generators where low-level artifact maps are also available.
  • If DINO maps prove generator-specific rather than universal, future work might need to learn generator-agnostic artifact detectors first.

Load-bearing premise

DINO attention maps consistently mark the synthesis artifacts that differ across generators so the fusion module can steer the MLLM to the right clues.

What would settle it

Test the trained model on images produced by an eleventh generator never seen in the 120K-image dataset; large drops in detection accuracy or reasoning quality would falsify the generalization claim.
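
A held-out-generator split of this kind could be set up roughly as follows. This is a sketch under stated assumptions: the record layout (a dict with a `generator` field that is `None` for real images), the function name `held_out_split`, and the rule that sends every n-th real image to the test set are all hypothetical and not taken from the paper or its dataset.

```python
def held_out_split(records, held_out_generator, real_test_every=5):
    """Split so that no image from the held-out generator reaches training.

    records: list of dicts with keys "id" and "generator"
             ("generator" is None for real images; assumed layout).
    """
    train, test = [], []
    real_seen = 0
    for record in records:
        generator = record["generator"]
        if generator is None:
            # Real images: deterministically route every n-th one to test
            # so the test set contains both classes.
            if real_seen % real_test_every == 0:
                test.append(record)
            else:
                train.append(record)
            real_seen += 1
        elif generator == held_out_generator:
            test.append(record)
        else:
            train.append(record)
    return train, test
```

Comparing accuracy on this test set with accuracy on an image-level split of the training generators gives the size of the drop that the check above is looking for.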

Figures

Figures reproduced from arXiv: 2503.21210 by Bingyao Yu, Dongliang Chang, Haotian Qin, Kongming Liang, Lei Chen, Muxi Diao, Yueying Gao, Zhanyu Ma.

Figure 1: Illustration of the FDR-Task. Different from traditional forgery detection, the FDR-Task leverages MLLMs to perform accurate detection through
Figure 2: Construction pipeline of the MMFR-Dataset. GPT-4o is tasked with caption generation and forgery interpretation. For the forgery interpretation task,
Figure 3: Statistics of the MMFR-Dataset. (a) Text length distribution of caption and reasoning stages; (b) Attributes distribution of real and fake images.
Figure 4: The pipeline of FakeReasoning. FakeReasoning adopts a dual-branch visual encoder combining CLIP and DINO to extract both high-level and low-level
Figure 5: (a) Evaluation of the FDR-Task on LOKI benchmark. (b) Detection evaluation on LOKI benchmark. (c) Ablation study on the layers.
Figure 6: Visual examples in the ablation study on forgery reasoning task. Only reasoning and conclusion stages of our method is presented. With the forgery
Figure 7: Visualization results of FakeReasoning. Discriminate reasoning clues are bolded and corresponding regions are highlighted with dashed boxes.
Original abstract

Accurate and interpretable detection of AI-generated images is essential for mitigating risks associated with AI misuse. However, the substantial domain gap among generative models makes it challenging to develop a generalizable forgery detection model. Moreover, since every pixel in an AI-generated image is synthesized, traditional saliency-based forgery explanation methods are not well suited for this task. To address these challenges, we formulate detection and explanation as a unified Forgery Detection and Reasoning task (FDR-Task), leveraging Multi-Modal Large Language Models (MLLMs) to provide accurate detection through reliable reasoning over forgery attributes. To facilitate this task, we introduce the Multi-Modal Forgery Reasoning dataset (MMFR-Dataset), a large-scale dataset containing 120K images across 10 generative models, with 378K reasoning annotations on forgery attributes, enabling comprehensive evaluation of the FDR-Task. Furthermore, we propose FakeReasoning, a forgery detection and reasoning framework with three key components: 1) a dual-branch visual encoder that integrates CLIP and DINO to capture both high-level semantics and low-level artifacts; 2) a Forgery-Aware Feature Fusion Module that leverages DINO's attention maps and cross-attention mechanisms to guide MLLMs toward forgery-related clues; 3) a Classification Probability Mapper that couples language modeling and forgery detection, enhancing overall performance. Experiments across multiple generative models demonstrate that FakeReasoning not only achieves robust generalization but also outperforms state-of-the-art methods on both detection and reasoning tasks. The code is available at: https://github.com/PRIS-CV/FakeReasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper formulates forgery detection and explanation as a unified Forgery Detection and Reasoning (FDR) task. It introduces the MMFR-Dataset (120K images from 10 generative models, 378K reasoning annotations) and proposes FakeReasoning, which combines a dual-branch CLIP+DINO visual encoder, a Forgery-Aware Feature Fusion Module that uses DINO attention maps and cross-attention to guide MLLMs, and a Classification Probability Mapper that couples language modeling with detection. Experiments are claimed to show robust generalization and SOTA performance on both detection and reasoning across multiple generators.

Significance. If the generalization results hold under held-out generator evaluation, the work would be significant for addressing domain gaps in AI-image detection and for providing interpretable reasoning. The MMFR-Dataset and the unified detection+reasoning formulation are clear contributions; open-sourcing the code strengthens reproducibility. The dual-encoder plus fusion approach offers a concrete mechanism for injecting low-level artifact signals into MLLMs.

major comments (2)
  1. [Experiments] Experiments section (and abstract): the central claim of 'robust generalization' across generative models requires explicit held-out-generator splits. The MMFR-Dataset description covers 10 models but does not state whether any model is completely excluded from training; if all splits are image-level within the same 10 models, the reported outperformance and generalization could be explained by generator-specific artifacts rather than transferable forgery cues, directly undermining the strongest claim.
  2. [§3.2] §3.2, Forgery-Aware Feature Fusion Module: the module's contribution rests on the assumption that DINO attention maps reliably surface synthesis artifacts across generators. No quantitative ablation isolating the fusion module's effect on held-out models, nor cross-generator attention-map visualizations, is referenced; without this, the claimed guidance of MLLMs toward forgery clues cannot be verified as load-bearing.
minor comments (2)
  1. [Abstract] Abstract and §4: quantitative metrics, error bars, dataset split details, and baseline comparisons are referenced only at a high level; adding a table summarizing detection accuracy and reasoning metrics per held-out model would improve clarity.
  2. [§3] Notation: the dual-branch encoder and probability mapper are described with component names but without explicit equations for the cross-attention or mapper; adding a short mathematical formulation would aid reproducibility.
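
The per-model table requested in the first minor comment could be produced with a few lines of bookkeeping. The sketch below is illustrative only: the row layout `(generator, true_label, predicted_label)` and the function names are hypothetical, and no numbers from the paper are used.

```python
from collections import defaultdict


def per_generator_accuracy(rows):
    """rows: iterable of (generator, true_label, predicted_label) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for generator, truth, prediction in rows:
        total[generator] += 1
        correct[generator] += int(truth == prediction)
    return {generator: correct[generator] / total[generator] for generator in total}


def format_table(accuracy_by_generator):
    """Render a plain-text table, one generator per line, sorted by name."""
    lines = [f"{'generator':<12}accuracy"]
    for generator in sorted(accuracy_by_generator):
        lines.append(f"{generator:<12}{accuracy_by_generator[generator]:.3f}")
    return "\n".join(lines)
```

Reporting one row per generator, with held-out generators marked, makes it visible whether an aggregate score hides weak transfer to particular models.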

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments below and will revise the manuscript to incorporate held-out generator evaluations and additional ablations.

Point-by-point responses
  1. Referee: [Experiments] Experiments section (and abstract): the central claim of 'robust generalization' across generative models requires explicit held-out-generator splits. The MMFR-Dataset description covers 10 models but does not state whether any model is completely excluded from training; if all splits are image-level within the same 10 models, the reported outperformance and generalization could be explained by generator-specific artifacts rather than transferable forgery cues, directly undermining the strongest claim.

    Authors: We acknowledge that the manuscript does not explicitly describe held-out generator splits. To strengthen the generalization claim, we will add a new set of experiments in the revision that train on a subset of the 10 models and evaluate on completely excluded generators, reporting detection and reasoning metrics to demonstrate transferable forgery cues. revision: yes

  2. Referee: [§3.2] §3.2, Forgery-Aware Feature Fusion Module: the module's contribution rests on the assumption that DINO attention maps reliably surface synthesis artifacts across generators. No quantitative ablation isolating the fusion module's effect on held-out models, nor cross-generator attention-map visualizations, is referenced; without this, the claimed guidance of MLLMs toward forgery clues cannot be verified as load-bearing.

    Authors: We agree that isolating the fusion module's contribution requires further evidence. In the revision we will add quantitative ablations of the Forgery-Aware Feature Fusion Module evaluated specifically on held-out generators, together with cross-generator attention-map visualizations, to verify that the module reliably guides the MLLM toward forgery-related clues. revision: yes

Circularity Check

0 steps flagged

No circularity; new dataset and additive modules evaluated empirically on external backbones

full rationale

The paper formulates a new FDR-Task, releases the MMFR-Dataset (120K images, 378K annotations across 10 generators), and defines FakeReasoning via three explicitly additive components (dual CLIP+DINO encoder, DINO-guided fusion module, probability mapper) that operate on publicly available CLIP/DINO/MLLM models. No equations, parameters, or central claims are shown to reduce by construction to fitted inputs or self-citations; performance and generalization statements are presented as outcomes of experiments rather than tautological re-statements of the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that MLLMs can perform reliable forgery reasoning when supplied with appropriate visual guidance, plus the implicit assumption that the new dataset annotations accurately capture forgery attributes.

axioms (1)
  • domain assumption: MLLMs can reliably reason over forgery attributes when guided by appropriate visual features from CLIP and DINO
    The framework depends on this to convert language-model output into accurate detection and explanations.



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. IncreFA: Breaking the Static Wall of Generative Model Attribution

    cs.CV · 2026-04 · unverdicted · novelty 6.0

    IncreFA uses hierarchical constraints with learnable orthogonal priors and a latent memory bank to enable continual adaptation for attributing images to new generative models, reporting SOTA accuracy and 98.93% unseen...

Reference graph

Works this paper leans on

87 extracted references · 87 canonical work pages · cited by 1 Pith paper · 10 internal anchors

  1. [1]

    Artificial intelligence and political deepfakes: Shaping citizen perceptions through misinformation,

    M. Momeni, “Artificial intelligence and political deepfakes: Shaping citizen perceptions through misinformation,” Journal of Creative Com- munications, vol. 20, no. 1, pp. 41–56, 2025

  2. [2]

    Financial fraud and manipulation: The malicious use of deepfakes in business,

    P. Kaushik, V . Garg, A. Priya, and S. Kant, “Financial fraud and manipulation: The malicious use of deepfakes in business,” in Deepfakes and Their Impact on Business . IGI Global Scientific Publishing, 2025, pp. 173–196

  3. [3]

    Concerns about the role of artificial intelligence in journalism, and media manipulation,

    S. Mahony and Q. Chen, “Concerns about the role of artificial intelligence in journalism, and media manipulation,” Journalism, p. 14648849241263293, 2024

  4. [4]

    Two-stage copy-move forgery detec- tion with self deep matching and proposal superglue,

    Y . Liu, C. Xia, X. Zhu, and S. Xu, “Two-stage copy-move forgery detec- tion with self deep matching and proposal superglue,”IEEE Transactions on Image Processing , vol. 31, pp. 541–555, 2021

  5. [5]

    Learning patch-channel corre- spondence for interpretable face forgery detection,

    Y . Hua, R. Shi, P. Wang, and S. Ge, “Learning patch-channel corre- spondence for interpretable face forgery detection,” IEEE Transactions on Image Processing , vol. 32, pp. 1668–1680, 2023. JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 12

  6. [6]

    Trufor: Leveraging all-round clues for trustworthy image forgery detection and localization,

    F. Guillaro, D. Cozzolino, A. Sud, N. Dufour, and L. Verdoliva, “Trufor: Leveraging all-round clues for trustworthy image forgery detection and localization,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , 2023, pp. 20 606–20 615

  7. [7]

    Image copy-move forgery detection via deep patchmatch and pairwise ranking learning,

    Y . Li, Y . He, C. Chen, L. Dong, B. Li, J. Zhou, and X. Li, “Image copy-move forgery detection via deep patchmatch and pairwise ranking learning,” IEEE Transactions on Image Processing , 2024

  8. [8]

    Self-supervised adversarial training for robust face forgery detection

    Y . Gao, W. Lin, J. Xu, W. Xu, and P. Chen, “Self-supervised adversarial training for robust face forgery detection.” in BMVC, 2023, p. 718

  9. [9]

    Any-resolution ai-generated image detection by spectral learning,

    D. Karageorgiou, S. Papadopoulos, I. Kompatsiaris, and E. Gavves, “Any-resolution ai-generated image detection by spectral learning,” in Proceedings of the Computer Vision and Pattern Recognition Confer- ence, 2025, pp. 18 706–18 717

  10. [10]

    Rethinking the up- sampling operations in cnn-based generative network for generalizable deepfake detection,

    C. Tan, Y . Zhao, S. Wei, G. Gu, P. Liu, and Y . Wei, “Rethinking the up- sampling operations in cnn-based generative network for generalizable deepfake detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , 2024, pp. 28 130–28 139

  11. [11]

    Faces blind your eyes: Unveiling the content-irrelevant synthetic arti- facts for deepfake detection,

    X. Fu, B. Fu, S. Chen, T. Yao, Y . Wang, S. Ding, X. Liang, and X. Li, “Faces blind your eyes: Unveiling the content-irrelevant synthetic arti- facts for deepfake detection,” IEEE Transactions on Image Processing , 2025

  12. [12]

    Towards universal fake image detec- tors that generalize across generative models,

    U. Ojha, Y . Li, and Y . J. Lee, “Towards universal fake image detec- tors that generalize across generative models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , 2023, pp. 24 480–24 489

  13. [13]

    Forgery- aware adaptive transformer for generalizable synthetic image detection,

    H. Liu, Z. Tan, C. Tan, Y . Wei, J. Wang, and Y . Zhao, “Forgery- aware adaptive transformer for generalizable synthetic image detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 10 770–10 780

  14. [14]

    C2p-clip: Injecting category common prompt in clip to enhance generalization in deepfake detection,

    C. Tan, R. Tao, H. Liu, G. Gu, B. Wu, Y . Zhao, and Y . Wei, “C2p-clip: Injecting category common prompt in clip to enhance generalization in deepfake detection,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 7, 2025, pp. 7184–7192

  15. [15]

    Common sense reasoning for deepfake detection,

    Y . Zhang, B. Colman, X. Guo, A. Shahriyari, and G. Bharaj, “Common sense reasoning for deepfake detection,” in European Conference on Computer Vision. Springer, 2024, pp. 399–415

  16. [16]

    Ffaa: Multimodal large language model based explainable open-world face forgery analysis assistant,

    Z. Huang, B. Xia, Z. Lin, Z. Mou, W. Yang, and J. Jia, “Ffaa: Multimodal large language model based explainable open-world face forgery analysis assistant,” arXiv preprint arXiv:2408.10072 , 2024

  17. [17]

    Fakeshield: Explainable image forgery detection and localization via multi-modal large language models,

    Z. Xu, X. Zhang, R. Li, Z. Tang, Q. Huang, and J. Zhang, “Fakeshield: Explainable image forgery detection and localization via multi-modal large language models,” arXiv preprint arXiv:2410.02761 , 2024

  18. [18]

    Sida: Social media image deepfake detection, localization and explanation with large multimodal model,

    Z. Huang, J. Hu, X. Li, Y . He, X. Zhao, B. Peng, B. Wu, X. Huang, and G. Cheng, “Sida: Social media image deepfake detection, localization and explanation with large multimodal model,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 28 831– 28 841

  19. [19]

    Cnn- generated images are surprisingly easy to spot... for now,

    S.-Y . Wang, O. Wang, R. Zhang, A. Owens, and A. A. Efros, “Cnn- generated images are surprisingly easy to spot... for now,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 8695–8704

  20. [20]

    Progressive Growing of GANs for Improved Quality, Stability, and Variation

    T. Karras, “Progressive growing of gans for improved quality, stability, and variation,” arXiv preprint arXiv:1710.10196 , 2017

  21. [21]

    Attributing fake images to gans: Learn- ing and analyzing gan fingerprints,

    N. Yu, L. S. Davis, and M. Fritz, “Attributing fake images to gans: Learn- ing and analyzing gan fingerprints,” in Proceedings of the IEEE/CVF international conference on computer vision , 2019, pp. 7556–7566

  22. [22]

    Dire for diffusion-generated image detection,

    Z. Wang, J. Bao, W. Zhou, W. Wang, H. Hu, H. Chen, and H. Li, “Dire for diffusion-generated image detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision , 2023, pp. 22 445–22 455

  23. [23]

    Fakeinversion: Learning to detect images from unseen text-to-image models by invert- ing stable diffusion,

    G. Cazenavette, A. Sud, T. Leung, and B. Usman, “Fakeinversion: Learning to detect images from unseen text-to-image models by invert- ing stable diffusion,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , 2024, pp. 10 759–10 769

  24. [24]

    Drct: Diffusion reconstruction contrastive training towards universal detection of diffusion generated images,

    B. Chen, J. Zeng, J. Yang, and R. Yang, “Drct: Diffusion reconstruction contrastive training towards universal detection of diffusion generated images,” in Forty-first International Conference on Machine Learning , 2024

  25. [25]

    Learning transferable visual models from natural language supervision,

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning . PMLR, 2021, pp. 8748–8763

  26. [26]

    De-fake: Detection and attribution of fake images generated by text-to-image generation models,

    Z. Sha, Z. Li, N. Yu, and Y . Zhang, “De-fake: Detection and attribution of fake images generated by text-to-image generation models,” in Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security, 2023, pp. 3418–3432

  27. [27]

    InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

    W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P. Fung, and S. Hoi, “Instructblip: Towards general-purpose vision-language models with instruction tuning,” arXiv preprint arXiv:2305.06500, 2023

  28. [28]

    MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning

    J. Chen, D. Zhu, X. Shen, X. Li, Z. Liu, P. Zhang, R. Krishnamoorthi, V . Chandra, Y . Xiong, and M. Elhoseiny, “Minigpt-v2: large language model as a unified interface for vision-language multi-task learning,” arXiv preprint arXiv:2310.09478 , 2023

  29. [29]

    Visual instruction tuning,

    H. Liu, C. Li, Q. Wu, and Y . J. Lee, “Visual instruction tuning,” Advances in neural information processing systems , vol. 36, 2024

  30. [30]

    Improved baselines with visual instruction tuning,

    H. Liu, C. Li, Y . Li, and Y . J. Lee, “Improved baselines with visual instruction tuning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , June 2024, pp. 26 296– 26 306

  31. [31]

    Llava-next: Improved reasoning, ocr, and world knowledge,

    H. Liu, C. Li, Y . Li, B. Li, Y . Zhang, S. Shen, and Y . J. Lee, “Llava-next: Improved reasoning, ocr, and world knowledge,” January 2024. [Online]. Available: https://llava-vl.github.io/blog/2024-01-30-llava-next/

  32. [32]

    Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality,

    W.-L. Chiang, Z. Li, Z. Lin, Y . Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y . Zhuang, J. E. Gonzalez et al., “Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality,” See https://vicuna. lmsys. org (accessed 14 April 2023) , vol. 2, no. 3, p. 6, 2023

  33. [33]

    LLaMA: Open and Efficient Foundation Language Models

    H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozi `ere, N. Goyal, E. Hambro, F. Azhar et al. , “Llama: Open and efficient foundation language models,”arXiv preprint arXiv:2302.13971, 2023

  34. [34]

    Qwen2.5-VL Technical Report

    S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, H. Zhong, Y . Zhu, M. Yang, Z. Li, J. Wan, P. Wang, W. Ding, Z. Fu, Y . Xu, J. Ye, X. Zhang, T. Xie, Z. Cheng, H. Zhang, Z. Yang, H. Xu, and J. Lin, “Qwen2.5-vl technical report,”arXiv preprint arXiv:2502.13923, 2025

  35. [35]

    Chatglm: A family of large language models from glm-130b to glm-4 all tools,

    T. GLM, A. Zeng, B. Xu, B. Wang, C. Zhang, D. Yin, D. Rojas, G. Feng, H. Zhao, H. Lai, H. Yu, H. Wang, J. Sun, J. Zhang, J. Cheng, J. Gui, J. Tang, J. Zhang, J. Li, L. Zhao, L. Wu, L. Zhong, M. Liu, M. Huang, P. Zhang, Q. Zheng, R. Lu, S. Duan, S. Zhang, S. Cao, S. Yang, W. L. Tam, W. Zhao, X. Liu, X. Xia, X. Zhang, X. Gu, X. Lv, X. Liu, X. Liu, X. Yang, ...

  36. [36]

    Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks,

    Z. Chen, J. Wu, W. Wang, W. Su, G. Chen, S. Xing, M. Zhong, Q. Zhang, X. Zhu, L. Lu et al. , “Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , 2024, pp. 24 185–24 198

  37. [37]

    DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

    Z. Wu, X. Chen, Z. Pan, X. Liu, W. Liu, D. Dai, H. Gao, Y . Ma, C. Wu, B. Wang et al. , “Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding,” arXiv preprint arXiv:2412.10302, 2024

  38. [38]

    Deferred neural rendering: Image synthesis using neural textures,

    J. Thies, M. Zollh ¨ofer, and M. Nießner, “Deferred neural rendering: Image synthesis using neural textures,” Acm Transactions on Graphics (TOG), vol. 38, no. 4, pp. 1–12, 2019

  39. [39]

    Gandiffface: Controllable generation of synthetic datasets for face recognition with realistic variations,

    P. Melzi, C. Rathgeb, R. Tolosana, R. Vera-Rodriguez, D. Lawatsch, F. Domin, and M. Schaubert, “Gandiffface: Controllable generation of synthetic datasets for face recognition with realistic variations,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 3086–3095

  40. [40]

    Fakebench: Uncover the achilles’ heels of fake images with large multimodal models,

    Y . Li, X. Liu, X. Wang, S. Wang, and W. Lin, “Fakebench: Uncover the achilles’ heels of fake images with large multimodal models,” arXiv e-prints, pp. arXiv–2404, 2024

  41. [41]

    Improving image generation with better captions,

    J. Betker, G. Goh, L. Jing, T. Brooks, J. Wang, L. Li, L. Ouyang, J. Zhuang, J. Lee, Y . Guoet al., “Improving image generation with better captions,” Computer Science. https://cdn. openai. com/papers/dall-e-3. pdf, vol. 2, no. 3, p. 8, 2023

  42. [42]

    Loki: A comprehensive synthetic data de- tection benchmark using large multimodal models,

    J. Ye, B. Zhou, Z. Huang, J. Zhang, T. Bai, H. Kang, J. He, H. Lin, Z. Wang, T. Wu et al. , “Loki: A comprehensive synthetic data de- tection benchmark using large multimodal models,” arXiv preprint arXiv:2410.09732, 2024

  43. [43]

    Black Forest Labs, “Flux,” https://github.com/black-forest-labs/flux, 2024

  44. [44]

    Repaint: Inpainting using denoising diffusion probabilistic models,

    A. Lugmayr, M. Danelljan, A. Romero, F. Yu, R. Timofte, and L. Van Gool, “Repaint: Inpainting using denoising diffusion probabilistic models,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition , 2022, pp. 11 461–11 471

  45. [45]

    Legion: Learning to ground and explain for synthetic image detection,

    H. Kang, S. Wen, Z. Wen, J. Ye, W. Li, P. Feng, B. Zhou, B. Wang, D. Lin, L. Zhang et al. , “Legion: Learning to ground and explain for synthetic image detection,” arXiv preprint arXiv:2503.15264 , 2025. JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 13

  46. [46]

    Noiseprint: A cnn-based camera model fingerprint,

    D. Cozzolino and L. Verdoliva, “Noiseprint: A cnn-based camera model fingerprint,” IEEE Transactions on Information Forensics and Security , vol. 15, pp. 144–159, 2019

  47. [47]

    Hierarchical fine-grained image forgery detection and localization,

    X. Guo, X. Liu, Z. Ren, S. Grosz, I. Masi, and X. Liu, “Hierarchical fine-grained image forgery detection and localization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 3155–3165

  48. [48]

    Antifakeprompt: Prompt- tuned vision-language models are fake image detectors,

    Y .-M. Chang, C. Yeh, W.-C. Chiu, and N. Yu, “Antifakeprompt: Prompt- tuned vision-language models are fake image detectors,” arXiv preprint arXiv:2310.17419, 2023

  49. [49]

    Bi-lora: A vision-language approach for synthetic image detection,

    M. Keita, W. Hamidouche, H. Bougueffa Eutamene, A. Taleb-Ahmed, D. Camacho, and A. Hadid, “Bi-lora: A vision-language approach for synthetic image detection,” Expert Systems , vol. 42, no. 2, p. e13829, 2025

  50. [50]

    LoRA: Low-Rank Adaptation of Large Language Models

    E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, and W. Chen, “Lora: Low-rank adaptation of large language models,” arXiv preprint arXiv:2106.09685 , 2021

  51. [51]

    Blip-2: Bootstrapping language- image pre-training with frozen image encoders and large language models,

    J. Li, D. Li, S. Savarese, and S. Hoi, “Blip-2: Bootstrapping language- image pre-training with frozen image encoders and large language models,” in International conference on machine learning . PMLR, 2023, pp. 19 730–19 742

  52. [52]

    Lisa: Reasoning segmentation via large language model,

    X. Lai, Z. Tian, Y . Chen, Y . Li, Y . Yuan, S. Liu, and J. Jia, “Lisa: Reasoning segmentation via large language model,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 9579–9589

  53. [53]

    D3: Scaling up deepfake detection by learning from discrepancy,

    Y . Yang, Z. Qian, Y . Zhu, O. Russakovsky, and Y . Wu, “D3: Scaling up deepfake detection by learning from discrepancy,” in Proceedings of the Computer Vision and Pattern Recognition Conference , 2025, pp. 23 850–23 859

  54. [54]

    Diffusiondb: A large-scale prompt gallery dataset for text-to- image generative models,

    Z. J. Wang, E. Montoya, D. Munechika, H. Yang, B. Hoover, and D. H. Chau, “Diffusiondb: A large-scale prompt gallery dataset for text-to- image generative models,” arXiv preprint arXiv:2210.14896 , 2022

  55. [55]

    Laion- 5b: An open large-scale dataset for training next generation image-text models,

    C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsmanet al., “Laion- 5b: An open large-scale dataset for training next generation image-text models,” Advances in Neural Information Processing Systems , vol. 35, pp. 25 278–25 294, 2022

  56. [56]

    DALLE-3, https://huggingface.co/datasets/ehristoforu/dalle-3-images, 2023

  57. [57]

    Genimage: A million-scale benchmark for detecting ai- generated image,

    M. Zhu, H. Chen, Q. Yan, X. Huang, G. Lin, W. Li, Z. Tu, H. Hu, J. Hu, and Y . Wang, “Genimage: A million-scale benchmark for detecting ai- generated image,” Advances in Neural Information Processing Systems , vol. 36, pp. 77 771–77 782, 2023

  58. [58]

    Kandinsky, https://huggingface.co/datasets/diffusers-parti-prompts/ kandinsky-2-2, 2023

  59. [59]

    PixArt- α, https://huggingface.co/datasets/PixArt-alpha/PixArt-Eval30K, 2024

  60. [60]

    FLUX, https://huggingface.co/datasets/lehduong/flux generated, 2025

  61. [61]

    GPT-4o, https://huggingface.co/datasets/FreedomIntelligence/ShareGPT-4o-Image, 2025

  62. [62]

    Raising the Bar of AI-generated Image Detection with CLIP,

    D. Cozzolino, G. Poggi, R. Corvi, M. Nießner, and L. Verdoliva, “Raising the Bar of AI-generated Image Detection with CLIP,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024, pp. 4356–4366

  63. [63]

    ImageNet Large Scale Visual Recognition Challenge,

    O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015

  64. [64]

    A-bench: Are lmms masters at evaluating ai-generated images?

    Z. Zhang, H. Wu, C. Li, Y. Zhou, W. Sun, X. Min, Z. Chen, X. Liu, W. Lin, and G. Zhai, “A-bench: Are lmms masters at evaluating ai-generated images?” arXiv preprint arXiv:2406.03070, 2024

  65. [65]

    Bioinstruct: instruction tuning of large language models for biomedical natural language processing,

    H. Tran, Z. Yang, Z. Yao, and H. Yu, “Bioinstruct: instruction tuning of large language models for biomedical natural language processing,” Journal of the American Medical Informatics Association, vol. 31, no. 9, pp. 1821–1832, 2024

  66. [66]

    Visual cot: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning,

    H. Shao, S. Qian, H. Xiao, G. Song, Z. Zong, L. Wang, Y. Liu, and H. Li, “Visual cot: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning,” Advances in Neural Information Processing Systems, vol. 37, pp. 8612–8642, 2024

  67. [67]

    LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

    G. Xu, P. Jin, L. Hao, Y. Song, L. Sun, and L. Yuan, “Llava-o1: Let vision language models reason step-by-step,” arXiv preprint arXiv:2411.10440, 2024

  68. [68]

    Driverx: A vision-language reasoning model for cross-task autonomous driving,

    M. Diao, L. Yang, H. Yin, Z. Wang, Y. Wang, D. Tian, K. Liang, and Z. Ma, “Driverx: A vision-language reasoning model for cross-task autonomous driving,” arXiv preprint arXiv:2505.20665, 2025

  69. [69]

    Patchcraft: Exploring texture patch for efficient ai-generated image detection

    N. Zhong, Y. Xu, S. Li, Z. Qian, and X. Zhang, “Patchcraft: Exploring texture patch for efficient ai-generated image detection,” arXiv preprint arXiv:2311.12397, 2023

  70. [70]

    Mvss-net: Multi-view multi-scale supervised networks for image manipulation detection,

    C. Dong, X. Chen, R. Hu, J. Cao, and X. Li, “Mvss-net: Multi-view multi-scale supervised networks for image manipulation detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 3, pp. 3539–3553, 2022

  71. [71]

    Lideepdet: Deepfake detection via image decomposition and advanced lighting information analysis,

    Z. Lai, J. Li, C. Wang, J. Wu, and D. Jiang, “Lideepdet: Deepfake detection via image decomposition and advanced lighting information analysis,” Electronics, vol. 13, no. 22, p. 4466, 2024

  72. [72]

    Controlling vision-language models for multi-task image restoration,

    Z. Luo, F. K. Gustafsson, Z. Zhao, J. Sjölund, and T. B. Schön, “Controlling vision-language models for multi-task image restoration,” arXiv preprint arXiv:2310.01018, 2023

  73. [73]

    Do computer vision foundation models learn the low-level characteristics of the human visual system?

    Y. Cai, F. Yin, D. Hammou, and R. Mantiuk, “Do computer vision foundation models learn the low-level characteristics of the human visual system?” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 20 039–20 048

  74. [74]

    Rigid: A training-free and model-agnostic framework for robust ai-generated image detection

    Z. He, P.-Y. Chen, and T.-Y. Ho, “Rigid: A training-free and model-agnostic framework for robust ai-generated image detection,” arXiv preprint arXiv:2405.20112, 2024

  75. [75]

    Emerging properties in self-supervised vision transformers,

    M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin, “Emerging properties in self-supervised vision transformers,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 9650–9660

  76. [76]

    Eyes wide shut? exploring the visual shortcomings of multimodal llms,

    S. Tong, Z. Liu, Y. Zhai, Y. Ma, Y. LeCun, and S. Xie, “Eyes wide shut? exploring the visual shortcomings of multimodal llms,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 9568–9578

  77. [77]

    From clip to dino: Visual encoders shout in multi-modal large language models,

    D. Jiang, Y. Liu, S. Liu, J. Zhao, H. Zhang, Z. Gao, X. Zhang, J. Li, and H. Xiong, “From clip to dino: Visual encoders shout in multi-modal large language models,” arXiv preprint arXiv:2310.08825, 2023

  78. [78]

    DINOv2: Learning Robust Visual Features without Supervision

    M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby et al., “Dinov2: Learning robust visual features without supervision,” arXiv preprint arXiv:2304.07193, 2023

  79. [79]

    Attention is all you need,

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017

  80. [80]

    An image is worth 16x16 words: Transformers for image recognition at scale,

    A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” ICLR, 2021
