
arxiv: 2410.10238 · v4 · submitted 2024-10-14 · 💻 cs.CV · cs.AI

ForgeryGPT: A Multimodal LLM for Interpretable Image Forgery Detection and Localization

Pith reviewed 2026-05-23 19:03 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords image forgery detection · multimodal large language model · forgery localization · explainable detection · mask encoder · vision language alignment · IFDL task

The pith

ForgeryGPT integrates a mask-aware extractor into a multimodal LLM to enable explainable image forgery detection and localization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes ForgeryGPT to address limitations in image forgery detection and localization. Existing methods rely on low-level clues and provide only single judgments, while general multimodal LLMs struggle with this task. ForgeryGPT captures high-order forensics knowledge from linguistic feature spaces and enables explainable generation and interactive dialogue. It does so by integrating a Mask-Aware Forgery Extractor into a customized LLM architecture and using a three-stage training strategy with new datasets for vision-language alignment.

Core claim

ForgeryGPT advances the IFDL task by capturing high-order forensics knowledge correlations of forged images from diverse linguistic feature spaces, while enabling explainable generation and interactive dialogue through a newly customized Large Language Model architecture. Specifically, it enhances traditional LLMs by integrating the Mask-Aware Forgery Extractor, which extracts precise forgery mask information from input images and facilitates pixel-level understanding of tampering artifacts. The extractor consists of a Forgery Localization Expert, augmented with an Object-agnostic Forgery Prompt and a Vocabulary-enhanced Vision Encoder, along with a Mask Encoder.

What carries the argument

Mask-Aware Forgery Extractor that excavates precise forgery mask information from input images for pixel-level understanding of tampering artifacts.
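
The data flow named above (image, then a localization expert, then a mask, then a mask encoder) can be pictured with a small sketch. Everything in it is an assumption made for illustration: the class name `MaskAwareExtractorSketch`, the 0.5 threshold, and the toy stand-ins `toy_expert` and `toy_mask_encoder` are hypothetical and are not taken from the paper, which may wire these parts together quite differently.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

ScoreMap = List[List[float]]  # per-pixel forgery scores in [0, 1]
Mask = List[List[int]]        # 1 marks a pixel flagged as tampered


@dataclass
class MaskAwareExtractorSketch:
    """Hypothetical two-step extractor: localize first, then encode the mask."""
    localization_expert: Callable[[object], ScoreMap]  # stand-in for the FL-Expert
    mask_encoder: Callable[[Mask], List[float]]        # stand-in for the Mask Encoder
    threshold: float = 0.5                             # assumed binarization cut-off

    def __call__(self, image: object) -> Tuple[Mask, List[float]]:
        scores = self.localization_expert(image)
        # Binarize the score map into a forgery mask.
        mask = [[1 if s >= self.threshold else 0 for s in row] for row in scores]
        # Turn the mask into features a downstream language model could consume.
        return mask, self.mask_encoder(mask)


def toy_expert(image: ScoreMap) -> ScoreMap:
    """Toy stand-in: treats the input itself as the score map."""
    return image


def toy_mask_encoder(mask: Mask) -> List[float]:
    """Toy stand-in: a single feature, the fraction of flagged pixels."""
    total = sum(len(row) for row in mask)
    return [sum(sum(row) for row in mask) / total]


extractor = MaskAwareExtractorSketch(toy_expert, toy_mask_encoder)
mask, features = extractor([[0.1, 0.9], [0.7, 0.2]])
```

The only point of the sketch is the ordering: the language model never sees raw scores, only whatever the mask encoder produces from the binarized mask, so errors in localization propagate into every later stage.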

If this is right

  • Supports explainable generation of detection results beyond single judgments.
  • Enables interactive dialogue about the forgery analysis.
  • Captures multi-scale fine-grained forgery details for improved accuracy.
  • Aligns vision and language modalities through dedicated datasets.
  • Improves instruction-following capabilities for IFDL tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Such a system could be adapted for detecting forgeries in video or other media types.
  • The use of linguistic feature spaces might reveal patterns not visible in purely visual methods.
  • Interactive features could facilitate collaboration between AI and human experts in forensic analysis.
  • Testing on more diverse real-world datasets would validate its robustness beyond controlled experiments.

Load-bearing premise

The Mask-Aware Forgery Extractor can excavate precise forgery mask information from input images to enable pixel-level understanding of tampering artifacts.

What would settle it

An experiment where the model fails to produce accurate pixel-level forgery masks on a benchmark dataset with varied tampering techniques would disprove the central claim.
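
To make "accurate pixel-level forgery masks" measurable, such an experiment would typically score each predicted mask against its ground truth. The sketch below uses intersection-over-union and pixel-level F1, two common choices for localization; whether the paper reports exactly these metrics is an assumption, and the function names are illustrative.

```python
from typing import Sequence, Tuple

Mask = Sequence[Sequence[int]]  # rows of 0/1 pixels; 1 marks a tampered pixel


def confusion_counts(pred: Mask, truth: Mask) -> Tuple[int, int, int]:
    """Return (true positives, false positives, false negatives) over all pixels."""
    if len(pred) != len(truth) or any(len(p) != len(t) for p, t in zip(pred, truth)):
        raise ValueError("masks must have the same shape")
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
    return tp, fp, fn


def mask_iou(pred: Mask, truth: Mask) -> float:
    """Intersection-over-union of the tampered regions (1.0 if both are empty)."""
    tp, fp, fn = confusion_counts(pred, truth)
    union = tp + fp + fn
    return 1.0 if union == 0 else tp / union


def mask_f1(pred: Mask, truth: Mask) -> float:
    """Pixel-level F1 score of the tampered class (1.0 if both are empty)."""
    tp, fp, fn = confusion_counts(pred, truth)
    denom = 2 * tp + fp + fn
    return 1.0 if denom == 0 else 2 * tp / denom


truth = [[0, 1, 1], [0, 1, 1], [0, 0, 0]]
pred = [[0, 0, 1], [0, 1, 1], [0, 0, 1]]
print(mask_iou(pred, truth), mask_f1(pred, truth))  # 0.6 0.75
```

In this toy case three tampered pixels are found, one is missed, and one clean pixel is wrongly flagged, giving an IoU of 3/5 and an F1 of 6/8. Consistently low scores of this kind across varied tampering techniques would count against the central claim.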

Figures

Figures reproduced from arXiv: 2410.10238 by Dong Li, Esther Sun, Fanrui Zhang, Jiawei Liu, Jiaying Zhu, Qiang Zhang, Zheng-Jun Zha.

Figure 1: Comparison between our ForgeryGPT and existing methods. “Forgery Mask” denotes the ground truth mask of the …
Figure 2: Overview of the proposed ForgeryGPT. The left panel shows the overall architecture, which comprises an Image …
Figure 3: Overview of FL-Expert. It consists of the Object-agnostic Forgery Prompt module, frozen CLIP text and vision encoders, …
Figure 4: Two training data generation pipelines: one for Mask-Text Alignment Pre-training, which creates caption data from …
Figure 5: Our splicing method for constructing multi-granularity …
Figure 6: An example from the dataset to illustrate the Mask-Text Alignment Pre-training data.
Figure 7: An example from the dataset to illustrate the Task-Specific Instruction Tuning data.
Figure 8: Evaluation of the values of the object-agnostic learnable …
Figure 9: Results of human evaluation. The left side shows the …
Figure 10: Visualizing the performance impact of ForgeryGPT …
Figure 11: Visualization of the predicted manipulation mask by different methods. From left to right, we show forged images, …
Figure 12: Perception capability of CLIP for the authenticity-forgery attributes of images. For each object prompt, we prepend …
Figure 14: The generalization ability of ForgeryGPT in industrial …

Original abstract

Multimodal Large Language Models (MLLMs), such as GPT4o, have shown strong capabilities in visual reasoning and explanation generation. However, despite these strengths, they face significant challenges in the increasingly critical task of Image Forgery Detection and Localization (IFDL). Moreover, existing IFDL methods are typically limited to the learning of low-level semantic-agnostic clues and merely provide a single outcome judgment. To tackle these issues, we propose ForgeryGPT, a novel framework that advances the IFDL task by capturing high-order forensics knowledge correlations of forged images from diverse linguistic feature spaces, while enabling explainable generation and interactive dialogue through a newly customized Large Language Model (LLM) architecture. Specifically, ForgeryGPT enhances traditional LLMs by integrating the Mask-Aware Forgery Extractor, which enables the excavating of precise forgery mask information from input images and facilitating pixel-level understanding of tampering artifacts. The Mask-Aware Forgery Extractor consists of a Forgery Localization Expert (FL-Expert) and a Mask Encoder, where the FL-Expert is augmented with an Object-agnostic Forgery Prompt and a Vocabulary-enhanced Vision Encoder, allowing for effectively capturing of multi-scale fine-grained forgery details. To enhance its performance, we implement a three-stage training strategy, supported by our designed Mask-Text Alignment and IFDL Task-Specific Instruction Tuning datasets, which align vision-language modalities and improve forgery detection and instruction-following capabilities. Extensive experiments demonstrate the effectiveness of the proposed method.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes ForgeryGPT, a multimodal LLM framework for image forgery detection and localization (IFDL). It augments an LLM with a Mask-Aware Forgery Extractor (Forgery Localization Expert or FL-Expert, augmented by an Object-agnostic Forgery Prompt and Vocabulary-enhanced Vision Encoder, plus a Mask Encoder) to extract precise forgery masks for pixel-level tampering understanding. A three-stage training strategy uses custom Mask-Text Alignment and IFDL Task-Specific Instruction Tuning datasets to align modalities and improve detection/instruction-following. The work claims to capture high-order forensics knowledge correlations across linguistic spaces while enabling explainable outputs and interactive dialogue, with the abstract asserting that extensive experiments demonstrate effectiveness over prior low-level IFDL methods.

Significance. If the central claims hold, the integration of specialized forgery extraction with MLLM reasoning could advance IFDL by adding interpretability and interactivity beyond binary or low-level outputs. The three-stage training and custom alignment datasets represent a structured effort to bridge vision and language for forensics, which is a potentially useful direction if the extractor delivers on pixel-level precision.

major comments (2)
  1. [Abstract] Abstract: the claim that 'extensive experiments demonstrate the effectiveness' is unsupported because the abstract (and the provided description) supplies no quantitative results, ablation studies, baseline comparisons, or error analysis. Without these, it is impossible to verify whether the architecture supports the performance claims on high-order correlations or localization.
  2. [Method (Mask-Aware Forgery Extractor)] Method description of the Mask-Aware Forgery Extractor: the central claim that this module 'enables the excavating of precise forgery mask information from input images and facilitating pixel-level understanding of tampering artifacts' is load-bearing, yet the description provides no concrete mechanism (e.g., mask-prediction loss, supervision signal, or architectural difference from standard vision encoders) that would guarantee focus on tampering artifacts rather than generic object boundaries. This directly affects whether the subsequent LLM stages can be shown to advance IFDL.
minor comments (1)
  1. [Abstract] The abstract is lengthy and could be condensed while retaining the core technical contributions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major comment below and will make revisions to improve clarity and support for the claims.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that 'extensive experiments demonstrate the effectiveness' is unsupported because the abstract (and the provided description) supplies no quantitative results, ablation studies, baseline comparisons, or error analysis. Without these, it is impossible to verify whether the architecture supports the performance claims on high-order correlations or localization.

    Authors: We agree that the abstract would be strengthened by including key quantitative results. The full manuscript contains sections detailing experiments with baseline comparisons, ablation studies on the FL-Expert components, and metrics for pixel-level localization and detection. In revision we will update the abstract to reference specific performance gains, such as improved localization IoU over prior IFDL methods. revision: yes

  2. Referee: [Method (Mask-Aware Forgery Extractor)] Method description of the Mask-Aware Forgery Extractor: the central claim that this module 'enables the excavating of precise forgery mask information from input images and facilitating pixel-level understanding of tampering artifacts' is load-bearing, yet the description provides no concrete mechanism (e.g., mask-prediction loss, supervision signal, or architectural difference from standard vision encoders) that would guarantee focus on tampering artifacts rather than generic object boundaries. This directly affects whether the subsequent LLM stages can be shown to advance IFDL.

    Authors: The abstract provides a high-level overview. The full method section describes the Object-agnostic Forgery Prompt and Vocabulary-enhanced Vision Encoder as mechanisms to prioritize forgery artifacts over object boundaries, with the three-stage training using the Mask-Text Alignment dataset for supervision. We acknowledge the need for explicit details on the mask-prediction loss and supervision signals. We will expand the method description to include these elements and clarify the architectural differences from standard encoders. revision: yes
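
For readers wondering what a "mask-prediction loss" of the kind the referee asks about usually looks like, the sketch below combines binary cross-entropy with a soft Dice term, a common pairing for segmentation-style supervision. It is offered as an assumption only: the page does not state which loss ForgeryGPT uses, and the function name, the `dice_weight` parameter, and the `eps` smoothing constant are illustrative.

```python
import math
from typing import Sequence


def bce_dice_loss(probs: Sequence[float], targets: Sequence[int],
                  dice_weight: float = 1.0, eps: float = 1e-7) -> float:
    """Binary cross-entropy plus soft Dice loss over flattened mask pixels.

    probs   -- predicted probability that each pixel is tampered
    targets -- ground-truth labels, 1 for tampered and 0 for authentic
    """
    if len(probs) != len(targets) or not probs:
        raise ValueError("probs and targets must be non-empty and equally long")

    # Pixel-wise cross-entropy; clamp probabilities to keep log() finite.
    bce = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)
        bce -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    bce /= len(probs)

    # Soft Dice rewards overlap and is less sensitive to small tampered regions.
    intersection = sum(p * t for p, t in zip(probs, targets))
    dice = (2.0 * intersection + eps) / (sum(probs) + sum(targets) + eps)

    return bce + dice_weight * (1.0 - dice)


targets = [1, 0, 1]
good = bce_dice_loss([1.0, 0.0, 1.0], targets)  # near-perfect prediction
bad = bce_dice_loss([0.0, 1.0, 0.0], targets)   # fully inverted prediction
```

A loss of this shape is supervised by ground-truth forgery masks rather than object outlines, which is the kind of detail that would address the concern about generic object boundaries.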

Circularity Check

0 steps flagged

No circularity detected; derivation is self-contained architectural proposal

full rationale

The paper proposes ForgeryGPT as a novel MLLM framework integrating a Mask-Aware Forgery Extractor (FL-Expert with Object-agnostic Forgery Prompt, Vocabulary-enhanced Vision Encoder, and Mask Encoder) plus three-stage training on custom Mask-Text Alignment and IFDL datasets. No equations, parameter-fitting steps, or self-citation chains appear in the provided text that reduce any claimed prediction or result to the inputs by construction. The central claims rest on the described modules and experimental validation rather than any definitional equivalence or fitted-input renaming, making the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the unproven effectiveness of the proposed Mask-Aware Forgery Extractor and the three-stage training procedure; no free parameters, axioms, or invented entities are explicitly quantified in the abstract.

pith-pipeline@v0.9.0 · 5817 in / 1130 out tokens · 20634 ms · 2026-05-23T19:03:55.360761+00:00 · methodology


Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ReAlign: Generalizable Image Forgery Detection via Reasoning-Aligned Representation

    cs.CV 2026-05 unverdicted novelty 7.0

    ReAlign distills LLM-generated reasoning texts into a lightweight AIGI forgery detector via contrastive image-text alignment to improve generalization on complex forgeries.

  2. Code-in-the-Loop Forensics: Agentic Tool Use for Image Forgery Detection

    cs.AI 2025-12 unverdicted novelty 7.0

    ForenAgent lets MLLMs create and iteratively improve low-level Python tools for image forgery detection via a two-stage training pipeline and a new 100k-image benchmark dataset.

  3. Venus-DeFakerOne: Unified Fake Image Detection & Localization

    cs.CV 2026-05 unverdicted novelty 6.0

    DeFakerOne integrates InternVL2 and SAM2 into a single model that achieves state-of-the-art results on 39 detection and 9 localization benchmarks for unified fake image detection and localization.

  4. UniGenDet: A Unified Generative-Discriminative Framework for Co-Evolutionary Image Generation and Generated Image Detection

    cs.CV 2026-04 unverdicted novelty 5.0

    UniGenDet unifies generative and discriminative models through symbiotic self-attention and detector-guided alignment to co-evolve image generation and authenticity detection.

Reference graph

Works this paper leans on

69 extracted references · 69 canonical work pages · cited by 4 Pith papers · 7 internal anchors

  1. [1]

    Towards jpeg-resistant image forgery detection and localization via self-supervised domain adapta- tion,

    Y . Rao, J. Ni, W. Zhang, and J. Huang, “Towards jpeg-resistant image forgery detection and localization via self-supervised domain adapta- tion,” IEEE Transactions on Pattern Analysis and Machine Intelligence , 2022

  2. [2]

    Detecting and grounding multi-modal media manipulation and beyond,

    R. Shao, T. Wu, J. Wu, L. Nie, and Z. Liu, “Detecting and grounding multi-modal media manipulation and beyond,” IEEE Transactions on Pattern Analysis and Machine Intelligence , 2024

  3. [3]

    Face forgery detection by 3d decomposition and composition search,

    X. Zhu, H. Fei, B. Zhang, T. Zhang, X. Zhang, S. Z. Li, and Z. Lei, “Face forgery detection by 3d decomposition and composition search,” IEEE Transactions on Pattern Analysis and Machine Intelligence , vol. 45, no. 7, pp. 8342–8357, 2023

  4. [4]

    A principled design of image representation: Towards forensic tasks,

    S. Qi, Y . Zhang, C. Wang, J. Zhou, and X. Cao, “A principled design of image representation: Towards forensic tasks,” IEEE Transactions on Pattern Analysis and Machine Intelligence , vol. 45, no. 5, pp. 5337– 5354, 2022

  5. [5]

    Fully unsupervised deepfake video detection via enhanced contrastive learning,

    T. Qiao, S. Xie, Y . Chen, F. Retraint, and X. Luo, “Fully unsupervised deepfake video detection via enhanced contrastive learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence , 2024

  6. [6]

    Photorealistic text-to-image diffusion models with deep language understanding,

    C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. L. Denton, K. Ghasemipour, R. Gontijo Lopes, B. Karagol Ayan, T. Salimans et al., “Photorealistic text-to-image diffusion models with deep language understanding,” Advances in Neural Information Processing Systems , vol. 35, pp. 36 479–36 494, 2022

  7. [7]

    GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

    A. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. McGrew, I. Sutskever, and M. Chen, “Glide: Towards photorealistic image gen- eration and editing with text-guided diffusion models,” arXiv preprint arXiv:2112.10741, 2021

  8. [8]

    Forgery- aware adaptive transformer for generalizable synthetic image detection,

    H. Liu, Z. Tan, C. Tan, Y . Wei, J. Wang, and Y . Zhao, “Forgery- aware adaptive transformer for generalizable synthetic image detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 10 770–10 780

  9. [9]

    Learning rich features for image manipulation detection,

    P. Zhou, X. Han, V . I. Morariu, and L. S. Davis, “Learning rich features for image manipulation detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , 2018, pp. 1053–1061

  10. [10]

    Image manipulation detection by multi-view multi-scale supervision,

    X. Chen, C. Dong, J. Ji, J. Cao, and X. Li, “Image manipulation detection by multi-view multi-scale supervision,” in Proceedings of the IEEE/CVF International Conference on Computer Vision , 2021, pp. 14 185–14 193

  11. [11]

    Edge-aware regional message passing controller for image forgery localization,

    D. Li, J. Zhu, M. Wang, J. Liu, X. Fu, and Z.-J. Zha, “Edge-aware regional message passing controller for image forgery localization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 8222–8232

  12. [12]

    Learning discriminative noise guidance for image forgery detection and localization,

    J. Zhu, D. Li, X. Fu, G. Yang, J. Huang, A. Liu, and Z.-J. Zha, “Learning discriminative noise guidance for image forgery detection and localization,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 7, 2024, pp. 7739–7747

  13. [13]

    Trufor: Leveraging all-round clues for trustworthy image forgery detection and localization,

    F. Guillaro, D. Cozzolino, A. Sud, N. Dufour, and L. Verdoliva, “Trufor: Leveraging all-round clues for trustworthy image forgery detection and localization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , 2023, pp. 20 606–20 615

  14. [14]

    Diffforensics: Leveraging diffu- sion prior to image forgery detection and localization,

    Z. Yu, J. Ni, Y . Lin, H. Deng, and B. Li, “Diffforensics: Leveraging diffu- sion prior to image forgery detection and localization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 12 765–12 774

  15. [15]

    Objectformer for image manipulation detection and localization,

    J. Wang, Z. Wu, J. Chen, X. Han, A. Shrivastava, S.-N. Lim, and Y .-G. Jiang, “Objectformer for image manipulation detection and localization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 2364–2373

  16. [16]

    A Survey on Multimodal Large Language Models

    S. Yin, C. Fu, S. Zhao, K. Li, X. Sun, T. Xu, and E. Chen, “A survey on multimodal large language models,” CoRR, vol. abs/2306.13549,

  17. [17]

    A Survey on Multimodal Large Language Models

    [Online]. Available: https://doi.org/10.48550/arXiv.2306.13549

  18. [18]

    GPT-4 technical report,

    OpenAI, “GPT-4 technical report,” 2023

  19. [19]

    Visual instruction tuning,

    H. Liu, C. Li, Q. Wu, and Y . J. Lee, “Visual instruction tuning,” in Advances in Neural Information Processing Systems , 2023

  20. [20]

    High- resolution image synthesis with latent diffusion models,

    R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High- resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion, 2022, pp. 10 684–10 695

  21. [21]

    GPT-4v(ision) system card,

    OpenAI, “GPT-4v(ision) system card,” https://cdn.openai.com/papers/ GPTV System Card.pdf, 2023

  22. [22]

    The point where reality meets fantasy: Mixed adversarial generators for image splice detection,

    V . V . Kniaz, V . Knyaz, and F. Remondino, “The point where reality meets fantasy: Mixed adversarial generators for image splice detection,” Advances in Neural Information Processing Systems , vol. 32, 2019

  23. [23]

    Casia image tampering detection evaluation database,

    J. Dong, W. Wang, and T. Tan, “Casia image tampering detection evaluation database,” in 2013 IEEE China Summit and International Conference on Signal and Information Processing . IEEE, 2013, pp. 422–426

  24. [24]

    Noiseprint: a cnn-based camera model fingerprint,

    D. Cozzolino and L. Verdoliva, “Noiseprint: a cnn-based camera model fingerprint,” IEEE Transactions on Information Forensics and Security , vol. 15, pp. 144–159, 2019

  25. [25]

    Coverage—a novel database for copy-move forgery detection,

    B. Wen, Y . Zhu, R. Subramanian, T.-T. Ng, X. Shen, and S. Winkler, “Coverage—a novel database for copy-move forgery detection,” in 2016 IEEE International Conference on Image Processing (ICIP) . IEEE, 2016, pp. 161–165

  26. [27]

    Spatiotemporal trident networks: detection and localization of object removal tampering in video passive forensics,

    Q. Yang, D. Yu, Z. Zhang, Y . Yao, and L. Chen, “Spatiotemporal trident networks: detection and localization of object removal tampering in video passive forensics,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 31, no. 10, pp. 4131–4144, 2020

  27. [28]

    A deep learning approach to patch-based image inpainting forensics,

    X. Zhu, Y . Qian, X. Zhao, B. Sun, and Y . Sun, “A deep learning approach to patch-based image inpainting forensics,” Signal Processing: Image Communication, vol. 67, pp. 90–99, 2018

  28. [29]

    Mantra-net: Manipulation tracing network for detection and localization of image forgeries with anomalous features,

    Y . Wu, W. AbdAlmageed, and P. Natarajan, “Mantra-net: Manipulation tracing network for detection and localization of image forgeries with anomalous features,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , 2019, pp. 9543–9552

  29. [30]

    Span: Spatial pyramid attention network for image manipulation localization,

    X. Hu, Z. Zhang, Z. Jiang, S. Chaudhuri, Z. Yang, and R. Nevatia, “Span: Spatial pyramid attention network for image manipulation localization,” in European Conference on Computer Vision. Springer, 2020, pp. 312– 328. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. XX, NO. XX, OCTOBER 2024 16

  30. [31]

    Self-adversarial training incorporating forgery attention for image forgery localization,

    L. Zhuo, S. Tan, B. Li, and J. Huang, “Self-adversarial training incorporating forgery attention for image forgery localization,” IEEE Transactions on Information Forensics and Security , vol. 17, pp. 819– 834, 2022

  31. [32]

    Cat-net: Compression artifact tracing network for detection and localization of image splicing,

    M.-J. Kwon, I.-J. Yu, S.-H. Nam, and H.-K. Lee, “Cat-net: Compression artifact tracing network for detection and localization of image splicing,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 375–384

  32. [33]

    Pscc-net: Progressive spatio- channel correlation network for image manipulation detection and localization,

    X. Liu, Y . Liu, J. Chen, and X. Liu, “Pscc-net: Progressive spatio- channel correlation network for image manipulation detection and localization,” IEEE Transactions on Circuits and Systems for Video Technology, 2022

  33. [34]

    Hierarchical fine-grained image forgery detection and localization,

    X. Guo, X. Liu, Z. Ren, S. Grosz, I. Masi, and X. Liu, “Hierarchical fine-grained image forgery detection and localization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2023, pp. 3155–3165

  34. [35]

    BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models,

    J. Li, D. Li, S. Savarese, and S. C. H. Hoi, “BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models,” in International Conference on Machine Learning , ser. Proceedings of Machine Learning Research, A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, Eds., vol. 202, 2023, pp. 19 730–19 742

  35. [36]

    MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny, “MiniGPT-4: Enhancing vision-language understanding with advanced large language models,” arXiv preprint arXiv:2304.10592 , 2023

  36. [37]

    Osprey: Pixel understanding with visual instruction tuning,

    Y . Yuan, W. Li, J. Liu, D. Tang, X. Luo, C. Qin, L. Zhang, and J. Zhu, “Osprey: Pixel understanding with visual instruction tuning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 28 202–28 211

  37. [38]

    Anomalygpt: Detecting industrial anomalies using large vision-language models,

    Z. Gu, B. Zhu, G. Zhu, Y . Chen, M. Tang, and J. Wang, “Anomalygpt: Detecting industrial anomalies using large vision-language models,” in Proceedings of the AAAI Conference on Artificial Intelligence , vol. 38, no. 3, 2024, pp. 1932–1940

  38. [39]

    PandaGPT: One Model To Instruction-Follow Them All

    Y . Su, T. Lan, H. Li, J. Xu, Y . Wang, and D. Cai, “Pandagpt: One model to instruction-follow them all,” arXiv preprint arXiv:2305.16355 , 2023

  39. [40]

    Myriad: Large multimodal model by applying vision experts for industrial anomaly detection,

    Y . Li, H. Wang, S. Yuan, M. Liu, D. Zhao, Y . Guo, C. Xu, G. Shi, and W. Zuo, “Myriad: Large multimodal model by applying vision experts for industrial anomaly detection,” arXiv preprint arXiv:2310.19070 , 2023

  40. [41]

    Sniffer: Multimodal large lan- guage model for explainable out-of-context misinformation detection,

    P. Qi, Z. Yan, W. Hsu, and M. L. Lee, “Sniffer: Multimodal large lan- guage model for explainable out-of-context misinformation detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 13 052–13 062

  41. [42]

    InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

    W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P. Fung, and S. C. H. Hoi, “InstructBLIP: Towards general-purpose vision- language models with instruction tuning,” CoRR, vol. abs/2305.06500,

  42. [43]
  43. [44]

    Learning transferable visual models from natural language supervision,

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning . PMLR, 2021, pp. 8748–8763

  44. [45]

    F-vlm: Open-vocabulary object detection upon frozen vision and language models,

    W. Kuo, Y . Cui, X. Gu, A. Piergiovanni, and A. Angelova, “F-vlm: Open-vocabulary object detection upon frozen vision and language models,” arXiv preprint arXiv:2209.15639 , 2022

  45. [46]

    Extract free dense labels from clip,

    C. Zhou, C. C. Loy, and B. Dai, “Extract free dense labels from clip,” in European Conference on Computer Vision. Springer, 2022, pp. 696– 712

  46. [47]

    Iterative prompt learning for unsupervised backlit image enhancement,

    Z. Liang, C. Li, S. Zhou, R. Feng, and C. C. Loy, “Iterative prompt learning for unsupervised backlit image enhancement,” in Proceedings of the IEEE/CVF International Conference on Computer Vision , 2023, pp. 8094–8103

  47. [48]

    Exploring clip for assessing the look and feel of images,

    J. Wang, K. C. Chan, and C. C. Loy, “Exploring clip for assessing the look and feel of images,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 2, 2023, pp. 2555–2563

  48. [49]

    How can we know what language models know?

    Z. Jiang, F. F. Xu, J. Araki, and G. Neubig, “How can we know what language models know?” Transactions of the Association for Computational Linguistics, vol. 8, pp. 423–438, 2020

  49. [50]

    Language models are few-shot learners,

    T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert- V oss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Am...

  50. [51]

    Improving zero- shot generalization for clip with synthesized prompts,

    Z. Wang, J. Liang, R. He, N. Xu, Z. Wang, and T. Tan, “Improving zero- shot generalization for clip with synthesized prompts,” in Proceedings of the IEEE/CVF International Conference on Computer Vision , 2023, pp. 3032–3042

  51. [52]

    Vicuna: An open-source chatbot impressing gpt- 4 with 90%* chatgpt quality,

    W.-L. Chiang, Z. Li, Z. Lin, Y . Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y . Zhuang, J. E. Gonzalez, I. Stoica, and E. P. Xing, “Vicuna: An open-source chatbot impressing gpt- 4 with 90%* chatgpt quality,” March 2023. [Online]. Available: https://lmsys.org/blog/2023-03-30-vicuna/

  52. [53]

    An image is worth 16x16 words: Trans- formers for image recognition at scale,

    A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Trans- formers for image recognition at scale,” in 9th International Conference on Learning Representations , 2021

  53. [54]

    Microsoft coco: Common objects in context,

    T.-Y . Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Doll ´ar, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European Conference on Computer Vision. Springer, 2014, pp. 740–755

  54. [55]

    Busternet: Detecting copy- move image forgery with source/target localization,

    Y . Wu, W. Abd-Almageed, and P. Natarajan, “Busternet: Detecting copy- move image forgery with source/target localization,” in Proceedings of the European Conference on Computer Vision , 2018, pp. 168–184

  55. [56]

    Recurrent feature reasoning for image inpainting,

    J. Li, N. Wang, L. Zhang, B. Du, and D. Tao, “Recurrent feature reasoning for image inpainting,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 7760–7768

  56. [57]

    Imd2020: A large-scale annotated dataset tailored for detecting manipulated images,

    A. Novozamsky, B. Mahdian, and S. Saic, “Imd2020: A large-scale annotated dataset tailored for detecting manipulated images,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision Workshops, 2020, pp. 71–80

  57. [58]

    Mfc datasets: Large-scale benchmark datasets for media forensic challenge evaluation,

    H. Guan, M. Kozak, E. Robertson, Y. Lee, A. N. Yates, A. Delgado, D. Zhou, T. Kheyrkhah, J. Smith, and J. Fiscus, “Mfc datasets: Large-scale benchmark datasets for media forensic challenge evaluation,” in 2019 IEEE Winter Applications of Computer Vision Workshops (WACVW). IEEE, 2019, pp. 63–72

  58. [59]

    Autosplice: A text-prompt manipulated image dataset for media forensics,

    S. Jia, M. Huang, Z. Zhou, Y. Ju, J. Cai, and S. Lyu, “Autosplice: A text-prompt manipulated image dataset for media forensics,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 893–903

  59. [60]

    Segment anything,

    A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo et al., “Segment anything,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 4015–4026

  60. [61]

    Woodpecker: Hallucination correction for multimodal large language models,

    S. Yin, C. Fu, S. Zhao, T. Xu, H. Wang, D. Sui, Y. Shen, K. Li, X. Sun, and E. Chen, “Woodpecker: Hallucination correction for multimodal large language models,” CoRR, vol. abs/2310.16045, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2310.16045

  61. [62]

    Localization of deep inpainting using high-pass fully convolutional network,

    H. Li and J. Huang, “Localization of deep inpainting using high-pass fully convolutional network,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 8301–8310

  62. [63]

    Generate, segment, and refine: Towards generic manipulation segmentation,

    P. Zhou, B.-C. Chen, X. Han, M. Najibi, A. Shrivastava, S.-N. Lim, and L. Davis, “Generate, segment, and refine: Towards generic manipulation segmentation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, 2020, pp. 13 058–13 065

  63. [64]

    Detecting image splicing using geometry invariants and camera characteristics consistency,

    Y.-F. Hsu and S.-F. Chang, “Detecting image splicing using geometry invariants and camera characteristics consistency,” in 2006 IEEE International Conference on Multimedia and Expo. IEEE, 2006, pp. 549–552

  64. [65]

    Exposing digital image forgeries by illumination color classification,

    T. J. De Carvalho, C. Riess, E. Angelopoulou, H. Pedrini, and A. de Rezende Rocha, “Exposing digital image forgeries by illumination color classification,” IEEE Transactions on Information Forensics and Security, vol. 8, no. 7, pp. 1182–1194, 2013

  65. [66]

    Multi-scale analysis strategies in prnu-based tampering localization,

    P. Korus and J. Huang, “Multi-scale analysis strategies in prnu-based tampering localization,” IEEE Transactions on Information Forensics and Security, vol. 12, no. 4, pp. 809–824, 2016

  66. [67]

    Openforensics: Large-scale challenging dataset for multi-face forgery detection and segmentation in-the-wild,

    T.-N. Le, H. H. Nguyen, J. Yamagishi, and I. Echizen, “Openforensics: Large-scale challenging dataset for multi-face forgery detection and segmentation in-the-wild,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10 117–10 127

  67. [68]

    Hybrid lstm and encoder–decoder architecture for detection of image forgeries,

    J. H. Bappy, C. Simons, L. Nataraj, B. Manjunath, and A. K. Roy-Chowdhury, “Hybrid lstm and encoder–decoder architecture for detection of image forgeries,” IEEE Transactions on Image Processing, vol. 28, no. 7, pp. 3286–3300, 2019

  68. [69]

    Rouge: A package for automatic evaluation of summaries,

    C.-Y. Lin, “Rouge: A package for automatic evaluation of summaries,” in Text Summarization Branches Out, 2004, pp. 74–81

  69. [70]

    Real-iad: A real-world multi-view dataset for benchmarking versatile industrial anomaly detection,

    C. Wang, W. Zhu, B.-B. Gao, Z. Gan, J. Zhang, Z. Gu, S. Qian, M. Chen, and L. Ma, “Real-iad: A real-world multi-view dataset for benchmarking versatile industrial anomaly detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 22 883–22 892