Toward Generalizable Forgery Detection and Reasoning
Pith reviewed 2026-05-22 22:23 UTC · model grok-4.3
The pith
FakeReasoning guides MLLMs to detect AI image forgeries and reason about their attributes by fusing CLIP semantics with DINO artifact maps.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A forgery-aware feature fusion module that injects DINO attention maps into an MLLM via cross-attention, together with a dual CLIP-DINO visual encoder and a classification probability mapper, allows the model to produce both accurate forgery detections and attribute-level reasoning that generalizes across generative models.
What carries the argument
Forgery-Aware Feature Fusion Module, which uses DINO attention maps and cross-attention to direct the MLLM to synthesis artifacts.
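A minimal sketch of how attention-map-guided cross-attention could work. The paper's exact formulation is not reproduced on this page, so everything here is an illustrative assumption rather than the authors' implementation: the function name `forgery_aware_cross_attention`, the choice to add the log of the DINO attention weight as a bias on the scores, the residual connection, and the omission of learned projection matrices.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def forgery_aware_cross_attention(clip_tokens, dino_tokens, dino_attention):
    """Fuse DINO patch features into CLIP tokens (hypothetical design).

    clip_tokens:    list of query vectors (semantic branch)
    dino_tokens:    list of key/value vectors (artifact branch)
    dino_attention: one non-negative weight per DINO patch
    """
    dim = len(clip_tokens[0])
    fused = []
    for query in clip_tokens:
        # Scaled dot-product score plus a log-bias, so patches with a
        # larger attention-map weight receive more mass (assumption).
        scores = [
            dot(query, key) / math.sqrt(dim) + math.log(weight + 1e-8)
            for key, weight in zip(dino_tokens, dino_attention)
        ]
        weights = softmax(scores)
        context = [
            sum(w * key[d] for w, key in zip(weights, dino_tokens))
            for d in range(dim)
        ]
        # Residual connection keeps the original CLIP semantics.
        fused.append([q + c for q, c in zip(query, context)])
    return fused
```

With a zero query and an attention map that puts all its weight on one patch, the fused token is essentially that patch's feature, which is the steering behaviour the summary describes.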
If this is right
- The unified FDR-Task formulation lets one model handle both binary detection and attribute-level explanation without separate saliency heads.
- Coupling language modeling with a classification probability mapper improves detection scores beyond language-only prompting.
- Training on the MMFR-Dataset yields measurable gains over prior methods on both in-distribution and cross-model test sets.
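One plausible reading of the classification probability mapper, sketched under the assumption that it compares the MLLM's logits for a "fake" and a "real" answer token. The function name `fake_probability`, the token names, and the two-way softmax are hypothetical; the page does not state how the paper's mapper is actually defined.

```python
import math


def fake_probability(token_logits, fake_token="fake", real_token="real"):
    """Map answer-token logits to P(fake) with a two-way softmax.

    token_logits: dict from token string to logit at the answer position.
    Tokens other than the two answer tokens are ignored (assumption).
    """
    diff = token_logits[fake_token] - token_logits[real_token]
    # A two-way softmax equals a sigmoid of the logit difference;
    # branch on the sign so math.exp never overflows.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    scaled = math.exp(diff)
    return scaled / (1.0 + scaled)
```

A continuous score of this kind can be thresholded or used for ranking metrics, which plain text generation does not provide directly.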
Where Pith is reading between the lines
- The same attention-guided fusion idea could be tried on video or audio generators where low-level artifact maps are also available.
- If DINO maps prove generator-specific rather than universal, future work might need to learn generator-agnostic artifact detectors first.
Load-bearing premise
DINO attention maps consistently mark the synthesis artifacts that differ across generators so the fusion module can steer the MLLM to the right clues.
What would settle it
Test the trained model on images produced by an eleventh generator never seen in the 120K-image dataset; large drops in detection accuracy or reasoning quality would falsify the generalization claim.
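The test described above is a leave-one-generator-out protocol. A minimal sketch of building such splits follows, assuming each fake image is stored as an `(image_id, generator)` pair; the function name and data layout are hypothetical, and real images would need their own disjoint train/test split.

```python
def leave_one_generator_out(samples):
    """Yield (held_out, train, test) for every generator in the data.

    samples: list of (image_id, generator) tuples for fake images.
    The held-out generator contributes no images to the training set.
    """
    generators = sorted({generator for _, generator in samples})
    for held_out in generators:
        train = [s for s in samples if s[1] != held_out]
        test = [s for s in samples if s[1] == held_out]
        yield held_out, train, test
```

Each fold trains on the remaining generators and tests only on the excluded one, so a high test score cannot come from memorising that generator's fingerprint.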
Original abstract
Accurate and interpretable detection of AI-generated images is essential for mitigating risks associated with AI misuse. However, the substantial domain gap among generative models makes it challenging to develop a generalizable forgery detection model. Moreover, since every pixel in an AI-generated image is synthesized, traditional saliency-based forgery explanation methods are not well suited for this task. To address these challenges, we formulate detection and explanation as a unified Forgery Detection and Reasoning task (FDR-Task), leveraging Multi-Modal Large Language Models (MLLMs) to provide accurate detection through reliable reasoning over forgery attributes. To facilitate this task, we introduce the Multi-Modal Forgery Reasoning dataset (MMFR-Dataset), a large-scale dataset containing 120K images across 10 generative models, with 378K reasoning annotations on forgery attributes, enabling comprehensive evaluation of the FDR-Task. Furthermore, we propose FakeReasoning, a forgery detection and reasoning framework with three key components: 1) a dual-branch visual encoder that integrates CLIP and DINO to capture both high-level semantics and low-level artifacts; 2) a Forgery-Aware Feature Fusion Module that leverages DINO's attention maps and cross-attention mechanisms to guide MLLMs toward forgery-related clues; 3) a Classification Probability Mapper that couples language modeling and forgery detection, enhancing overall performance. Experiments across multiple generative models demonstrate that FakeReasoning not only achieves robust generalization but also outperforms state-of-the-art methods on both detection and reasoning tasks. The code is available at: https://github.com/PRIS-CV/FakeReasoning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper formulates forgery detection and explanation as a unified Forgery Detection and Reasoning (FDR) task. It introduces the MMFR-Dataset (120K images from 10 generative models, 378K reasoning annotations) and proposes FakeReasoning, which combines a dual-branch CLIP+DINO visual encoder, a Forgery-Aware Feature Fusion Module that uses DINO attention maps and cross-attention to guide MLLMs, and a Classification Probability Mapper that couples language modeling with detection. Experiments are claimed to show robust generalization and SOTA performance on both detection and reasoning across multiple generators.
Significance. If the generalization results hold under held-out generator evaluation, the work would be significant for addressing domain gaps in AI-image detection and for providing interpretable reasoning. The MMFR-Dataset and the unified detection+reasoning formulation are clear contributions; open-sourcing the code strengthens reproducibility. The dual-encoder plus fusion approach offers a concrete mechanism for injecting low-level artifact signals into MLLMs.
major comments (2)
- [Experiments] Experiments section (and abstract): the central claim of 'robust generalization' across generative models requires explicit held-out-generator splits. The MMFR-Dataset description covers 10 models but does not state whether any model is completely excluded from training; if all splits are image-level within the same 10 models, the reported outperformance and generalization could be explained by generator-specific artifacts rather than transferable forgery cues, directly undermining the strongest claim.
- [§3.2] §3.2, Forgery-Aware Feature Fusion Module: the module's contribution rests on the assumption that DINO attention maps reliably surface synthesis artifacts across generators. No quantitative ablation isolating the fusion module's effect on held-out models, nor cross-generator attention-map visualizations, is referenced; without this, the claimed guidance of MLLMs toward forgery clues cannot be verified as load-bearing.
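Both major comments come down to reporting accuracy per generator and comparing seen against held-out generators. A minimal sketch of that bookkeeping, assuming predictions are available as `(generator, predicted_label, true_label)` records; the function names and record layout are hypothetical. Running it once with the fusion module and once without would give the ablation the second comment asks for.

```python
from collections import defaultdict


def per_generator_accuracy(records):
    """Return {generator: accuracy} from (generator, pred, true) records."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for generator, predicted, true in records:
        total[generator] += 1
        correct[generator] += int(predicted == true)
    return {g: correct[g] / total[g] for g in total}


def generalization_gap(accuracy, held_out):
    """Mean accuracy on seen generators minus mean on held-out ones."""
    seen = [a for g, a in accuracy.items() if g not in held_out]
    unseen = [a for g, a in accuracy.items() if g in held_out]
    return sum(seen) / len(seen) - sum(unseen) / len(unseen)
```

A gap near zero would support the transferable-cue reading; a large positive gap would point toward generator-specific artifacts.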
minor comments (2)
- [Abstract] Abstract and §4: quantitative metrics, error bars, dataset split details, and baseline comparisons are referenced only at a high level; adding a table summarizing detection accuracy and reasoning metrics per held-out model would improve clarity.
- [§3] Notation: the dual-branch encoder and probability mapper are described with component names but without explicit equations for the cross-attention or mapper; adding a short mathematical formulation would aid reproducibility.
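For illustration, a formulation of the kind the last comment requests could look like the following. This is a sketch built from standard scaled dot-product attention and a two-way softmax; the symbols and the way the DINO map enters are assumptions, not the paper's notation or method.

```latex
% Illustrative notation only; none of these symbols are taken from the paper.
% f_i: CLIP token feature i, with query q_i derived from it
% k_j, v_j: key and value vectors for DINO patch j
% a_j > 0: DINO attention-map weight for patch j; d: feature dimension
\alpha_{ij} = \frac{a_j \exp\!\left(q_i^{\top} k_j / \sqrt{d}\right)}
                   {\sum_{l} a_l \exp\!\left(q_i^{\top} k_l / \sqrt{d}\right)},
\qquad
\tilde{f}_i = f_i + \sum_{j} \alpha_{ij}\, v_j

% z_fake, z_real: MLLM logits for the two answer tokens
p(\text{fake}) = \frac{\exp(z_{\text{fake}})}
                      {\exp(z_{\text{fake}}) + \exp(z_{\text{real}})}
```

Multiplying by a_j inside the softmax is equivalent to adding log a_j to the attention score, so patches the map highlights receive proportionally more weight.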
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the major comments below and will revise the manuscript to incorporate held-out generator evaluations and additional ablations.
Point-by-point responses
- Referee: [Experiments] Experiments section (and abstract): the central claim of 'robust generalization' across generative models requires explicit held-out-generator splits. The MMFR-Dataset description covers 10 models but does not state whether any model is completely excluded from training; if all splits are image-level within the same 10 models, the reported outperformance and generalization could be explained by generator-specific artifacts rather than transferable forgery cues, directly undermining the strongest claim.
  Authors: We acknowledge that the manuscript does not explicitly describe held-out generator splits. To strengthen the generalization claim, we will add a new set of experiments in the revision that train on a subset of the 10 models and evaluate on completely excluded generators, reporting detection and reasoning metrics to demonstrate transferable forgery cues.
  Revision: yes
- Referee: [§3.2] §3.2, Forgery-Aware Feature Fusion Module: the module's contribution rests on the assumption that DINO attention maps reliably surface synthesis artifacts across generators. No quantitative ablation isolating the fusion module's effect on held-out models, nor cross-generator attention-map visualizations, is referenced; without this, the claimed guidance of MLLMs toward forgery clues cannot be verified as load-bearing.
  Authors: We agree that isolating the fusion module's contribution requires further evidence. In the revision we will add quantitative ablations of the Forgery-Aware Feature Fusion Module evaluated specifically on held-out generators, together with cross-generator attention-map visualizations, to verify that the module reliably guides the MLLM toward forgery-related clues.
  Revision: yes
Circularity Check
No circularity; new dataset and additive modules evaluated empirically on external backbones
full rationale
The paper formulates a new FDR-Task, releases the MMFR-Dataset (120K images, 378K annotations across 10 generators), and defines FakeReasoning via three explicitly additive components (dual CLIP+DINO encoder, DINO-guided fusion module, probability mapper) that operate on publicly available CLIP/DINO/MLLM models. No equations, parameters, or central claims are shown to reduce by construction to fitted inputs or self-citations; performance and generalization statements are presented as outcomes of experiments rather than tautological re-statements of the inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: MLLMs can reliably reason over forgery attributes when guided by appropriate visual features from CLIP and DINO.
Forward citations
Cited by 1 Pith paper
- IncreFA: Breaking the Static Wall of Generative Model Attribution
  IncreFA uses hierarchical constraints with learnable orthogonal priors and a latent memory bank to enable continual adaptation for attributing images to new generative models, reporting SOTA accuracy and 98.93% unseen...