pith. machine review for the scientific record.

arxiv: 2512.16300 · v2 · submitted 2025-12-18 · 💻 cs.AI

Code-in-the-Loop Forensics: Agentic Tool Use for Image Forgery Detection

Pith reviewed 2026-05-16 21:37 UTC · model grok-4.3

classification 💻 cs.AI
keywords image forgery detection · multimodal large language models · agentic tool use · code generation · forensic analysis · FABench dataset

The pith

Multimodal language models can generate and run their own Python tools to detect image forgeries by combining low-level artifacts with semantic reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ForenAgent, an interactive framework that lets multimodal large language models autonomously create, execute, and refine custom Python code for analyzing image forgeries. Existing detection methods focus either on low-level statistical traces or on high-level semantic understanding, and struggle to combine the two. ForenAgent addresses this by embedding a dynamic reasoning loop of global perception, local focusing, iterative probing, and final adjudication, trained first on cold-start examples and then by reinforcement fine-tuning with task-aligned rewards. It is evaluated on the new FABench dataset of 100k images and approximately 200k agent-interaction question-answer pairs. A sympathetic reader would care because the approach promises more adaptable and interpretable forgery detection that can adjust its own analysis tools to each case.
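To make the four-stage loop concrete, here is a minimal Python sketch of a checker that tests whether an interaction trace follows it. The stage names come from the paper; the `Stage` enum, the `follows_loop` function, and the rule that only the probing stage may repeat are assumptions made for illustration, not the authors' specification.

```python
from enum import IntEnum


class Stage(IntEnum):
    """The four stages named in the paper, in their stated order."""
    GLOBAL_PERCEPTION = 0
    LOCAL_FOCUSING = 1
    ITERATIVE_PROBING = 2
    HOLISTIC_ADJUDICATION = 3


def follows_loop(trace):
    """Return True if a trace walks the stages in order.

    Assumed rules (hypothetical): the trace starts with global perception,
    ends with adjudication, never skips or revisits an earlier stage, and
    only the probing stage may repeat.
    """
    if not trace:
        return False
    if trace[0] is not Stage.GLOBAL_PERCEPTION:
        return False
    if trace[-1] is not Stage.HOLISTIC_ADJUDICATION:
        return False
    for prev, cur in zip(trace, trace[1:]):
        step = cur - prev
        if step == 1:
            continue  # advance to the next stage
        if step == 0 and cur is Stage.ITERATIVE_PROBING:
            continue  # another probing round
        return False
    return True


good = [Stage.GLOBAL_PERCEPTION, Stage.LOCAL_FOCUSING,
        Stage.ITERATIVE_PROBING, Stage.ITERATIVE_PROBING,
        Stage.HOLISTIC_ADJUDICATION]
skipped = [Stage.GLOBAL_PERCEPTION, Stage.ITERATIVE_PROBING,
           Stage.HOLISTIC_ADJUDICATION]
print(follows_loop(good), follows_loop(skipped))
```

A check of this kind could serve either as a filter when sampling training traces or as one ingredient of a process reward, which are the two uses the abstract mentions for the loop.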

Core claim

ForenAgent is a multi-round interactive IFD framework in which MLLMs generate, execute, and iteratively refine Python-based low-level tools around the detection objective, following a two-stage training pipeline of Cold Start and Reinforcement Fine-Tuning together with a dynamic reasoning loop of global perception, local focusing, iterative probing, and holistic adjudication, resulting in emergent tool-use competence and reflective reasoning on challenging forgery tasks when assisted by low-level code.

What carries the argument

ForenAgent's code-in-the-loop mechanism, in which the model autonomously writes and runs executable Python scripts for low-level image analysis inside a multi-turn interaction loop.
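A rough sketch of what a code-in-the-loop round might look like follows. It is not the authors' implementation: `run_tool`, `code_in_the_loop`, and `toy_propose` are hypothetical names, the model is replaced by a stub that returns fixed code strings, and the code runs in-process with `exec` purely for brevity (a real system would need an isolated sandbox).

```python
import contextlib
import io
import traceback


def run_tool(code):
    """Execute a code string and return (ok, feedback).

    Feedback is the captured stdout on success or a short traceback on
    failure. NOTE: exec() here has no sandboxing; illustration only.
    """
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {"__name__": "__tool__"})
    except Exception:
        return False, traceback.format_exc(limit=1)
    return True, buffer.getvalue()


def code_in_the_loop(propose, max_rounds=3):
    """Ask for code, run it, and feed the result back until a run succeeds.

    `propose(history)` stands in for the MLLM (hypothetical interface): it
    receives the list of (code, ok, feedback) tuples so far and returns code.
    """
    history = []
    for _ in range(max_rounds):
        code = propose(history)
        ok, feedback = run_tool(code)
        history.append((code, ok, feedback))
        if ok:
            break
    return history


def toy_propose(history):
    """Stub model: the first attempt has a bug, the retry fixes it."""
    if not history:
        return "pixels = [10, 12, 200, 11]\nprint(sum(pixels) / count)"
    return "pixels = [10, 12, 200, 11]\nprint(sum(pixels) / len(pixels))"


history = code_in_the_loop(toy_propose)
for code, ok, feedback in history:
    print(ok, feedback.strip().splitlines()[-1])
```

The point of the sketch is the feedback path: the failed first attempt is not discarded but returned to the proposer, which is what allows refinement across rounds.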

If this is right

  • Low-level artifact signals and high-level semantic knowledge become usable within one unified detection process.
  • Forensic decisions gain step-by-step interpretability through the generated code and the model's reasoning trace.
  • Detection performance improves on heterogeneous forgery types because tools can be adapted on the fly to each image.
  • A route opens toward general-purpose IFD systems that do not require separate hand-crafted pipelines for each forgery category.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same agent pattern could be tested on video or audio manipulation detection by swapping the low-level analysis primitives.
  • Content-moderation platforms might incorporate such agents to produce auditable code logs for disputed media items.
  • Training data efficiency could rise if the reinforcement stage is replaced by direct optimization against existing forensic benchmarks.

Load-bearing premise

Multimodal large language models can reliably produce correct, executable Python code for low-level image analysis without introducing errors or hallucinations that would invalidate the detection outcome.

What would settle it

A controlled test set of known forged images on which ForenAgent either generates non-executable code or returns detection decisions that systematically disagree with established ground-truth labels.
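Such a test could be scored with something like the sketch below. The record schema (`code`, `pred`, `label`) and the function names are hypothetical, and the executability check only covers syntax via Python's built-in `compile`; runtime failures would need an actual sandboxed run.

```python
def compiles(code):
    """Return True if the code string is at least syntactically valid."""
    try:
        compile(code, "<generated_tool>", "exec")
        return True
    except (SyntaxError, ValueError):
        return False


def settle(records):
    """Summarise the two failure modes on a labelled test set.

    `records` is a list of dicts with keys 'code' (generated tool source),
    'pred' (the agent's decision) and 'label' (ground truth). This schema
    is an assumption for illustration.
    """
    n = len(records)
    if n == 0:
        raise ValueError("empty test set")
    non_exec = sum(not compiles(r["code"]) for r in records)
    disagree = sum(r["pred"] != r["label"] for r in records)
    return {
        "non_executable_rate": non_exec / n,
        "disagreement_rate": disagree / n,
    }


records = [
    {"code": "x = 1", "pred": "fake", "label": "fake"},
    {"code": "def f(:", "pred": "real", "label": "fake"},
    {"code": "print('ok')", "pred": "real", "label": "real"},
    {"code": "y = [1, 2]", "pred": "fake", "label": "real"},
]
report = settle(records)
print(report)
```

High values of either rate on a controlled set would count against the load-bearing premise; what threshold is "systematic" is left open here, as it is in the text above.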

Figures

Figures reproduced from arXiv: 2512.16300 by Chuanhao Li, Fanrui Zhang, Jianwen Sun, Jiawei Liu, Jiaxin Ai, Kaipeng Zhang, Qiang Zhang, Sizhuo Zhou, Wenjie Li, Yifan Chang, Yujie Zhang, Yukang Feng, Zizhen Li.

Figure 1: ForenAgent autonomously composes a global-to-local …
Figure 2: The overall architecture of the ForenAgent is illustrated, with the upper part showing the FABench construction process and the …
Figure 3: Examples of tampered and synthetic images from diverse FABench generators.
Figure 4: The complete evidence chain by which ForenAgent correctly identifies a synthetic image.
Figure 6: Low-level forensic tool usage frequency distribution
Original abstract

Existing image forgery detection (IFD) methods either exploit low-level, semantics-agnostic artifacts or rely on multimodal large language models (MLLMs) with high-level semantic knowledge. Although naturally complementary, these two information streams are highly heterogeneous in both paradigm and reasoning, making it difficult for existing methods to unify them or effectively model their cross-level interactions. To address this gap, we propose ForenAgent, a multi-round interactive IFD framework that enables MLLMs to autonomously generate, execute, and iteratively refine Python-based low-level tools around the detection objective, thereby achieving more flexible and interpretable forgery analysis. ForenAgent follows a two-stage training pipeline combining Cold Start and Reinforcement Fine-Tuning to enhance its tool interaction capability and reasoning adaptability progressively. Inspired by human reasoning, we design a dynamic reasoning loop comprising global perception, local focusing, iterative probing, and holistic adjudication, and instantiate it as both a data-sampling strategy and a task-aligned process reward. For systematic training and evaluation, we construct FABench, a heterogeneous, high-quality agent-forensics dataset comprising 100k images and approximately 200k agent-interaction question-answer pairs. Experiments show that ForenAgent exhibits emergent tool-use competence and reflective reasoning on challenging IFD tasks when assisted by low-level tools, charting a promising route toward general-purpose IFD. The code will be released after the review process is completed.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces ForenAgent, a multi-round interactive framework for image forgery detection (IFD) in which multimodal LLMs autonomously generate, execute, and iteratively refine Python-based low-level analysis tools. It employs a two-stage training pipeline (Cold Start followed by Reinforcement Fine-Tuning), instantiates a dynamic reasoning loop of global perception, local focusing, iterative probing, and holistic adjudication, and constructs the FABench dataset containing 100k images and ~200k agent-interaction QA pairs. The central claim is that this agentic setup produces emergent tool-use competence and reflective reasoning on challenging IFD tasks.

Significance. If the experimental claims are substantiated with quantitative evidence, the work would be significant for bridging heterogeneous low-level artifact cues and high-level semantic reasoning within a single agentic loop, offering a more flexible and interpretable alternative to existing IFD pipelines. The code release promised after review, together with the large-scale FABench dataset, would further increase its value to the community.

major comments (3)
  1. [Experiments] Experiments section: The abstract asserts that 'Experiments show that ForenAgent exhibits emergent tool-use competence and reflective reasoning,' yet the manuscript provides no quantitative metrics (accuracy, F1, AUC), baseline comparisons, ablation results on the Cold Start vs. RFT stages, or statistics on tool-call success rate, syntax/runtime error frequency, or cases where erroneous code still produces a final detection. Without these data the central claim cannot be evaluated.
  2. [Method] Method, dynamic reasoning loop: The process reward used in RFT is described only at the level of 'task-aligned process reward' derived from the four-stage loop; no explicit formulation, weighting of stages, or verification that the reward actually correlates with forgery-detection correctness is supplied. This is load-bearing for the claim that reflective refinement occurs.
  3. [Dataset] Dataset construction: FABench is stated to contain 100k images and ~200k agent-interaction QA pairs, but no protocol is given for how the QA pairs were generated, how tool-execution feedback was filtered for correctness, or what quality-control steps were taken to avoid training on hallucinated or buggy code. This directly affects the validity of the reported emergent competence.
minor comments (1)
  1. [Abstract] The abstract and method descriptions repeatedly use the phrase 'emergent tool-use competence' without a precise operational definition or reference to prior literature on emergence in agentic systems.
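For reference, the three detection metrics named in major comment 1 can be computed as in the dependency-free sketch below. The labels and scores are made-up toy values, not results from the paper, and the function names are illustrative; AUC is computed as the fraction of (forged, authentic) pairs ranked correctly, with ties counted as one half.

```python
def accuracy(labels, preds):
    """Fraction of predictions that match the labels."""
    return sum(l == p for l, p in zip(labels, preds)) / len(labels)


def f1_score(labels, preds, positive=1):
    """F1 for the positive (forged) class: 2TP / (2TP + FP + FN)."""
    pairs = list(zip(labels, preds))
    tp = sum(l == positive and p == positive for l, p in pairs)
    fp = sum(l != positive and p == positive for l, p in pairs)
    fn = sum(l == positive and p != positive for l, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def auc(labels, scores, positive=1):
    """Probability that a positive sample outscores a negative one."""
    pos = [s for l, s in zip(labels, scores) if l == positive]
    neg = [s for l, s in zip(labels, scores) if l != positive]
    if not pos or not neg:
        raise ValueError("AUC needs both classes")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (invented): 1 = forged, 0 = authentic.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
preds = [1 if s >= 0.5 else 0 for s in scores]

print(accuracy(labels, preds), f1_score(labels, preds), auc(labels, scores))
```

Accuracy and F1 depend on the decision threshold (0.5 here, an arbitrary choice), while AUC does not, which is one reason a referee might ask for all three.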

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating revisions that will be incorporated into the next version of the manuscript.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: The abstract asserts that 'Experiments show that ForenAgent exhibits emergent tool-use competence and reflective reasoning,' yet the manuscript provides no quantitative metrics (accuracy, F1, AUC), baseline comparisons, ablation results on the Cold Start vs. RFT stages, or statistics on tool-call success rate, syntax/runtime error frequency, or cases where erroneous code still produces a final detection. Without these data the central claim cannot be evaluated.

    Authors: We agree that the Experiments section would benefit from more explicit and consolidated quantitative evidence. While the manuscript does contain baseline comparisons, accuracy/F1 results, and some ablation analysis on the two-stage pipeline, we will expand this section in revision to include a dedicated summary table with accuracy, F1, AUC, tool-call success rates, syntax/runtime error frequencies, and analysis of cases where erroneous code still yields correct final detections. Ablation results contrasting Cold Start and RFT stages will also be presented more prominently. revision: yes

  2. Referee: [Method] Method, dynamic reasoning loop: The process reward used in RFT is described only at the level of 'task-aligned process reward' derived from the four-stage loop; no explicit formulation, weighting of stages, or verification that the reward actually correlates with forgery-detection correctness is supplied. This is load-bearing for the claim that reflective refinement occurs.

    Authors: We acknowledge that the current description of the process reward is high-level. In the revised manuscript we will add an explicit mathematical formulation of the task-aligned process reward, including the weighting scheme across the four stages (global perception, local focusing, iterative probing, holistic adjudication) and empirical verification (e.g., correlation plots or ablation) demonstrating that the reward aligns with final forgery-detection correctness. revision: yes

  3. Referee: [Dataset] Dataset construction: FABench is stated to contain 100k images and ~200k agent-interaction QA pairs, but no protocol is given for how the QA pairs were generated, how tool-execution feedback was filtered for correctness, or what quality-control steps were taken to avoid training on hallucinated or buggy code. This directly affects the validity of the reported emergent competence.

    Authors: We agree that additional transparency on dataset construction is required. The revised Dataset section will include a detailed protocol describing how the ~200k QA pairs were generated, the criteria and filtering steps applied to tool-execution feedback for correctness, and the quality-control procedures used to reduce hallucinated or buggy code in the training data. revision: yes
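As an illustration of the kind of formulation promised in response 2, a process reward could be written as a weighted sum of per-stage scores, and its alignment with correctness checked with a correlation coefficient. Everything specific in the sketch below is an assumption: the equal weights, the per-stage scores in [0, 1], the function names, and the toy numbers are invented, and the paper's actual reward may look quite different.

```python
from math import sqrt

STAGES = (
    "global_perception",
    "local_focusing",
    "iterative_probing",
    "holistic_adjudication",
)

# Hypothetical equal weighting; the paper does not state its weights.
WEIGHTS = {stage: 0.25 for stage in STAGES}


def process_reward(stage_scores, weights=WEIGHTS):
    """Weighted sum of per-stage scores; a missing stage scores zero."""
    return sum(weights[s] * stage_scores.get(s, 0.0) for s in STAGES)


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / sqrt(vx * vy)


# Toy trajectories (invented): stage scores and whether the verdict was right.
trajectories = [
    ({"global_perception": 1.0, "local_focusing": 1.0,
      "iterative_probing": 1.0, "holistic_adjudication": 1.0}, 1),
    ({"global_perception": 1.0, "local_focusing": 0.0,
      "iterative_probing": 0.0, "holistic_adjudication": 1.0}, 0),
    ({"global_perception": 1.0}, 0),
]
rewards = [process_reward(scores) for scores, _ in trajectories]
correct = [flag for _, flag in trajectories]
print(rewards, pearson(rewards, correct))
```

A correlation near zero or negative on held-out trajectories would suggest the reward is rewarding loop-following for its own sake, which is the concern behind the referee's request for verification.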

Circularity Check

0 steps flagged

No circularity; new framework and dataset are independent contributions

full rationale

The paper introduces ForenAgent as a novel multi-round interactive framework with a two-stage Cold Start + RFT pipeline and a dynamic reasoning loop (global perception, local focusing, iterative probing, holistic adjudication), trained and evaluated on a new heterogeneous dataset, FABench, of 100k images and approximately 200k QA pairs. No equations, fitted parameters, or self-referential definitions appear in the provided text. The central claims of emergent tool-use competence rest on experimental outcomes from this new setup rather than reducing by construction to prior inputs, self-citations, or renamed known results. This is a self-contained methodological proposal with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the domain assumption that MLLMs possess sufficient code-generation capability to produce reliable low-level forensic tools and that the proposed reasoning loop can be effectively aligned with task rewards.

axioms (1)
  • domain assumption: Multimodal LLMs can generate executable and useful Python code for low-level image processing and artifact analysis.
    Invoked throughout the tool-generation and iterative-probing stages of the framework.

pith-pipeline@v0.9.0 · 5597 in / 1285 out tokens · 30888 ms · 2026-05-16T21:37:39.471784+00:00 · methodology


