arxiv: 2512.16300 · v2 · submitted 2025-12-18 · 💻 cs.AI

Code-in-the-Loop Forensics: Agentic Tool Use for Image Forgery Detection

Fanrui Zhang, Qiang Zhang, Sizhuo Zhou, Jianwen Sun, Chuanhao Li, Jiaxin Ai, Yukang Feng, Yujie Zhang, Wenjie Li, Zizhen Li, Yifan Chang, Jiawei Liu, Kaipeng Zhang

Pith reviewed 2026-05-16 21:37 UTC · model grok-4.3

classification 💻 cs.AI

keywords image forgery detection · multimodal large language models · agentic tool use · code generation · forensic analysis · FABench dataset

The pith

Multimodal language models can generate and run their own Python tools to detect image forgeries by combining low-level artifacts with semantic reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ForenAgent, an interactive framework that lets multimodal large language models autonomously create, execute, and refine custom Python code for analyzing image forgeries. Existing detection methods focus either on low-level statistical traces or on high-level semantic understanding, and struggle to combine the two. ForenAgent addresses this with a dynamic reasoning loop of global perception, local focusing, iterative probing, and holistic adjudication; the model is trained first on cold-start examples and then by reinforcement fine-tuning with task-aligned process rewards. It is evaluated on the new FABench dataset of 100k images and approximately 200k agent-interaction question-answer pairs. A sympathetic reader would care because the approach promises more adaptable and interpretable forgery detection that can adjust its own analysis tools to each case.

Core claim

ForenAgent is a multi-round interactive IFD framework in which MLLMs generate, execute, and iteratively refine Python-based low-level tools around the detection objective, following a two-stage training pipeline of Cold Start and Reinforcement Fine-Tuning together with a dynamic reasoning loop of global perception, local focusing, iterative probing, and holistic adjudication, resulting in emergent tool-use competence and reflective reasoning on challenging forgery tasks when assisted by low-level code.

What carries the argument

ForenAgent's code-in-the-loop mechanism, in which the model autonomously writes and runs executable Python scripts for low-level image analysis inside a multi-turn interaction loop.
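A minimal sketch of what such a generate, execute, refine loop could look like. It is not the authors' implementation: `run_tool`, `code_in_the_loop`, and the stub model are hypothetical names, the "model" is a hard-coded function standing in for an MLLM, and plain `exec` is used only for illustration (it is not a sandbox).

```python
import traceback


def run_tool(source, image):
    """Execute model-written source that is expected to define analyze(image).

    Returns (ok, output). On failure, output is a short error message that
    can be fed back to the model. NOTE: exec() is not a sandbox.
    """
    namespace = {}
    try:
        exec(source, namespace)
        return True, namespace["analyze"](image)
    except Exception:
        return False, traceback.format_exc(limit=1)


def code_in_the_loop(propose_code, image, max_rounds=3):
    """Ask for code, run it, and pass errors back until it runs or rounds end."""
    feedback = None
    history = []
    for _ in range(max_rounds):
        source = propose_code(feedback)
        ok, output = run_tool(source, image)
        history.append((source, ok, output))
        if ok:
            return output, history
        feedback = output  # the error text becomes the next round's feedback
    return None, history


def stub_model(feedback):
    """Hypothetical stand-in for an MLLM: buggy first, fixed after feedback."""
    if feedback is None:
        return "def analyze(img):\n    return sum(img) / len(im)\n"  # NameError
    return "def analyze(img):\n    return sum(img) / len(img)\n"


# A toy "image" as a flat list of pixel values.
result, history = code_in_the_loop(stub_model, [2, 4, 6])
print(result, len(history))
```

The returned `history` keeps every attempt, including the failed one, which is the kind of record an auditor would need to see how the final tool was reached.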

If this is right

Low-level artifact signals and high-level semantic knowledge become usable within one unified detection process.
Forensic decisions gain step-by-step interpretability through the generated code and the model's reasoning trace.
Detection performance improves on heterogeneous forgery types because tools can be adapted on the fly to each image.
A route opens toward general-purpose IFD systems that do not require separate hand-crafted pipelines for each forgery category.
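The interpretability point above depends on the generated code and reasoning being recorded step by step. A hedged sketch of one possible record format follows; the class names, field names, and example values are all hypothetical placeholders, and only the four stage names come from the paper.

```python
import json
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    # Stage names follow the paper's dynamic reasoning loop.
    GLOBAL_PERCEPTION = "global_perception"
    LOCAL_FOCUSING = "local_focusing"
    ITERATIVE_PROBING = "iterative_probing"
    HOLISTIC_ADJUDICATION = "holistic_adjudication"


@dataclass
class Step:
    stage: Stage
    reasoning: str
    code: str = ""         # generated tool source, if any
    tool_output: str = ""  # what the tool returned, if it ran


@dataclass
class Trace:
    image_id: str
    steps: list[Step] = field(default_factory=list)
    verdict: str = "undecided"

    def to_json(self) -> str:
        """Serialize the whole trace so it can be stored and audited."""
        return json.dumps(
            {
                "image_id": self.image_id,
                "verdict": self.verdict,
                "steps": [
                    {
                        "stage": s.stage.value,
                        "reasoning": s.reasoning,
                        "code": s.code,
                        "tool_output": s.tool_output,
                    }
                    for s in self.steps
                ],
            },
            indent=2,
        )


# Placeholder content, for illustration only.
trace = Trace(image_id="example-001")
trace.steps.append(Step(Stage.GLOBAL_PERCEPTION, "No obvious seams at full view."))
trace.steps.append(
    Step(
        Stage.ITERATIVE_PROBING,
        "Inspect the high-frequency residual.",
        code="residual = high_pass(img)",
        tool_output="(placeholder output)",
    )
)
trace.verdict = "synthetic"
log = trace.to_json()
print(log)
```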

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same agent pattern could be tested on video or audio manipulation detection by swapping the low-level analysis primitives.
Content-moderation platforms might incorporate such agents to produce auditable code logs for disputed media items.
Training data efficiency could rise if the reinforcement stage is replaced by direct optimization against existing forensic benchmarks.

Load-bearing premise

Multimodal large language models can reliably produce correct, executable Python code for low-level image analysis without introducing errors or hallucinations that would invalidate the detection outcome.

What would settle it

A controlled test set of known forged images on which ForenAgent either generates non-executable code or returns detection decisions that systematically disagree with established ground-truth labels.
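The test described above can be sketched as a small report over logged runs. This is an illustration under stated assumptions: the record layout `(generated_source, predicted_label, true_label)` and the function names are hypothetical, and `compile()` only catches syntax errors, so runtime failures would need actual execution to detect.

```python
def syntax_ok(source):
    """Return True if the generated source at least parses as Python."""
    try:
        compile(source, "<generated>", "exec")
        return True
    except SyntaxError:
        return False


def settle_report(records):
    """Summarize how often code fails to parse and how often verdicts are wrong.

    records: iterable of (generated_source, predicted_label, true_label).
    """
    records = list(records)
    if not records:
        raise ValueError("need at least one record")
    n = len(records)
    non_exec = sum(not syntax_ok(src) for src, _, _ in records)
    wrong = sum(pred != truth for _, pred, truth in records)
    return {
        "n": n,
        "non_executable_rate": non_exec / n,
        "disagreement_rate": wrong / n,
    }


# Toy records, invented for illustration.
records = [
    ("def analyze(img):\n    return 1\n", "forged", "forged"),
    ("def analyze(img) return 1\n", "forged", "forged"),  # syntax error
    ("def analyze(img):\n    return 0\n", "real", "forged"),  # wrong verdict
    ("def analyze(img):\n    return 1\n", "forged", "forged"),
]
report = settle_report(records)
print(report)
```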

Figures

Figures reproduced from arXiv: 2512.16300 by Chuanhao Li, Fanrui Zhang, Jianwen Sun, Jiawei Liu, Jiaxin Ai, Kaipeng Zhang, Qiang Zhang, Sizhuo Zhou, Wenjie Li, Yifan Chang, Yujie Zhang, Yukang Feng, Zizhen Li.

**Figure 2.** The overall architecture of the ForenAgent is illustrated, with the upper part showing the FABench construction process and the ... [PITH_FULL_IMAGE:figures/full_fig_p003_2.png]

**Figure 3.** Examples of tampered and synthetic images from diverse FABench generators. [PITH_FULL_IMAGE:figures/full_fig_p004_3.png]

**Figure 4.** The complete evidence chain by which ForenAgent correctly identifies a synthetic image. [PITH_FULL_IMAGE:figures/full_fig_p008_4.png]

**Figure 6.** Low-level forensic tool usage frequency distribution. [PITH_FULL_IMAGE:figures/full_fig_p008_6.png]

Original abstract

Existing image forgery detection (IFD) methods either exploit low-level, semantics-agnostic artifacts or rely on multimodal large language models (MLLMs) with high-level semantic knowledge. Although naturally complementary, these two information streams are highly heterogeneous in both paradigm and reasoning, making it difficult for existing methods to unify them or effectively model their cross-level interactions. To address this gap, we propose ForenAgent, a multi-round interactive IFD framework that enables MLLMs to autonomously generate, execute, and iteratively refine Python-based low-level tools around the detection objective, thereby achieving more flexible and interpretable forgery analysis. ForenAgent follows a two-stage training pipeline combining Cold Start and Reinforcement Fine-Tuning to enhance its tool interaction capability and reasoning adaptability progressively. Inspired by human reasoning, we design a dynamic reasoning loop comprising global perception, local focusing, iterative probing, and holistic adjudication, and instantiate it as both a data-sampling strategy and a task-aligned process reward. For systematic training and evaluation, we construct FABench, a heterogeneous, high-quality agent-forensics dataset comprising 100k images and approximately 200k agent-interaction question-answer pairs. Experiments show that ForenAgent exhibits emergent tool-use competence and reflective reasoning on challenging IFD tasks when assisted by low-level tools, charting a promising route toward general-purpose IFD. The code will be released after the review process is completed.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

ForenAgent has an MLLM generate and run its own Python tools for forgery detection inside a dynamic loop, which is a distinct setup, but the results stand or fall on whether that code generation stays reliable.

The paper's core move is to let an MLLM autonomously write, execute, and refine low-level Python code for artifact checks while it also does semantic reasoning. That integration of tool creation with a multi-round loop is the main novelty, and it differs from both pure low-level detectors and standard MLLM prompting in the IFD literature they cite. They back it with a two-stage pipeline (cold start then reinforcement fine-tuning) and a dynamic loop that moves from global perception to local probing and final adjudication, plus the FABench dataset of 100k images and 200k interaction pairs. Those pieces give the work a concrete foundation that others could build on or test against.

Referee Report

3 major / 1 minor

Summary. The paper introduces ForenAgent, a multi-round interactive framework for image forgery detection (IFD) in which multimodal LLMs autonomously generate, execute, and iteratively refine Python-based low-level analysis tools. It employs a two-stage training pipeline (Cold Start followed by Reinforcement Fine-Tuning), instantiates a dynamic reasoning loop of global perception, local focusing, iterative probing, and holistic adjudication, and constructs the FABench dataset containing 100k images and ~200k agent-interaction QA pairs. The central claim is that this agentic setup produces emergent tool-use competence and reflective reasoning on challenging IFD tasks.

Significance. If the experimental claims are substantiated with quantitative evidence, the work would be significant for bridging heterogeneous low-level artifact cues and high-level semantic reasoning within a single agentic loop, offering a more flexible and interpretable alternative to existing IFD pipelines. The public release of code and the large-scale FABench dataset would further increase its value to the community.

major comments (3)

[Experiments] Experiments section: The abstract asserts that 'Experiments show that ForenAgent exhibits emergent tool-use competence and reflective reasoning,' yet the manuscript provides no quantitative metrics (accuracy, F1, AUC), baseline comparisons, ablation results on the Cold Start vs. RFT stages, or statistics on tool-call success rate, syntax/runtime error frequency, or cases where erroneous code still produces a final detection. Without these data the central claim cannot be evaluated.
[Method] Method, dynamic reasoning loop: The process reward used in RFT is described only at the level of 'task-aligned process reward' derived from the four-stage loop; no explicit formulation, weighting of stages, or verification that the reward actually correlates with forgery-detection correctness is supplied. This is load-bearing for the claim that reflective refinement occurs.
[Dataset] Dataset construction: FABench is stated to contain 100k images and ~200k agent-interaction QA pairs, but no protocol is given for how the QA pairs were generated, how tool-execution feedback was filtered for correctness, or what quality-control steps were taken to avoid training on hallucinated or buggy code. This directly affects the validity of the reported emergent competence.
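The metrics named in the first major comment (accuracy, F1, AUC) are standard and can be computed without any library. The sketch below is a generic illustration, not the paper's evaluation code; it assumes binary labels with 1 for forged and 0 for authentic, and the example arrays are invented.

```python
def confusion(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def f1(y_true, y_pred):
    tp, fp, fn, _ = confusion(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def auc(y_true, scores):
    """Rank-based AUC: the fraction of (positive, negative) pairs ordered
    correctly by score, counting ties as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented toy data: 1 = forged, 0 = authentic.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1]
scores = [0.9, 0.4, 0.2, 0.1, 0.8, 0.6]
print(accuracy(y_true, y_pred), f1(y_true, y_pred), auc(y_true, scores))
```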

minor comments (1)

[Abstract] The abstract and method descriptions repeatedly use the phrase 'emergent tool-use competence' without a precise operational definition or reference to prior literature on emergence in agentic systems.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating revisions that will be incorporated into the next version of the manuscript.

Referee: [Experiments] Experiments section: The abstract asserts that 'Experiments show that ForenAgent exhibits emergent tool-use competence and reflective reasoning,' yet the manuscript provides no quantitative metrics (accuracy, F1, AUC), baseline comparisons, ablation results on the Cold Start vs. RFT stages, or statistics on tool-call success rate, syntax/runtime error frequency, or cases where erroneous code still produces a final detection. Without these data the central claim cannot be evaluated.

Authors: We agree that the Experiments section would benefit from more explicit and consolidated quantitative evidence. While the manuscript does contain baseline comparisons, accuracy/F1 results, and some ablation analysis on the two-stage pipeline, we will expand this section in revision to include a dedicated summary table with accuracy, F1, AUC, tool-call success rates, syntax/runtime error frequencies, and analysis of cases where erroneous code still yields correct final detections. Ablation results contrasting Cold Start and RFT stages will also be presented more prominently. revision: yes
Referee: [Method] Method, dynamic reasoning loop: The process reward used in RFT is described only at the level of 'task-aligned process reward' derived from the four-stage loop; no explicit formulation, weighting of stages, or verification that the reward actually correlates with forgery-detection correctness is supplied. This is load-bearing for the claim that reflective refinement occurs.

Authors: We acknowledge that the current description of the process reward is high-level. In the revised manuscript we will add an explicit mathematical formulation of the task-aligned process reward, including the weighting scheme across the four stages (global perception, local focusing, iterative probing, holistic adjudication) and empirical verification (e.g., correlation plots or ablation) demonstrating that the reward aligns with final forgery-detection correctness. revision: yes
Referee: [Dataset] Dataset construction: FABench is stated to contain 100k images and ~200k agent-interaction QA pairs, but no protocol is given for how the QA pairs were generated, how tool-execution feedback was filtered for correctness, or what quality-control steps were taken to avoid training on hallucinated or buggy code. This directly affects the validity of the reported emergent competence.

Authors: We agree that additional transparency on dataset construction is required. The revised Dataset section will include a detailed protocol describing how the ~200k QA pairs were generated, the criteria and filtering steps applied to tool-execution feedback for correctness, and the quality-control procedures used to reduce hallucinated or buggy code in the training data. revision: yes
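The second exchange turns on how the process reward is weighted. The paper passage quoted in the Lean-theorem section of this page gives the tool reward as a weighted sum of four terms (global, logic, crop, coh). The sketch below shows only that arithmetic; the weight values and per-term scores are hypothetical placeholders, since the page does not state them.

```python
TERMS = ("global", "logic", "crop", "coh")


def tool_reward(scores, weights):
    """Weighted sum over the four reward terms for one trajectory.

    scores, weights: mappings from term name to float. Every term in TERMS
    must be present in both.
    """
    missing = [t for t in TERMS if t not in scores or t not in weights]
    if missing:
        raise KeyError(f"missing terms: {missing}")
    return sum(weights[t] * scores[t] for t in TERMS)


# Hypothetical equal weights and made-up per-term scores.
weights = {"global": 0.25, "logic": 0.25, "crop": 0.25, "coh": 0.25}
scores = {"global": 1.0, "logic": 0.5, "crop": 0.0, "coh": 1.0}
reward = tool_reward(scores, weights)
print(reward)
```

Checking whether such a reward tracks detection correctness, as the referee asks, would amount to comparing `reward` against the correctness of the final verdict across many trajectories.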

Circularity Check

0 steps flagged

No circularity; new framework and dataset are independent contributions

The paper introduces ForenAgent as a novel multi-round interactive framework with a two-stage Cold Start + RFT pipeline and a dynamic reasoning loop (global perception, local focusing, iterative probing, holistic adjudication), instantiated via a new heterogeneous dataset FABench of 100k images and 200k QA pairs. No equations, fitted parameters, or self-referential definitions appear in the provided text. The central claims of emergent tool-use competence rest on experimental outcomes from this new setup rather than reducing by construction to prior inputs, self-citations, or renamed known results. This is a self-contained methodological proposal with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the domain assumption that MLLMs possess sufficient code-generation capability to produce reliable low-level forensic tools and that the proposed reasoning loop can be effectively aligned with task rewards.

axioms (1)

domain assumption: Multimodal LLMs can generate executable and useful Python code for low-level image processing and artifact analysis.
Invoked throughout the tool-generation and iterative-probing stages of the framework.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Relation between the paper passage and the cited Recognition theorem.

ForenAgent follows a two-stage training pipeline combining Cold Start and Reinforcement Fine-Tuning... dynamic reasoning loop comprising global perception, local focusing, iterative probing, and holistic adjudication... $R_{\text{tool}}(\tau) = \lambda_{\text{global}} R_{\text{global}}(\tau) + \lambda_{\text{logic}} R_{\text{logic}}(\tau) + \lambda_{\text{crop}} R_{\text{crop}}(\tau) + \lambda_{\text{coh}} R_{\text{coh}}(\tau)$
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

Relation between the paper passage and the cited Recognition theorem.

We abstract and generalize several commonly used low-level Python tools for IFD, such as frequency residual, noise residual, and high-pass filtering... 12 candidate utilities
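To make "high-pass filtering" concrete, here is a small sketch of a generic Laplacian-style high-pass filter in plain Python. It is a textbook kernel chosen for illustration, not one of the paper's 12 utilities; the function name and the toy image are assumptions.

```python
def high_pass(image):
    """Apply a 4-neighbour Laplacian high-pass to a 2D list of grayscale values.

    Each interior pixel becomes 4 * centre minus the sum of its four
    neighbours. Border pixels are left at 0. Flat regions map to 0, so what
    remains is edge and noise content.
    """
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            neighbours = (
                image[y - 1][x] + image[y + 1][x]
                + image[y][x - 1] + image[y][x + 1]
            )
            out[y][x] = 4 * image[y][x] - neighbours
    return out


# Toy 5x5 image: flat background of 10 with one bright pixel in the centre.
img = [[10] * 5 for _ in range(5)]
img[2][2] = 50
residual = high_pass(img)
for row in residual:
    print(row)
```

On this toy input the flat background cancels out and only the isolated bright pixel and its immediate neighbours produce a response, which is the property that makes such residuals useful for spotting local inconsistencies.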

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
