Code-in-the-Loop Forensics: Agentic Tool Use for Image Forgery Detection
Pith reviewed 2026-05-16 21:37 UTC · model grok-4.3
The pith
Multimodal language models can generate and run their own Python tools to detect image forgeries by combining low-level artifacts with semantic reasoning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ForenAgent is a multi-round interactive image forgery detection (IFD) framework in which MLLMs generate, execute, and iteratively refine Python-based low-level tools around the detection objective. It is trained with a two-stage pipeline (Cold Start, then Reinforcement Fine-Tuning) and reasons through a dynamic loop of global perception, local focusing, iterative probing, and holistic adjudication. The paper reports emergent tool-use competence and reflective reasoning on challenging forgery tasks when the model is assisted by low-level code.
What carries the argument
ForenAgent's code-in-the-loop mechanism, in which the model autonomously writes and runs executable Python scripts for low-level image analysis inside a multi-turn interaction loop.
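To make the mechanism concrete, here is a minimal sketch of one possible execute-and-retry loop. It is not the authors' implementation: `run_tool`, `probe`, and the `generate` callback (standing in for the model call) are hypothetical names, and a child interpreter process is used only for isolation of state, not as a security sandbox.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_tool(code: str, timeout_s: float = 10.0) -> dict:
    """Run generated code in a separate interpreter and capture the outcome."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "tool.py"
        script.write_text(code, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                text=True,
                timeout=timeout_s,
                cwd=workdir,
            )
        except subprocess.TimeoutExpired:
            return {"ok": False, "stdout": "", "stderr": "timeout"}
    return {"ok": proc.returncode == 0, "stdout": proc.stdout, "stderr": proc.stderr}


def probe(generate, max_rounds: int = 3) -> list:
    """Ask `generate` for code, run it, and feed results back until one run succeeds.

    `generate` is a hypothetical stand-in for the model: it receives the
    history of (code, result) pairs and returns the next script as a string.
    """
    history = []
    for _ in range(max_rounds):
        code = generate(history)
        result = run_tool(code)
        history.append((code, result))
        if result["ok"]:
            break
    return history
```

The history list doubles as an audit log: each round keeps both the script and its stdout/stderr, which is the kind of trace the interpretability claim depends on.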
If this is right
- Low-level artifact signals and high-level semantic knowledge become usable within one unified detection process.
- Forensic decisions gain step-by-step interpretability through the generated code and the model's reasoning trace.
- Detection performance improves on heterogeneous forgery types because tools can be adapted on the fly to each image.
- A route opens toward general-purpose IFD systems that do not require separate hand-crafted pipelines for each forgery category.
Where Pith is reading between the lines
- The same agent pattern could be tested on video or audio manipulation detection by swapping the low-level analysis primitives.
- Content-moderation platforms might incorporate such agents to produce auditable code logs for disputed media items.
- Training data efficiency could rise if the reinforcement stage is replaced by direct optimization against existing forensic benchmarks.
Load-bearing premise
Multimodal large language models can reliably produce correct, executable Python code for low-level image analysis without introducing errors or hallucinations that would invalidate the detection outcome.
What would settle it
A controlled test set of known forged images on which ForenAgent either generates non-executable code or returns detection decisions that systematically disagree with established ground-truth labels.
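One way such a test could be scored is sketched below. The record layout (generated code, predicted label, ground-truth label) and the names `compiles` and `settle` are assumptions for illustration. The check uses Python's built-in `compile`, so it only catches syntax-level failures; runtime errors would need actual execution.

```python
def compiles(code: str) -> bool:
    """Return True if `code` parses and compiles (a syntax-only proxy for 'executable')."""
    try:
        compile(code, "<generated>", "exec")
        return True
    except (SyntaxError, ValueError):
        return False


def settle(records) -> dict:
    """Summarize a labeled test set.

    `records` is an iterable of (generated_code, predicted_label, true_label)
    tuples; this layout is hypothetical, not taken from the paper.
    """
    records = list(records)
    if not records:
        raise ValueError("records must not be empty")
    n = len(records)
    non_executable = sum(not compiles(code) for code, _, _ in records)
    disagreements = sum(pred != true for _, pred, true in records)
    return {
        "non_executable_rate": non_executable / n,
        "disagreement_rate": disagreements / n,
    }
```

A high value on either rate would count against the load-bearing premise; reporting the two separately shows whether wrong decisions coincide with broken code.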
Original abstract
Existing image forgery detection (IFD) methods either exploit low-level, semantics-agnostic artifacts or rely on multimodal large language models (MLLMs) with high-level semantic knowledge. Although naturally complementary, these two information streams are highly heterogeneous in both paradigm and reasoning, making it difficult for existing methods to unify them or effectively model their cross-level interactions. To address this gap, we propose ForenAgent, a multi-round interactive IFD framework that enables MLLMs to autonomously generate, execute, and iteratively refine Python-based low-level tools around the detection objective, thereby achieving more flexible and interpretable forgery analysis. ForenAgent follows a two-stage training pipeline combining Cold Start and Reinforcement Fine-Tuning to enhance its tool interaction capability and reasoning adaptability progressively. Inspired by human reasoning, we design a dynamic reasoning loop comprising global perception, local focusing, iterative probing, and holistic adjudication, and instantiate it as both a data-sampling strategy and a task-aligned process reward. For systematic training and evaluation, we construct FABench, a heterogeneous, high-quality agent-forensics dataset comprising 100k images and approximately 200k agent-interaction question-answer pairs. Experiments show that ForenAgent exhibits emergent tool-use competence and reflective reasoning on challenging IFD tasks when assisted by low-level tools, charting a promising route toward general-purpose IFD. The code will be released after the review process is completed.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ForenAgent, a multi-round interactive framework for image forgery detection (IFD) in which multimodal LLMs autonomously generate, execute, and iteratively refine Python-based low-level analysis tools. It employs a two-stage training pipeline (Cold Start followed by Reinforcement Fine-Tuning), instantiates a dynamic reasoning loop of global perception, local focusing, iterative probing, and holistic adjudication, and releases the FABench dataset containing 100k images and ~200k agent-interaction QA pairs. The central claim is that this agentic setup produces emergent tool-use competence and reflective reasoning on challenging IFD tasks.
Significance. If the experimental claims are substantiated with quantitative evidence, the work would be significant for bridging heterogeneous low-level artifact cues and high-level semantic reasoning within a single agentic loop, offering a more flexible and interpretable alternative to existing IFD pipelines. The public release of code and the large-scale FABench dataset would further increase its value to the community.
major comments (3)
- [Experiments] Experiments section: The abstract asserts that 'Experiments show that ForenAgent exhibits emergent tool-use competence and reflective reasoning,' yet the manuscript provides no quantitative metrics (accuracy, F1, AUC), baseline comparisons, ablation results on the Cold Start vs. RFT stages, or statistics on tool-call success rate, syntax/runtime error frequency, or cases where erroneous code still produces a final detection. Without these data the central claim cannot be evaluated.
- [Method] Method, dynamic reasoning loop: The process reward used in RFT is described only at the level of 'task-aligned process reward' derived from the four-stage loop; no explicit formulation, weighting of stages, or verification that the reward actually correlates with forgery-detection correctness is supplied. This is load-bearing for the claim that reflective refinement occurs.
- [Dataset] Dataset construction: FABench is stated to contain 100k images and ~200k agent-interaction QA pairs, but no protocol is given for how the QA pairs were generated, how tool-execution feedback was filtered for correctness, or what quality-control steps were taken to avoid training on hallucinated or buggy code. This directly affects the validity of the reported emergent competence.
minor comments (1)
- [Abstract] The abstract and method descriptions repeatedly use the phrase 'emergent tool-use competence' without a precise operational definition or reference to prior literature on emergence in agentic systems.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating revisions that will be incorporated into the next version of the manuscript.
- Referee: [Experiments] Experiments section: The abstract asserts that 'Experiments show that ForenAgent exhibits emergent tool-use competence and reflective reasoning,' yet the manuscript provides no quantitative metrics (accuracy, F1, AUC), baseline comparisons, ablation results on the Cold Start vs. RFT stages, or statistics on tool-call success rate, syntax/runtime error frequency, or cases where erroneous code still produces a final detection. Without these data the central claim cannot be evaluated.
  Authors: We agree that the Experiments section would benefit from more explicit and consolidated quantitative evidence. While the manuscript does contain baseline comparisons, accuracy/F1 results, and some ablation analysis on the two-stage pipeline, we will expand this section in revision to include a dedicated summary table with accuracy, F1, AUC, tool-call success rates, syntax/runtime error frequencies, and analysis of cases where erroneous code still yields correct final detections. Ablation results contrasting Cold Start and RFT stages will also be presented more prominently.
  Revision: yes
- Referee: [Method] Method, dynamic reasoning loop: The process reward used in RFT is described only at the level of 'task-aligned process reward' derived from the four-stage loop; no explicit formulation, weighting of stages, or verification that the reward actually correlates with forgery-detection correctness is supplied. This is load-bearing for the claim that reflective refinement occurs.
  Authors: We acknowledge that the current description of the process reward is high-level. In the revised manuscript we will add an explicit mathematical formulation of the task-aligned process reward, including the weighting scheme across the four stages (global perception, local focusing, iterative probing, holistic adjudication) and empirical verification (e.g., correlation plots or ablation) demonstrating that the reward aligns with final forgery-detection correctness.
  Revision: yes
- Referee: [Dataset] Dataset construction: FABench is stated to contain 100k images and ~200k agent-interaction QA pairs, but no protocol is given for how the QA pairs were generated, how tool-execution feedback was filtered for correctness, or what quality-control steps were taken to avoid training on hallucinated or buggy code. This directly affects the validity of the reported emergent competence.
  Authors: We agree that additional transparency on dataset construction is required. The revised Dataset section will include a detailed protocol describing how the ~200k QA pairs were generated, the criteria and filtering steps applied to tool-execution feedback for correctness, and the quality-control procedures used to reduce hallucinated or buggy code in the training data.
  Revision: yes
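The process reward raised in the Method comment is quoted elsewhere on this page as a weighted sum of four terms (global, logic, crop, coh). The sketch below shows how such a sum and the requested correlation check could be computed. The weights, the per-term scores, and the function names are placeholders; the paper's actual values and term definitions are not given here.

```python
from math import sqrt

# Term names follow the subscripts in the quoted reward; their meaning is the paper's.
TERMS = ("global", "logic", "crop", "coh")


def process_reward(term_rewards: dict, weights: dict) -> float:
    """Weighted sum over the four reward terms for one trajectory."""
    return sum(weights[t] * term_rewards[t] for t in TERMS)


def pearson(xs, ys) -> float:
    """Pearson correlation, e.g. between rewards and 0/1 detection correctness."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)
```

With binary correctness labels, `pearson` reduces to the point-biserial correlation; a value near zero would suggest the reward is not tracking detection correctness.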
Circularity Check
No circularity; new framework and dataset are independent contributions
The paper introduces ForenAgent as a novel multi-round interactive framework with a two-stage Cold Start + RFT pipeline and a dynamic reasoning loop (global perception, local focusing, iterative probing, holistic adjudication), instantiated via a new heterogeneous dataset FABench of 100k images and 200k QA pairs. No equations, fitted parameters, or self-referential definitions appear in the provided text. The central claims of emergent tool-use competence rest on experimental outcomes from this new setup rather than reducing by construction to prior inputs, self-citations, or renamed known results. This is a self-contained methodological proposal with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Multimodal LLMs can generate executable and useful Python code for low-level image processing and artifact analysis.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: ForenAgent follows a two-stage training pipeline combining Cold Start and Reinforcement Fine-Tuning... dynamic reasoning loop comprising global perception, local focusing, iterative probing, and holistic adjudication... $R_{\text{tool}}(\tau) = \lambda_{\text{global}} R_{\text{global}}(\tau) + \lambda_{\text{logic}} R_{\text{logic}}(\tau) + \lambda_{\text{crop}} R_{\text{crop}}(\tau) + \lambda_{\text{coh}} R_{\text{coh}}(\tau)$
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: We abstract and generalize several commonly used low-level Python tools for IFD, such as frequency residual, noise residual, and high-pass filtering... 12 candidate utilities
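As an illustration of the kind of low-level tool the quoted passage mentions, the sketch below computes a simple high-pass residual by subtracting a box blur from a grayscale image. It is a dependency-free stand-in, not one of the paper's 12 utilities; the function name, the box-blur choice, and the border clamping are assumptions, and a practical tool would more likely use NumPy or OpenCV.

```python
def high_pass_residual(gray, size=3):
    """Return `gray` minus its size x size box blur.

    `gray` is a list of equal-length rows of numbers. Pixels outside the
    image are replaced by the nearest border pixel (clamping).
    """
    if size < 1 or size % 2 == 0:
        raise ValueError("size must be a positive odd integer")
    height, width = len(gray), len(gray[0])
    radius = size // 2
    residual = []
    for y in range(height):
        row = []
        for x in range(width):
            total = 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy = min(max(y + dy, 0), height - 1)  # clamp row index
                    xx = min(max(x + dx, 0), width - 1)   # clamp column index
                    total += gray[yy][xx]
            # Low-frequency content is the local mean; the residual is what remains.
            row.append(gray[y][x] - total / (size * size))
        residual.append(row)
    return residual
```

Flat regions give a residual near zero, while isolated pixel-level irregularities stand out, which is why residuals of this kind are a common starting point for artifact inspection.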
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.