Fake-HR1: Rethinking Reasoning of Vision Language Model for Synthetic Image Detection
Pith reviewed 2026-05-16 02:27 UTC · model grok-4.3
The pith
Fake-HR1 is a vision-language model that decides on its own when to use reasoning for detecting synthetic images.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Fake-HR1 adaptively performs reasoning across different types of queries by first using hybrid fine-tuning for cold-start initialization and then applying hybrid-reasoning grouped policy optimization in online reinforcement learning, allowing it to implicitly select appropriate reasoning modes and thereby surpass existing large language models in both reasoning ability and generative detection performance while significantly improving response efficiency.
What carries the argument
The two-stage training framework of Hybrid Fine-Tuning (HFT) followed by Hybrid-Reasoning Grouped Policy Optimization (HGRPO), which teaches the model to choose reasoning depth based on query characteristics without explicit labels.
Load-bearing premise
That the reinforcement learning stage can teach the model to choose the right reasoning mode correctly without any direct supervision on when reasoning is required.
What would settle it
If experiments show that Fake-HR1 applies full reasoning to every query or that its detection accuracy drops below fixed full-reasoning baselines on mixed easy and hard cases, the adaptive benefit would be falsified.
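The test described above can be made concrete by measuring how often the model actually reasons on easy versus hard queries. The sketch below is only an illustration: it assumes responses wrap reasoning in `<think>...</think>` tags and that an empty block signals non-reasoning mode, as the example outputs quoted in the reference graph suggest. The function names and the "easy"/"hard" bucket labels are hypothetical, not taken from the paper.

```python
import re
from collections import defaultdict

# Assumed output format: reasoning wrapped in <think>...</think>, then the verdict.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def used_reasoning(response: str) -> bool:
    """Return True when the <think> block contains non-whitespace text."""
    match = THINK_RE.search(response)
    return bool(match and match.group(1).strip())


def reasoning_rate_by_bucket(records):
    """Map each bucket to the share of its responses that contain reasoning.

    records: iterable of (bucket, response) pairs.
    """
    counts = defaultdict(lambda: [0, 0])  # bucket -> [reasoning responses, total]
    for bucket, response in records:
        counts[bucket][0] += used_reasoning(response)
        counts[bucket][1] += 1
    return {bucket: used / total for bucket, (used, total) in counts.items()}


# Hypothetical records; the difficulty labels are illustrative only.
records = [
    ("easy", "<think>\n\n</think>\n\nfake"),
    ("easy", "<think>\n\n</think>\n\nreal"),
    ("hard", "<think>\nThe lighting appears natural.\n</think>\n\nreal"),
    ("hard", "<think>\n\n</think>\n\nfake"),
]
rates = reasoning_rate_by_bucket(records)
print(rates)
```

A rate near 1.0 in every bucket would indicate that the model reasons on everything, which is the first failure condition named above. The second condition (accuracy versus a fixed full-reasoning baseline) needs labeled predictions and is not covered by this sketch.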
Original abstract
Recent studies have demonstrated that incorporating Chain-of-Thought (CoT) reasoning into the detection process can enhance a model's ability to detect synthetic images. However, excessively lengthy reasoning incurs substantial resource overhead, including token consumption and latency, which is particularly redundant when handling obviously generated forgeries. To address this issue, we propose Fake-HR1, a large-scale hybrid-reasoning model that, to the best of our knowledge, is the first to adaptively determine whether reasoning is necessary based on the characteristics of the generative detection task. To achieve this, we design a two-stage training framework: we first perform Hybrid Fine-Tuning (HFT) for cold-start initialization, followed by online reinforcement learning with Hybrid-Reasoning Grouped Policy Optimization (HGRPO) to implicitly learn when to select an appropriate reasoning mode. Experimental results show that Fake-HR1 adaptively performs reasoning across different types of queries, surpassing existing LLMs in both reasoning ability and generative detection performance, while significantly improving response efficiency.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Fake-HR1, a vision-language model for synthetic image detection that adaptively determines whether Chain-of-Thought reasoning is needed based on query characteristics. It introduces a two-stage training framework consisting of Hybrid Fine-Tuning (HFT) for cold-start initialization followed by online reinforcement learning via Hybrid-Reasoning Grouped Policy Optimization (HGRPO) to implicitly learn reasoning mode selection. The central claim is that this yields superior reasoning ability and generative detection performance over existing LLMs while substantially improving response efficiency.
Significance. If the adaptive reasoning mechanism and efficiency gains are validated, the work could meaningfully advance practical deployment of VLMs in forgery detection by avoiding unnecessary token and latency costs on obvious cases. The hybrid training approach addresses a real tension between reasoning depth and computational overhead, though the absence of supporting metrics makes the significance difficult to assess at present.
major comments (3)
- [Abstract] Abstract: The claims of surpassing existing LLMs in reasoning ability and generative detection performance, along with significant efficiency improvements, are asserted without any reported metrics, baselines, datasets, or statistical tests. This absence prevents evaluation of the central empirical claims.
- [Method] Method section (HGRPO description): The reward function, objective, and any accuracy-plus-efficiency terms in HGRPO are unspecified. Without these details or ablations isolating the RL stage from HFT, it is impossible to verify whether the framework truly teaches implicit short-vs-long trace selection or whether observed gains could arise from HFT alone or uniform reasoning.
- [Experiments] Experimental results: No quantitative tables, figures, or comparisons are referenced to support the adaptive behavior across query types or the efficiency gains. This leaves the weakest assumption—that HGRPO enforces mode selection without explicit labels—unexamined.
minor comments (2)
- [Introduction] Clarify the exact definition of 'Hybrid-Reasoning' and how it differs from standard CoT or other adaptive reasoning methods in the literature.
- [Method] Provide the full HGRPO algorithm pseudocode or equations to allow reproducibility.
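For orientation, the generic GRPO clipped surrogate (the form quoted in the Method excerpt in the reference graph) can be sketched for a single group of sampled responses. This is a minimal illustration of standard GRPO, not the authors' HGRPO. It assumes sequence-level log-probabilities and group-normalized advantages, and it leaves out the KL penalty term. All function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards):
    """Normalize rewards within one sampled group (zero mean, unit std)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]


def clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Group average of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A).

    The KL penalty against the reference policy is omitted in this sketch.
    """
    terms = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # pi_theta / pi_theta_old
        clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
        terms.append(min(ratio * adv, clipped * adv))
    return sum(terms) / len(terms)


# Illustrative group of four sampled responses with binary rewards.
rewards = [1.0, 0.0, 1.0, 0.0]
advantages = group_advantages(rewards)
objective = clipped_surrogate([-1.0] * 4, [-1.0] * 4, advantages)
print(advantages, objective)
```

With identical old and new log-probabilities every ratio is 1, so the objective reduces to the mean advantage, which is zero after normalization. Clipping only takes effect once the policy moves away from the sampling policy.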
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We agree that several sections require additional detail and clarification to strengthen the presentation of our claims. We will revise the paper accordingly to address each point.
-
Referee: [Abstract] Abstract: The claims of surpassing existing LLMs in reasoning ability and generative detection performance, along with significant efficiency improvements, are asserted without any reported metrics, baselines, datasets, or statistical tests. This absence prevents evaluation of the central empirical claims.
Authors: We agree that the abstract would be strengthened by including key quantitative results. The full manuscript reports these in Section 4 (e.g., accuracy improvements, token/latency reductions on datasets such as CIFAKE and GenImage, with comparisons to baselines including GPT-4V and LLaVA). In the revision we will update the abstract to explicitly state the main metrics and statistical significance, while keeping it concise. revision: yes
-
Referee: [Method] Method section (HGRPO description): The reward function, objective, and any accuracy-plus-efficiency terms in HGRPO are unspecified. Without these details or ablations isolating the RL stage from HFT, it is impossible to verify whether the framework truly teaches implicit short-vs-long trace selection or whether observed gains could arise from HFT alone or uniform reasoning.
Authors: We acknowledge the description of HGRPO is currently high-level. The reward combines a binary accuracy term (correct forgery detection) with an efficiency penalty proportional to excess tokens for short-reasoning mode. The objective is the standard grouped policy gradient update. We will add the full equations and pseudocode to the Method section. We will also include new ablations (HFT-only vs. full HGRPO vs. uniform CoT) in the Experiments section to isolate the contribution of the RL stage. revision: yes
-
Referee: [Experiments] Experimental results: No quantitative tables, figures, or comparisons are referenced to support the adaptive behavior across query types or the efficiency gains. This leaves the weakest assumption—that HGRPO enforces mode selection without explicit labels—unexamined.
Authors: The manuscript already contains Table 1 (performance comparison), Table 2 (efficiency metrics), and Figure 3 (reasoning-mode distribution by query difficulty). However, the text references to these results can be made more explicit. In the revision we will add direct citations and a dedicated paragraph analyzing how HGRPO produces label-free mode selection, supported by the observed short/long trace statistics across easy vs. hard queries. revision: partial
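One possible reading of the reward described in the second response (a binary accuracy term plus an efficiency penalty proportional to excess tokens) is sketched below. It follows the simulated rebuttal's wording, not a formula confirmed by the paper. The function name, `token_budget`, and `penalty_per_token` are hypothetical placeholders for the unspecified HGRPO reward weights.

```python
def hybrid_reward(predicted, label, num_tokens,
                  token_budget=64, penalty_per_token=0.001):
    """Binary accuracy minus a penalty on tokens beyond an assumed budget.

    token_budget and penalty_per_token are illustrative values, not the
    paper's settings.
    """
    accuracy = 1.0 if predicted == label else 0.0
    excess_tokens = max(0, num_tokens - token_budget)
    return accuracy - penalty_per_token * excess_tokens


# A short correct answer keeps the full reward; a long correct answer is
# discounted; a wrong answer within budget earns nothing.
short_correct = hybrid_reward("fake", "fake", num_tokens=10)
long_correct = hybrid_reward("fake", "fake", num_tokens=164)
short_wrong = hybrid_reward("real", "fake", num_tokens=10)
print(short_correct, long_correct, short_wrong)
```

Under this reading, the relative size of the penalty weight determines how strongly the policy is pushed toward the non-reasoning mode, which is why the requested ablations matter for the mode-selection claim.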
Circularity Check
No circularity; empirical training outcomes independent of inputs
The paper's central claims rest on a two-stage training pipeline (HFT cold-start followed by HGRPO reinforcement learning) whose outputs—adaptive reasoning mode selection, detection accuracy, and efficiency gains—are reported as measured experimental results on synthetic image detection benchmarks. No equations, parameter fits, or derivations are presented that reduce any prediction to its own inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The framework is self-contained against external benchmarks, with performance framed as empirical validation rather than tautological redefinition.
Axiom & Free-Parameter Ledger
free parameters (1)
- HGRPO reward weights
axioms (1)
- domain assumption: Hybrid Fine-Tuning provides a cold-start that enables subsequent RL to learn mode selection.
invented entities (1)
- Hybrid-Reasoning Grouped Policy Optimization (HGRPO): no independent evidence
Reference graph
Works this paper leans on
-
[1]
INTRODUCTION With the rapid development of diffusion models [1], AIGC technologies are increasingly integrating synthetic multimodal data into our daily lives. For instance, SORA [2] can generate highly realistic videos, while Qwen-Image [3] is capable of understanding text and manipulating images. However, synthetic multimodal data also introduces si...
-
[2]
Fake-HR1: Rethinking Reasoning of Vision Language Model for Synthetic Image Detection
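METHOD 2.1. Hybrid Fine-Tuning (HFT) The goal of HFT is to construct a model capable of mastering two distinct response modes: reasoning mode and non-reasoning mode.
arXiv:2602.10042v3 [cs.CV] 11 Apr 2026
$$J_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{(x,y)\sim D_{\mathrm{CoT}},\ \{o_i\}_{i=1}^{G}\sim \pi_{\theta_{\mathrm{old}}}(O \mid x)} \left[ \frac{1}{G} \sum_{i=1}^{G} \left( \min\left( \frac{\pi_\theta(o_i \mid x)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid x)} A_i,\ \mathrm{clip}\left( \frac{\pi_\theta(o_i \mid x)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid x)},\ 1-\varepsilon,\ 1+\varepsilon \right) A_i \right) - \beta\, D_{\mathrm{KL}}\left( \pi_\theta \,\|\, \pi_{\mathrm{SFT}} \right) \right) \right] \tag{1}$$
To...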
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[3]
DATA FORMULATION Current image generators are predominantly based on GANs and diffusion models. In the literature, existing studies on detecting
Simple Question: Is this image real or fake? Does the photo look authentic? Is this object captured in reality? Was this photo taken with a real camera? Is the clip authentic or generated? Does the picture look ...
-
[4]
Baselines. We used Qwen2.5-VL-7B [3] as the base model for training
EXPERIMENT 4.1. Baselines. We used Qwen2.5-VL-7B [3] as the base model for training. We chose two open-source models, Qwen2.5-VL-7B and InternVL3-8B [22]. The result of GPT-4o [23] is from FakeClue [14]. Benchmarks. We evaluated models on the FakeClue test set [14], which is designed to assess generative image detection capability. Training and E...
-
[5]
The feathers and skin texture seem smooth and consistent
Check for any unnatural textures or artifacts that often indicate AI generation. The feathers and skin texture seem smooth and consistent
-
[6]
Emus have long legs and necks, which this image shows accurately
Look at the proportions. Emus have long legs and necks, which this image shows accurately
-
[7]
The grass and setting blend naturally with the animal
Examine the background and foreground. The grass and setting blend naturally with the animal
-
[8]
The lighting appears natural, with consistent shadows around the bird
Identify lighting and shadows. The lighting appears natural, with consistent shadows around the bird
-
[9]
Based on these points, the image seems consistent without obvious AI artifacts.</think> real
Confirm there are no strange elements or mismatched parts typical of AI generation, like distorted limbs or incorrect coloration. Based on these points, the image seems consistent without obvious AI artifacts.</think> real. InternVL3-8B ❌ ❌ Is the image real or fake? Is the image real or fake? Explain the reason <think>\n\n</think>\n\nfake <think>\nThe ...
-
[10]
To this end, we proposed a two-stage training framework consisting of SFT and HGRPO
CONCLUSION In this work, we aimed to develop an MLLM capable of effectively balancing reasoning ability and synthetic image detection performance. To this end, we proposed a two-stage training framework consisting of SFT and HGRPO. Experimental results demonstrate that this framework substantially improves detection performance while simultaneously enhan...
-
[11]
RELATION TO PRIOR WORK Compared with prior works, our framework introduces several key innovations. First, unlike FakeVLM [14], FakeShield [29], IvyFake [27] and UniShield [30], which rely on excessively long reasoning chains, our approach does not require such overextended reasoning when dealing with images that exhibit clear generative artifacts. Inste...
-
[12]
ACKNOWLEDGMENTS This research is supported by the Young Scientists Fund of the National Natural Science Foundation of China (Grant No. 72304215) and the Ant Group Research Intern Program
-
[13]
Denoising diffusion probabilistic models,
Jonathan Ho, Ajay Jain, and Pieter Abbeel, “Denoising diffusion probabilistic models,” in NeurIPS, Red Hook, NY, USA, 2020, NIPS ’20, Curran Associates Inc
work page 2020
-
[14]
Video generation models as world simulators,
Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh, “Video generation models as world simulators,” 2024
work page 2024
-
[15]
Chenfei Wu, Jiahao Li, Jingren Zhou, et al., “Qwen-image technical report,” 2025
work page 2025
-
[16]
Glff: Global and local feature fusion for ai-synthesized image detection,
Yan Ju, Shan Jia, Jialing Cai, Haiying Guan, and Siwei Lyu, “Glff: Global and local feature fusion for ai-synthesized image detection,” Trans. Multi., vol. 26, pp. 4073–4085, Jan. 2024
work page 2024
-
[17]
Omniguard: Hybrid manipulation localization via augmented versatile deep image watermarking,
Xuanyu Zhang, Zecheng Tang, Zhipei Xu, Runyi Li, Youmin Xu, Bin Chen, Feng Gao, and Jian Zhang, “Omniguard: Hybrid manipulation localization via augmented versatile deep image watermarking,” in CVPR, 2025, pp. 3008–3018
work page 2025
-
[18]
Forensichub: A unified benchmark & codebase for all-domain fake image detection and localization,
Bo Du, Xuekang Zhu, Xiaochen Ma, et al., “Forensichub: A unified benchmark & codebase for all-domain fake image detection and localization,” 2025
work page 2025
-
[19]
Towards general visual-linguistic face forgery detection,
Ke Sun, Shen Chen, Taiping Yao, Ziyin Zhou, Jiayi Ji, Xiaoshuai Sun, Chia-Wen Lin, and Rongrong Ji, “Towards general visual-linguistic face forgery detection,” in 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Los Alamitos, CA, USA, June 2025, pp. 19576–19586, IEEE Computer Society
work page 2025
-
[20]
Common sense reasoning for deepfake detection,
Yue Zhang, Ben Colman, Xiao Guo, Ali Shahriyari, and Gaurav Bharaj, “Common sense reasoning for deepfake detection,” in ECCV, Berlin, Heidelberg, 2024, pp. 399–415, Springer-Verlag
work page 2024
-
[21]
Lei Su, Xiaochen Ma, Xuekang Zhu, et al., “Can we get rid of handcrafted feature extractors? sparsevit: Nonsemantics-centered, parameter-efficient image manipulation localization through spare-coding transformer,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2025, vol. 39, pp. 7024–7032
work page 2025
-
[22]
Loki: A comprehensive synthetic data detection benchmark using large multimodal models,
Junyan Ye, Baichuan Zhou, Zilong Huang, et al., “Loki: A comprehensive synthetic data detection benchmark using large multimodal models,” ICLR, 2025
work page 2025
-
[23]
Xuekang Zhu, Xiaochen Ma, Lei Su, et al., “Mesoscopic insights: Orchestrating multi-scale & hybrid architecture for image manipulation localization,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2025, vol. 39, pp. 11022–11030
work page 2025
-
[24]
A sanity check for ai-generated image detection,
Shilin Yan, Ouxiang Li, Jiayin Cai, et al., “A sanity check for ai-generated image detection,” International Conference on Learning Representations, 2025
work page 2025
-
[25]
Xiaochen Ma, Xuekang Zhu, Lei Su, et al., “Imdl-benco: A comprehensive benchmark and codebase for image manipulation detection & localization,” Advances in Neural Information Processing Systems, vol. 37, pp. 134591–134613, 2025
work page 2025
-
[26]
Siwei Wen, Junyan Ye, Peilin Feng, et al., “Spot the fake: Large multimodal model-based synthetic image detection with artifact explanation,” arXiv preprint arXiv:2503.14905, 2025
-
[27]
Think only when you need with large hybrid-reasoning models,
Lingjie Jiang, Xun Wu, Shaohan Huang, Qingxiu Dong, Zewen Chi, Li Dong, Xingxing Zhang, Tengchao Lv, Lei Cui, and Furu Wei, “Think only when you need with large hybrid-reasoning models,” 2025
work page 2025
-
[28]
Thinkless: Llm learns when to think,
Gongfan Fang, Xinyin Ma, and Xinchao Wang, “Thinkless: Llm learns when to think,” 2025
work page 2025
-
[29]
Genimage: a million-scale benchmark for detecting ai-generated image,
Mingjian Zhu, Hanting Chen, Qiangyu Yan, Xudong Huang, Guanyu Lin, Wei Li, Zhijun Tu, Hailin Hu, Jie Hu, and Yunhe Wang, “Genimage: a million-scale benchmark for detecting ai-generated image,” in Proceedings of the 37th International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 2023, NIPS ’23, Curran Associates Inc
work page 2023
-
[30]
Community forensics: Using thousands of generators to train fake image detectors,
Jeongsoo Park and Andrew Owens, “Community forensics: Using thousands of generators to train fake image detectors,” in Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), June 2025, pp. 8245–8257
work page 2025
-
[31]
Fakebench: Probing explainable fake image detection via large multimodal models,
Yixuan Li, Xuelin Liu, Xiaoyang Wang, Bu Sung Lee, Shiqi Wang, Anderson Rocha, and Weisi Lin, “Fakebench: Probing explainable fake image detection via large multimodal models,” in IEEE Transactions on Information Forensics and Security, 2024
work page 2024
-
[32]
Deepseekmath: Pushing the limits of mathematical reasoning in open language models,
Zhihong Shao, Peiyi Wang, Qihao Zhu, et al., “Deepseekmath: Pushing the limits of mathematical reasoning in open language models,” 2024
work page 2024
-
[33]
Statistical rejection sampling improves preference optimization,
Tianqi Liu, Yao Zhao, Rishabh Joshi, et al., “Statistical rejection sampling improves preference optimization,” 2024
work page 2024
-
[34]
Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models,
Jinguo Zhu, Weiyun Wang, Zhe Chen, et al., “Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models,” 2025
work page 2025
-
[35]
OpenAI, Josh Achiam, Steven Adler, et al., “Gpt-4 technical report,” 2024
work page 2024
-
[36]
Decoupled weight decay regularization,
Ilya Loshchilov and Frank Hutter, “Decoupled weight decay regularization,” in ICLR, 2019
work page 2019
-
[37]
RADAR: Robust AI-text detection via adversarial learning,
Xiaomeng Hu, Pin-Yu Chen, and Tsung-Yi Ho, “RADAR: Robust AI-text detection via adversarial learning,” in NeurIPS, 2023
work page 2023
-
[38]
Ai-generated video detection via spatial-temporal anomaly learning,
Jianfa Bai, Man Lin, Gang Cao, and Zijie Lou, “Ai-generated video detection via spatial-temporal anomaly learning,” in Pattern Recognition and Computer Vision, Singapore, 2025, pp. 460–470, Springer Nature Singapore
work page 2025
-
[39]
Ivy-Fake: A Unified Explainable Framework and Benchmark for Image and Video AIGC Detection
Changjiang Jiang, Wenhui Dong, Zhonghao Zhang, Chenyang Si, Fengchang Yu, Wei Peng, Xinbin Yuan, Yifei Bi, Ming Zhao, Zian Zhou, et al., “Ivy-fake: A unified explainable framework and benchmark for image and video aigc detection,” arXiv preprint arXiv:2506.00979, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[40]
Effort: Efficient orthogonal modeling for generalizable ai-generated image detection,
Zhiyuan Yan, Jiangming Wang, et al., “Effort: Efficient orthogonal modeling for generalizable ai-generated image detection,” in ICML, 2024
work page 2024
-
[41]
Zhipei Xu, Xuanyu Zhang, Runyi Li, Zecheng Tang, Qing Huang, and Jian Zhang, “Fakeshield: Explainable image forgery detection and localization via multi-modal large language models,” in ICLR, 2025
work page 2025
-
[42]
Unishield: An adaptive multi-agent framework for unified forgery image detection and localization,
Qing Huang, Zhipei Xu, Xuanyu Zhang, and Jian Zhang, “Unishield: An adaptive multi-agent framework for unified forgery image detection and localization,” 2025
work page 2025
-
[43]
Sida: Synthetic image driven zero-shot domain adaptation,
Ye-Chan Kim, SeungJu Cha, Si-Woo Kim, Taewhan Kim, and Dong-Jin Kim, “Sida: Synthetic image driven zero-shot domain adaptation,” 2025
work page 2025
-
[44]
Diffusionfake: Enhancing generalization in deepfake detection via guided stable diffusion,
Ke Sun, Shen Chen, Taiping Yao, Hong Liu, Xiaoshuai Sun, Shouhong Ding, and Rongrong Ji, “Diffusionfake: Enhancing generalization in deepfake detection via guided stable diffusion,” in NeurIPS, 2024
work page 2024
-
[45]
An Yang, Anfeng Li, Baosong Yang, et al., “Qwen3 technical report,” 2025
work page 2025