arxiv: 2602.10042 · v3 · submitted 2026-02-10 · 💻 cs.CV · cs.AI

Recognition: no theorem link

Fake-HR1: Rethinking Reasoning of Vision Language Model for Synthetic Image Detection

Changjiang Jiang , Xinkuan Sha , Fengchang Yu , Jingjing Liu , Jian Liu , Mingqi Fang , Chenfeng Zhang , Wei Lu

Authors on Pith no claims yet

Pith reviewed 2026-05-16 02:27 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords synthetic image detectionvision-language modelsadaptive reasoningchain-of-thoughtreinforcement learninghybrid fine-tuninggenerative image forensics

0 comments

The pith

Fake-HR1 is a vision-language model that decides on its own when to use reasoning for detecting synthetic images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Fake-HR1 to solve the waste of always running full chain-of-thought reasoning during synthetic image detection. It trains the model in two stages: first hybrid fine-tuning to handle both short and long reasoning, then reinforcement learning that rewards efficient choices without telling the model in advance when to reason. This matters because lengthy reasoning adds token use and delay even for obvious fakes, so learning to skip it when unnecessary could make detection systems faster and cheaper in practice. A sympathetic reader sees the work as showing that reasoning depth can be treated as a learned policy rather than a fixed setting.

Core claim

Fake-HR1 adaptively performs reasoning across different types of queries by first using hybrid fine-tuning for cold-start initialization and then applying hybrid-reasoning grouped policy optimization in online reinforcement learning, allowing it to implicitly select appropriate reasoning modes and thereby surpass existing large language models in both reasoning ability and generative detection performance while significantly improving response efficiency.

What carries the argument

The two-stage training framework of Hybrid Fine-Tuning (HFT) followed by Hybrid-Reasoning Grouped Policy Optimization (HGRPO), which teaches the model to choose reasoning depth based on query characteristics without explicit labels.

Load-bearing premise

That the reinforcement learning stage can teach the model to choose the right reasoning mode correctly without any direct supervision on when reasoning is required.

What would settle it

If experiments show that Fake-HR1 applies full reasoning to every query or that its detection accuracy drops below fixed full-reasoning baselines on mixed easy and hard cases, the adaptive benefit would be falsified.

read the original abstract

Recent studies have demonstrated that incorporating Chain-of-Thought (CoT) reasoning into the detection process can enhance a model's ability to detect synthetic images. However, excessively lengthy reasoning incurs substantial resource overhead, including token consumption and latency, which is particularly redundant when handling obviously generated forgeries. To address this issue, we propose Fake-HR1, a large-scale hybrid-reasoning model that, to the best of our knowledge, is the first to adaptively determine whether reasoning is necessary based on the characteristics of the generative detection task. To achieve this, we design a two-stage training framework: we first perform Hybrid Fine-Tuning (HFT) for cold-start initialization, followed by online reinforcement learning with Hybrid-Reasoning Grouped Policy Optimization (HGRPO) to implicitly learn when to select an appropriate reasoning mode. Experimental results show that Fake-HR1 adaptively performs reasoning across different types of queries, surpassing existing LLMs in both reasoning ability and generative detection performance, while significantly improving response efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Fake-HR1 claims an adaptive VLM that skips CoT on obvious fakes via HFT then HGRPO, but the abstract gives no numbers or ablations so the adaptivity remains unproven.

read the letter

The main takeaway is that this paper builds a vision-language model for synthetic image detection that tries to decide on the fly whether to run full reasoning or give a short answer. They start with hybrid fine-tuning for initialization and then apply their HGRPO reinforcement learning to shape the policy toward the right reasoning length without explicit mode labels. That two-stage setup is the clearest new element relative to earlier CoT detection work.

Referee Report

3 major / 2 minor

Summary. The paper proposes Fake-HR1, a vision-language model for synthetic image detection that adaptively determines whether Chain-of-Thought reasoning is needed based on query characteristics. It introduces a two-stage training framework consisting of Hybrid Fine-Tuning (HFT) for cold-start initialization followed by online reinforcement learning via Hybrid-Reasoning Grouped Policy Optimization (HGRPO) to implicitly learn reasoning mode selection. The central claim is that this yields superior reasoning ability and generative detection performance over existing LLMs while substantially improving response efficiency.

Significance. If the adaptive reasoning mechanism and efficiency gains are validated, the work could meaningfully advance practical deployment of VLMs in forgery detection by avoiding unnecessary token and latency costs on obvious cases. The hybrid training approach addresses a real tension between reasoning depth and computational overhead, though the absence of supporting metrics makes the significance difficult to assess at present.

major comments (3)

[Abstract] Abstract: The claims of surpassing existing LLMs in reasoning ability and generative detection performance, along with significant efficiency improvements, are asserted without any reported metrics, baselines, datasets, or statistical tests. This absence prevents evaluation of the central empirical claims.
[Method] Method section (HGRPO description): The reward function, objective, and any accuracy-plus-efficiency terms in HGRPO are unspecified. Without these details or ablations isolating the RL stage from HFT, it is impossible to verify whether the framework truly teaches implicit short-vs-long trace selection or whether observed gains could arise from HFT alone or uniform reasoning.
[Experiments] Experimental results: No quantitative tables, figures, or comparisons are referenced to support the adaptive behavior across query types or the efficiency gains. This leaves the weakest assumption—that HGRPO enforces mode selection without explicit labels—unexamined.

minor comments (2)

[Introduction] Clarify the exact definition of 'Hybrid-Reasoning' and how it differs from standard CoT or other adaptive reasoning methods in the literature.
[Method] Provide the full HGRPO algorithm pseudocode or equations to allow reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We agree that several sections require additional detail and clarification to strengthen the presentation of our claims. We will revise the paper accordingly to address each point.

read point-by-point responses

Referee: [Abstract] Abstract: The claims of surpassing existing LLMs in reasoning ability and generative detection performance, along with significant efficiency improvements, are asserted without any reported metrics, baselines, datasets, or statistical tests. This absence prevents evaluation of the central empirical claims.

Authors: We agree that the abstract would be strengthened by including key quantitative results. The full manuscript reports these in Section 4 (e.g., accuracy improvements, token/latency reductions on datasets such as CIFAKE and GenImage, with comparisons to baselines including GPT-4V and LLaVA). In the revision we will update the abstract to explicitly state the main metrics and statistical significance, while keeping it concise. revision: yes
Referee: [Method] Method section (HGRPO description): The reward function, objective, and any accuracy-plus-efficiency terms in HGRPO are unspecified. Without these details or ablations isolating the RL stage from HFT, it is impossible to verify whether the framework truly teaches implicit short-vs-long trace selection or whether observed gains could arise from HFT alone or uniform reasoning.

Authors: We acknowledge the description of HGRPO is currently high-level. The reward combines a binary accuracy term (correct forgery detection) with an efficiency penalty proportional to excess tokens for short-reasoning mode. The objective is the standard grouped policy gradient update. We will add the full equations and pseudocode to the Method section. We will also include new ablations (HFT-only vs. full HGRPO vs. uniform CoT) in the Experiments section to isolate the contribution of the RL stage. revision: yes
Referee: [Experiments] Experimental results: No quantitative tables, figures, or comparisons are referenced to support the adaptive behavior across query types or the efficiency gains. This leaves the weakest assumption—that HGRPO enforces mode selection without explicit labels—unexamined.

Authors: The manuscript already contains Table 1 (performance comparison), Table 2 (efficiency metrics), and Figure 3 (reasoning-mode distribution by query difficulty). However, the text references to these results can be made more explicit. In the revision we will add direct citations and a dedicated paragraph analyzing how HGRPO produces label-free mode selection, supported by the observed short/long trace statistics across easy vs. hard queries. revision: partial

Circularity Check

0 steps flagged

No circularity; empirical training outcomes independent of inputs

full rationale

The paper's central claims rest on a two-stage training pipeline (HFT cold-start followed by HGRPO reinforcement learning) whose outputs—adaptive reasoning mode selection, detection accuracy, and efficiency gains—are reported as measured experimental results on synthetic image detection benchmarks. No equations, parameter fits, or derivations are presented that reduce any prediction to its own inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The framework is self-contained against external benchmarks, with performance framed as empirical validation rather than tautological redefinition.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The central claim rests on the assumption that standard VLM fine-tuning plus a new grouped policy optimization can implicitly learn reasoning selection; no explicit free parameters are named but the RL stage implicitly fits policy weights to reward signals.

free parameters (1)

HGRPO reward weights
Implicitly fitted during online reinforcement learning to balance reasoning cost and detection accuracy.

axioms (1)

domain assumption Hybrid Fine-Tuning provides a cold-start that enables subsequent RL to learn mode selection.
Invoked in the two-stage training description.

invented entities (1)

Hybrid-Reasoning Grouped Policy Optimization (HGRPO) no independent evidence
purpose: To train implicit selection of reasoning modes
New optimization variant introduced for this task.

pith-pipeline@v0.9.0 · 5495 in / 1175 out tokens · 90518 ms · 2026-05-16T02:27:32.565732+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · 2 internal anchors

[1]

For instance, SORA [2] can generate highly realistic videos, while Qwen-Image [3] is capable of understanding text and manipulating images

INTRODUCTION With the rapid development of diffusion models [1], AIGC technolo- gies are increasingly integrating synthetic multimodal data into our daily lives. For instance, SORA [2] can generate highly realistic videos, while Qwen-Image [3] is capable of understanding text and manipulating images. However, synthetic multimodal data also in- troduces si...

work page
[2]

Fake-HR1: Rethinking Reasoning of Vision Language Model for Synthetic Image Detection

METHOD 2.1. Hybrid Fine-Tuning (HFT) The goal of HFT is to construct a model capable of mastering two distinct response modes—reasoning mode and non-reasoning mode. arXiv:2602.10042v3 [cs.CV] 11 Apr 2026 JGRP O(θ) =E (x,y)∼DCoT,{oi}G i=1∼πθold (O|x) 1 G GX i=1 min πθ(oi |x) πθold (oi |x) Ai,(1) clip πθ(oi |x) πθold (oi |x) ,1−ε,1 +ε Ai −βD KL (πθ∥πSFT) To...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[3]

DA TA FORMULA TION Current image generators are predominantly based on GANs and diffusion models. In the literature, existing studies on detecting Simple Question Is this image real or fake? Does the photo look authentic? Is this object captured in reality? Was this photo taken with a real camera? Is the clip authentic or generated? Does the picture look ...

work page
[4]

Baselines BaselinesWe used Qwen2.5-VL-7B [3] as the base model for train- ing

EXPERIMENT 4.1. Baselines BaselinesWe used Qwen2.5-VL-7B [3] as the base model for train- ing. We choose two open-source models, Qwen2.5-VL-7B and InternVL3-8B [22]. The result of GPT-4o [23] is from FakeClue [14]. BenchmarksWe evaluated models on the FakeClue test set [14], which is designed to assess generative image detection capability. Training and E...

work page arXiv 2000
[5]

The feathers and skin texture seem smooth and consistent

Check for any unnatural textures or artifacts that often indicate AI generation. The feathers and skin texture seem smooth and consistent

work page
[6]

Emus have long legs and necks, which this image shows accurately

Look at the proportions. Emus have long legs and necks, which this image shows accurately

work page
[7]

The grass and setting blend naturally with the animal

Examine the background and foreground. The grass and setting blend naturally with the animal

work page
[8]

The lighting appears natural, with consistent shadows around the bird

Identify lighting and shadows. The lighting appears natural, with consistent shadows around the bird

work page
[9]

Based on these points, the image seems consistent without obvious AI artifacts.</ think> real

Confirm there are no strange elements or mismatched parts typical of AI generation, like distorted limbs or incorrect coloration. Based on these points, the image seems consistent without obvious AI artifacts.</ think> real . InternVL3-8B ❌ ❌ Is the image real or fake? Is the image real or fake? Explain the reason <think>\n\n</think>\n\nfake <think>\nThe ...

work page
[10]

To this end, we proposed a two-stage training framework consisting of SFT and HGRPO

CONCLUSION In this work, we aimed to develop a MLLM capable of effectively balancing reasoning ability and synthetic image detection perfor- mance. To this end, we proposed a two-stage training framework consisting of SFT and HGRPO. Experimental results demonstrate that this framework substantially improves detection performance while simultaneously enhan...

work page
[11]

RELA TION TO PRIOR WORK Compared with prior works, our framework introduces several key innovations. First, unlike FakeVLM [14], FakeShield [29], IvyFake [27] and UniShield [30], which rely on excessively long reasoning chains, our approach does not require such overextended reasoning when dealing with images that exhibit clear generative artifacts. Inste...

work page
[12]

ACKNOWLEDGMENTS This research is supported by the Young Scientists Fund of the Na- tional Natural Science Foundation of China (Grant No.72304215) and the Ant Group Research Intern Program

work page
[13]

Denoising diffu- sion probabilistic models,

Jonathan Ho, Ajay Jain, and Pieter Abbeel, “Denoising diffu- sion probabilistic models,” inNeurIPS, Red Hook, NY , USA, 2020, NIPS ’20, Curran Associates Inc

work page 2020
[14]

Video generation models as world simulators,

Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh, “Video generation models as world simulators,” 2024

work page 2024
[15]

Qwen-image technical report,

Chenfei Wu, Jiahao Li, Jingren Zhou, et al., “Qwen-image technical report,” 2025

work page 2025
[16]

Glff: Global and local feature fusion for ai-synthesized image detection,

Yan Ju, Shan Jia, Jialing Cai, Haiying Guan, and Siwei Lyu, “Glff: Global and local feature fusion for ai-synthesized image detection,”Trans. Multi., vol. 26, pp. 4073–4085, Jan. 2024

work page 2024
[17]

Omniguard: Hy- brid manipulation localization via augmented versatile deep image watermarking,

Xuanyu Zhang, Zecheng Tang, Zhipei Xu, Runyi Li, Youmin Xu, Bin Chen, Feng Gao, and Jian Zhang, “Omniguard: Hy- brid manipulation localization via augmented versatile deep image watermarking,” inCVPR, 2025, pp. 3008–3018

work page 2025
[18]

Forensichub: A unified benchmark & codebase for all-domain fake image de- tection and localization,

Bo Du, Xuekang Zhu, Xiaochen Ma, et al., “Forensichub: A unified benchmark & codebase for all-domain fake image de- tection and localization,” 2025

work page 2025
[19]

Towards general visual-linguistic face forgery detection,

Ke Sun, Shen Chen, Taiping Yao, Ziyin Zhou, Jiayi Ji, Xi- aoshuai Sun, Chia-Wen Lin, and Rongrong Ji, “Towards general visual-linguistic face forgery detection,” in2025 IEEE/CVF Conference on Computer Vision and Pattern Recog- nition (CVPR), Los Alamitos, CA, USA, June 2025, pp. 19576–19586, IEEE Computer Society

work page 2025
[20]

Common sense reasoning for deepfake detection,

Yue Zhang, Ben Colman, Xiao Guo, Ali Shahriyari, and Gau- rav Bharaj, “Common sense reasoning for deepfake detection,” inECCV, Berlin, Heidelberg, 2024, p. 399–415, Springer- Verlag

work page 2024
[21]

Can we get rid of handcrafted feature extractors? sparsevit: Nonsemantics- centered, parameter-efficient image manipulation localization through spare-coding transformer,

Lei Su, Xiaochen Ma, Xuekang Zhu, et al., “Can we get rid of handcrafted feature extractors? sparsevit: Nonsemantics- centered, parameter-efficient image manipulation localization through spare-coding transformer,” inProceedings of the AAAI Conference on Artificial Intelligence, 2025, vol. 39, pp. 7024– 7032

work page 2025
[22]

Loki: A comprehensive synthetic data detection benchmark using large multimodal models,

Junyan Ye, Baichuan Zhou, Zilong Huang, et al., “Loki: A comprehensive synthetic data detection benchmark using large multimodal models,”ICLR, 2025

work page 2025
[23]

Mesoscopic in- sights: Orchestrating multi-scale & hybrid architecture for im- age manipulation localization,

Xuekang Zhu, Xiaochen Ma, Lei Su, et al., “Mesoscopic in- sights: Orchestrating multi-scale & hybrid architecture for im- age manipulation localization,” inProceedings of the AAAI Conference on Artificial Intelligence, 2025, vol. 39, pp. 11022– 11030

work page 2025
[24]

A sanity check for ai-generated image detection,

Shilin Yan, Ouxiang Li, Jiayin Cai, et al., “A sanity check for ai-generated image detection,”International Conference on Learning Representations, 2025

work page 2025
[25]

Imdl-benco: A comprehensive benchmark and codebase for image manipula- tion detection & localization,

Xiaochen Ma, Xuekang Zhu, Lei Su, et al., “Imdl-benco: A comprehensive benchmark and codebase for image manipula- tion detection & localization,”Advances in Neural Information Processing Systems, vol. 37, pp. 134591–134613, 2025

work page 2025
[26]

Spot the fake: Large multimodal model-based synthetic image detection with artifact explanation.arXiv preprint arXiv:2503.14905, 2025

Siwei Wen, Junyan Ye, Peilin Feng, et al., “Spot the fake: Large multimodal model-based synthetic image detection with artifact explanation,”arXiv preprint arXiv:2503.14905, 2025

work page arXiv 2025
[27]

Think only when you need with large hybrid- reasoning models,

Lingjie Jiang, Xun Wu, Shaohan Huang, Qingxiu Dong, Zewen Chi, Li Dong, Xingxing Zhang, Tengchao Lv, Lei Cui, and Furu Wei, “Think only when you need with large hybrid- reasoning models,” 2025

work page 2025
[28]

Thinkless: Llm learns when to think,

Gongfan Fang, Xinyin Ma, and Xinchao Wang, “Thinkless: Llm learns when to think,” 2025

work page 2025
[29]

Genimage: a million-scale benchmark for detecting ai-generated image,

Mingjian Zhu, Hanting Chen, Qiangyu Yan, Xudong Huang, Guanyu Lin, Wei Li, Zhijun Tu, Hailin Hu, Jie Hu, and Yunhe Wang, “Genimage: a million-scale benchmark for detecting ai-generated image,” inProceedings of the 37th International Conference on Neural Information Processing Systems, Red Hook, NY , USA, 2023, NIPS ’23, Curran Associates Inc

work page 2023
[30]

Community forensics: Using thousands of generators to train fake image detectors,

Jeongsoo Park and Andrew Owens, “Community forensics: Using thousands of generators to train fake image detectors,” in Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), June 2025, pp. 8245–8257

work page 2025
[31]

Fakebench: Probing explainable fake image detection via large multimodal mod- els,

Yixuan Li, Xuelin Liu, Xiaoyang Wang, Bu Sung Lee, Shiqi Wang, Anderson Rocha, and Weisi Lin, “Fakebench: Probing explainable fake image detection via large multimodal mod- els,” inIEEE Transactions on Information Forensics and Se- curity, 2024

work page 2024
[32]

Deepseekmath: Pushing the limits of mathematical reasoning in open language models,

Zhihong Shao, Peiyi Wang, Qihao Zhu, et al., “Deepseekmath: Pushing the limits of mathematical reasoning in open language models,” 2024

work page 2024
[33]

Statistical rejec- tion sampling improves preference optimization,

Tianqi Liu, Yao Zhao, Rishabh Joshi, et al., “Statistical rejec- tion sampling improves preference optimization,” 2024

work page 2024
[34]

Internvl3: Ex- ploring advanced training and test-time recipes for open-source multimodal models,

Jinguo Zhu, Weiyun Wang, Zhe Chen, et al., “Internvl3: Ex- ploring advanced training and test-time recipes for open-source multimodal models,” 2025

work page 2025
[35]

Gpt-4 technical report,

OpenAI, Josh Achiam, Steven Adler, et al., “Gpt-4 technical report,” 2024

work page 2024
[36]

Decoupled weight decay regularization,

Ilya Loshchilov and Frank Hutter, “Decoupled weight decay regularization,” inICLR, 2019

work page 2019
[37]

RADAR: Robust AI-text detection via adversarial learning,

Xiaomeng Hu, Pin-Yu Chen, and Tsung-Yi Ho, “RADAR: Robust AI-text detection via adversarial learning,” inNeurIPS, 2023

work page 2023
[38]

Ai-generated video detection via spatial-temporal anomaly learning,

Jianfa Bai, Man Lin, Gang Cao, and Zijie Lou, “Ai-generated video detection via spatial-temporal anomaly learning,” inPat- tern Recognition and Computer Vision, Singapore, 2025, pp. 460–470, Springer Nature Singapore

work page 2025
[39]

Ivy-Fake: A Unified Explainable Framework and Benchmark for Image and Video AIGC Detection

Changjiang Jiang, Wenhui Dong, Zhonghao Zhang, Chenyang Si, Fengchang Yu, Wei Peng, Xinbin Yuan, Yifei Bi, Ming Zhao, Zian Zhou, et al., “Ivy-fake: A unified explainable framework and benchmark for image and video aigc detec- tion,”arXiv preprint arXiv:2506.00979, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[40]

Effort: Efficient orthog- onal modeling for generalizable ai-generated image detection,

Zhiyuan Yan, Jiangming Wang, et al., “Effort: Efficient orthog- onal modeling for generalizable ai-generated image detection,” inICML, 2024

work page 2024
[41]

Fakeshield: Explainable image forgery detection and localization via multi-modal large lan- guage models,

Zhipei Xu, Xuanyu Zhang, Runyi Li, Zecheng Tang, Qing Huang, and Jian Zhang, “Fakeshield: Explainable image forgery detection and localization via multi-modal large lan- guage models,” inICLR, 2025

work page 2025
[42]

Un- ishield: An adaptive multi-agent framework for unified forgery image detection and localization,

Qing Huang, Zhipei Xu, Xuanyu Zhang, and Jian Zhang, “Un- ishield: An adaptive multi-agent framework for unified forgery image detection and localization,” 2025

work page 2025
[43]

Sida: Synthetic image driven zero-shot do- main adaptation,

Ye-Chan Kim, SeungJu Cha, Si-Woo Kim, Taewhan Kim, and Dong-Jin Kim, “Sida: Synthetic image driven zero-shot do- main adaptation,” 2025

work page 2025
[44]

Diffusionfake: Enhancing generalization in deepfake detection via guided stable diffu- sion,

Ke Sun, Shen Chen, Taiping Yao, Hong Liu, Xiaoshuai Sun, Shouhong Ding, and Rongrong Ji, “Diffusionfake: Enhancing generalization in deepfake detection via guided stable diffu- sion,” inNeurIPS, 2024

work page 2024
[45]

Qwen3 technical report,

An Yang, Anfeng Li, Baosong Yang, et al., “Qwen3 technical report,” 2025

work page 2025