pith. machine review for the scientific record.
sign in

arxiv: 2509.26272 · v3 · submitted 2025-09-30 · 💻 cs.CV · cs.LG

PRPO: Paragraph-level Policy Optimization for Vision-Language Deepfake Detection

Pith reviewed 2026-05-18 12:30 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords deepfake detectionvision-language modelsreinforcement learningpolicy optimizationmultimodal reasoningsynthetic mediaLLM alignment
0
0 comments X

The pith

Paragraph-level policy optimization aligns LLM reasoning with visual evidence to improve deepfake detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Paragraph-level Relative Policy Optimization to train vision-language models on deepfake detection. It builds a dataset with reasoning annotations and applies reinforcement learning so that model explanations receive rewards when they match image content at the paragraph level. A sympathetic reader would care because current multimodal models often produce explanations that ignore or contradict the visual evidence, which undermines their usefulness for verifying synthetic media. If the central claim holds, the result is detection systems that are both more accurate and more interpretable in practice.

Core claim

The central claim is that optimizing reinforcement learning policies at the paragraph level, using reward signals derived from alignment with visual evidence, produces models whose reasoning stays grounded in the actual image content and therefore detects deepfakes more reliably than earlier approaches.

What carries the argument

Paragraph-level Relative Policy Optimization (PRPO) is the reinforcement learning algorithm that supplies paragraph-level reward signals to align multimodal LLM outputs with image content.

If this is right

  • Deepfake detection accuracy rises when reasoning is optimized at the paragraph level with visual rewards.
  • The method outperforms GRPO under test-time conditions.
  • Explanations become more consistent with image content and less prone to misalignment.
  • Multimodal models gain the capacity to produce reliable, grounded outputs for safety-critical verification tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same paragraph-level reward structure could be tested on other vision-language tasks that require explanations to match visual details.
  • Finer-grained alignment during RL training may offer a route to lower hallucination rates in multimodal models more broadly.
  • Applying the technique to real-time streams of social-media images would reveal whether the gains hold outside laboratory datasets.

Load-bearing premise

Paragraph-level reward signals derived from visual evidence will produce stable generalization to unseen deepfakes without the annotations introducing dataset-specific biases or the RL process amplifying subtle misalignments.

What would settle it

Evaluating the trained model on a fresh deepfake dataset that uses generation methods absent from the training distribution and checking whether detection accuracy and explanation alignment both remain high or drop sharply.

Figures

Figures reproduced from arXiv: 2509.26272 by Issa Khalil, Khang Tran, Naseem Khan, NHatHai Phan, Tuan Nguyen.

Figure 1
Figure 1. Figure 1: Reasoning quality comparison between LLaVA and the proposed PRPO. LLaVA and other [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Three-step pipeline for generating high-quality reasoning annotations in DF-R5. The [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Proposed DX-LLaVA, a LLaVA fine-tuning framework for [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Distribution of deepfake detection features by category in the DF-R5 dataset (total of [PITH_FULL_IMAGE:figures/full_fig_p015_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Prompt for generating a comprehensive set of visual cues to identify deepfake facial [PITH_FULL_IMAGE:figures/full_fig_p017_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Prompt for consolidating 4 × K (e.g., 4 × 50) forensic-relevant features into a unified and non-redundant list, used across GPT-4o, Gemini 2.5, Qwen 2.5, and LLaMA 4. A.4 PROMPT FOR QUALITATIVE EVALUATION We provide the full prompt used to evaluate the quality of reasoning responses from different models in [PITH_FULL_IMAGE:figures/full_fig_p018_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Prompt for systematic scoring of k = 74 forensic-relevant characteristics in deepfake images, requiring per-feature evaluation and a final overall decision. A.5 DETAILED COMPUTATION OF PREDICTION CONSISTENCY REWARD (PCR) The Prediction Consistency Reward is computed through paragraph-level evidence scoring. Each paragraph is analyzed using dictionaries of real terms R, fake terms F, and negation terms N to… view at source ↗
Figure 8
Figure 8. Figure 8: Prompt for grouping real-indicative features into logical categories with reasoning. Each [PITH_FULL_IMAGE:figures/full_fig_p021_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Prompt provided to evaluators for scoring deepfake detection responses on a 0-5 scale [PITH_FULL_IMAGE:figures/full_fig_p022_9.png] view at source ↗
read the original abstract

The rapid rise of synthetic media has made deepfake detection a critical challenge for online safety and trust. Progress remains constrained by the scarcity of large, high-quality datasets. Although multimodal large language models (LLMs) exhibit strong reasoning capabilities, their performance on deepfake detection is poor, often producing explanations that are misaligned with visual evidence or hallucinatory. To address this limitation, we introduce a reasoning-annotated dataset for deepfake detection and propose Paragraph-level Relative Policy Optimization (PRPO), a reinforcement learning algorithm that aligns LLM reasoning with image content at the paragraph level. Experiments show that PRPO improves detection accuracy by a wide margin and achieves the highest reasoning score of 4.55/5.0. Ablation studies further demonstrate that PRPO significantly outperforms GRPO under test-time conditions. These results underscore the importance of grounding multimodal reasoning in visual evidence to enable more reliable and interpretable deepfake detection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces a reasoning-annotated dataset for deepfake detection and proposes Paragraph-level Relative Policy Optimization (PRPO), a reinforcement learning algorithm that aligns multimodal LLM reasoning with image content at the paragraph level. It claims that PRPO yields substantial gains in detection accuracy, achieves the highest reasoning score of 4.55/5.0, and outperforms GRPO in ablation studies under test-time conditions.

Significance. If the empirical claims hold after proper validation, the work could advance interpretable deepfake detection by grounding LLM outputs in visual evidence, addressing common issues of misalignment and hallucination in vision-language models for safety-critical applications.

major comments (3)
  1. [Abstract] Abstract: the central claims of 'wide margin' accuracy improvement and a 4.55/5.0 reasoning score are presented without any baseline numbers, dataset sizes, statistical tests, or implementation details, preventing verification of the reported gains.
  2. [Ablation studies] Ablation studies: the assertion that PRPO significantly outperforms GRPO under test-time conditions lacks reported variance, cross-dataset hold-out results, or analysis of reward variance across paragraph boundaries, leaving generalization unsupported.
  3. [Method] Method: paragraph-level reward signals are derived from visual evidence and reasoning annotations, yet no quantitative check on annotation fidelity or potential dataset-specific biases is provided, raising the risk that the RL objective amplifies rather than corrects subtle misalignments.
minor comments (2)
  1. [Method] Notation for the PRPO objective and the paragraph-level reward function should be defined more explicitly with equations to improve reproducibility.
  2. [Experiments] Figure captions and table headers would benefit from clearer labeling of metrics (e.g., accuracy vs. reasoning score) and experimental conditions.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each major comment point by point below, indicating planned revisions to improve clarity, verifiability, and rigor where appropriate.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claims of 'wide margin' accuracy improvement and a 4.55/5.0 reasoning score are presented without any baseline numbers, dataset sizes, statistical tests, or implementation details, preventing verification of the reported gains.

    Authors: We agree that the abstract would benefit from greater specificity to support immediate verification. The full manuscript provides baseline comparisons, the scale of the reasoning-annotated dataset, and details on the evaluation protocol including repeated runs in Sections 4 and 5. We will revise the abstract to incorporate key quantitative highlights and references to the experimental setup while preserving brevity. revision: yes

  2. Referee: [Ablation studies] Ablation studies: the assertion that PRPO significantly outperforms GRPO under test-time conditions lacks reported variance, cross-dataset hold-out results, or analysis of reward variance across paragraph boundaries, leaving generalization unsupported.

    Authors: We acknowledge that additional statistical and generalization details would strengthen the ablation claims. The current results report mean improvements; we will add standard deviations from multiple runs, cross-dataset hold-out evaluations, and an analysis of reward variance across paragraph boundaries in the revised ablation studies section. revision: yes

  3. Referee: [Method] Method: paragraph-level reward signals are derived from visual evidence and reasoning annotations, yet no quantitative check on annotation fidelity or potential dataset-specific biases is provided, raising the risk that the RL objective amplifies rather than corrects subtle misalignments.

    Authors: This concern about annotation quality and potential biases is well-taken. The manuscript describes the derivation of paragraph-level rewards from visual evidence and annotations in Section 3. To directly address fidelity and bias risks, we will add quantitative checks such as inter-annotator agreement metrics and alignment verification in the revised method section. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical RL training procedure with independent experimental validation

full rationale

The paper introduces a new reasoning-annotated dataset and applies a standard reinforcement learning algorithm (PRPO) to align multimodal LLM outputs with visual evidence for deepfake detection. All performance claims (accuracy gains, 4.55/5 reasoning score, outperformance vs GRPO) are presented as results of training and ablation experiments rather than as derivations or predictions that reduce to the method's own fitted quantities by construction. No equations, uniqueness theorems, or self-citations are used to justify the core improvement; the work is self-contained as an empirical procedure whose validity rests on held-out test performance rather than internal redefinition of inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The proposal rests on standard reinforcement-learning assumptions about reward design and policy optimization; no free parameters, ad-hoc axioms, or new physical entities are described in the abstract.

axioms (1)
  • domain assumption Reinforcement learning with paragraph-level rewards can align model-generated reasoning to visual evidence without introducing new hallucinations.
    Implicit foundation for claiming PRPO improves both accuracy and reasoning quality.

pith-pipeline@v0.9.0 · 5695 in / 1184 out tokens · 52690 ms · 2026-05-18T12:30:27.223050+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

81 extracted references · 81 canonical work pages · 5 internal anchors

  1. [1]

    URLhttps://huggingface.co/laion/CLIP-convnext_large_d_ 320.laion2B-s29B-b131K-ft-soup

    Clip convnext-large-d 320: Laion-2b s29b b131k fine-tuned (soup) – hugging face, Septem- ber 2023. URLhttps://huggingface.co/laion/CLIP-convnext_large_d_ 320.laion2B-s29B-b131K-ft-soup

  2. [2]

    InIEEE Symposium on Security and Privacy, SP 2024, San Francisco, CA, USA, May 19-23, 2024

    Sifat Muhammad Abdullah, Aravind Cheruvu, Shravya Kanchi, Taejoong Chung, Peng Gao, Murtuza Jadliwala, and Bimal Viswanath. An analysis of recent advances in deepfake image detection in an evolving threat landscape. In2024 IEEE Symposium on Security and Privacy (SP), pp. 91–109, 2024. doi: 10.1109/SP54263.2024.00194

  3. [3]

    The claude 3 model family: Opus, sonnet, haiku.https://www.anthropic

    Anthropic. The claude 3 model family: Opus, sonnet, haiku.https://www.anthropic. com/claude-3-model-card, 2024. Model announced on March 4, 2024

  4. [4]

    SiT: Self-supervised vision transformer

    Sara Atito, Muhammad Awais, and Josef Kittler. SiT: Self-supervised vision transformer. arXiv preprint arXiv:2104.03602, 2021

  5. [5]

    Improving image generation with better captions, 2023

    James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, Wesam Manassra, Prafulla Dhariwal, Casey Chu, Yunxin Jiao, and Aditya Ramesh. Improving image generation with better captions, 2023

  6. [6]

    Y AKE!: Keyword extraction from single documents using multiple local features

    Ricardo Campos, V ´ıtor Mangaravite, Arian Pasquali, Al´ıpio Jorge, C´elia Nunes, and Adam Jatowt. Y AKE!: Keyword extraction from single documents using multiple local features. Information Sciences, 509:257–289, 2020

  7. [7]

    Pixart-$\alpha$: Fast training of diffusion transformer for photorealistic text-to-image synthesis

    Junsong Chen, Jincheng YU, Chongjian GE, Lewei Yao, Enze Xie, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-$\alpha$: Fast training of diffusion transformer for photorealistic text-to-image synthesis. InThe Twelfth International Conference on Learning Representations (ICLR), 2024

  8. [8]

    X2-DFD: A framework for explainable and extendable deepfake detection.arXiv preprint arXiv:2410.06126, 2024

    Yize Chen, Zhiyuan Yan, Siwei Lyu, and Baoyuan Wu. X2-DFD: A framework for explainable and extendable deepfake detection.arXiv preprint arXiv:2410.06126, 2024

  9. [9]

    Gonzalez, Ion Stoica, and Eric P

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality.https://lmsys. org/blog/2023-03-30-vicuna/, March 2023

  10. [10]

    Deep reinforcement learning from human preferences

    Paul F Christiano, Jan Leike, Tom B Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. InAdvances in Neural Information Processing Systems, volume 30, pp. 4299–4307, 2017

  11. [11]

    Process Reinforcement through Implicit Rewards

    Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, Jiarui Yuan, Huayu Chen, Kaiyan Zhang, Xingtai Lv, Shuo Wang, Yuan Yao, Xu Han, Hao Peng, Yu Cheng, Zhiyuan Liu, Maosong Sun, Bowen Zhou, and Ning Ding. PROCESS REINFORCEMENT THROUGH IMPLICIT REW ARDS.arXiv preprint arXiv:2502.01456, 2025. ...

  12. [12]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. InInternational Conference on Learning Representations (ICLR), 2021

  13. [13]

    Gemma 3 technical report, 2025

    Gemma Team et al. Gemma 3 technical report, 2025

  14. [14]

    A hitchhiker’s guide to fine- grained face forgery detection using common sense reasoning

    Niki M Foteinopoulou, Enjie Ghorbel, and Djamila Aouada. A hitchhiker’s guide to fine- grained face forgery detection using common sense reasoning. InAdvances in Neural Infor- mation Processing Systems, volume 37, pp. 2943–2976, 2025

  15. [15]

    Leveraging frequency analysis for deep fake image recognition

    Joel Frank, Thorsten Eisenhofer, Lea Sch¨onherr, Asja Fischer, Dorothea Kolossa, and Thorsten Holz. Leveraging frequency analysis for deep fake image recognition. InProceedings of the 37th International Conference on Machine Learning, pp. 3247–3258. PMLR, November 2020

  16. [16]

    Gemini: A family of highly capable multimodal models, 2023

    Gemini Team. Gemini: A family of highly capable multimodal models, 2023. 10 Preprint

  17. [17]

    Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio

    Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks, June 2014

  18. [18]

    Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025

  19. [19]

    Ffaa: Multimodal large language model based explainable open-world face forgery analysis assistant.arXiv preprint arXiv:2408.10072, 2024

    Haomian He, Xuelin Zhao, Yuhang Gao, Zhengchao Huang, and Bin Xia. Ffaa: Multimodal large language model based explainable open-world face forgery analysis assistant.arXiv preprint arXiv:2408.10072, 2024

  20. [20]

    Classifier-free diffusion guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. InNeurIPS 2021 Workshop on Deep Generative Models and Causal Reasoning, 2022

  21. [21]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, volume 33, pp. 6840–6851, 2020

  22. [22]

    Video diffusion models.arXiv preprint arXiv:2204.03682, 2022

    Jonathan Ho, Niki Kalchbrenner, Robert Weichwald, Andreas Weiskopf, Prafulla Dhariwal, Ajay Jain, Christian K ¨uttler, and Tim Salimans. Video diffusion models.arXiv preprint arXiv:2204.03682, 2022

  23. [23]

    SIDA: social media image deepfake detection, local- ization and explanation with large multimodal model

    Zhenglin Huang, Jinwei Hu, Xiangtai Li, Yiwei He, Xingyu Zhao, Bei Peng, Baoyuan Wu, Xiaowei Huang, and Guangliang Cheng. SIDA: social media image deepfake detection, local- ization and explanation with large multimodal model. InIEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR) 2025, 2025

  24. [24]

    Image-to-image translation with conditional adversarial networks

    Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. InProceedings of the IEEE conference on computer vision and pattern recognition, pp. 112–120, 2017

  25. [25]

    Focal frequency loss for image reconstruction and synthesis

    Liming Jiang, Bo Dai, Wayne Wu, and Chen Change Loy. Focal frequency loss for image reconstruction and synthesis. In2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 13899–13909, Montreal, QC, Canada, October 2021. IEEE. ISBN 978-1-6654- 2812-5

  26. [26]

    A style-based generator architecture for generative adversarial networks, March 2019

    Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks, March 2019

  27. [27]

    Alias-Free Generative Adversarial Networks

    Tero Karras, Miika Aittala, Samuli Laine, Erik H ¨ark¨onen, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Alias-Free Generative Adversarial Networks. InAdvances in Neural Informa- tion Processing Systems, volume 34, pp. 852–863, 2021

  28. [28]

    Lee, Ian P

    Jan Kietzmann, Linda W. Lee, Ian P. McCarthy, and Tim C. Kietzmann. Deepfakes: Trick or treat?Business Horizons, 63(2):135–146, March 2020

  29. [29]

    Gwanhyeong Koo, Sunjae Yoon, Ji Woo Hong, and Chang D. Yoo. Flexiedit: Frequency-aware latent refinement for enhanced non-rigid editing, July 2024

  30. [30]

    Deepfakes: a new threat to face recognition? assess- ment and detection, December 2018

    Pavel Korshunov and Sebastien Marcel. Deepfakes: a new threat to face recognition? assess- ment and detection, December 2018

  31. [31]

    Freqblender: Enhancing deepfake detection by blending frequency knowledge, May 2024

    Hanzhe Li, Yuezun Li, Jiaran Zhou, Bin Li, and Junyu Dong. Freqblender: Enhancing deepfake detection by blending frequency knowledge, May 2024

  32. [32]

    Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. InProceedings of the 39th International Conference on Machine Learning (ICML), pp. 12888–12900. PMLR, June 2022

  33. [33]

    BLIP-2: Bootstrapping language- image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language- image pre-training with frozen image encoders and large language models. InProceedings of the 40th International Conference on Machine Learning (ICML), volume 202 ofProceedings of Machine Learning Research, pp. 19730–19742. PMLR, 23–29 Jul 2023. 11 Preprint

  34. [34]

    Fakebench: Probing ex- plainable fake image detection via large multimodal models.arXiv preprint arXiv:2404.13306, 2024

    Yixuan Li, Xuelin Liu, Xiaoyang Wang, Shiqi Wang, and Weisi Lin. Fakebench: Probing ex- plainable fake image detection via large multimodal models.arXiv preprint arXiv:2404.13306, 2024

  35. [35]

    Celeb-df: A large-scale challeng- ing dataset for deepfake forensics

    Yuezun Li, Xin Yang, Pu Sun, Honggang Qi, and Siwei Lyu. Celeb-df: A large-scale challeng- ing dataset for deepfake forensics. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020

  36. [36]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In Advances in Neural Information Processing Systems, volume 36, 2023

  37. [37]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 286–295, 2024

  38. [38]

    A convnet for the 2020s

    Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11976–11986, 2022

  39. [39]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InInternational Conference on Learning Representations (ICLR), 2019. URLhttps://openreview. net/forum?id=Bkg6RiCqY7

  40. [40]

    Benchmarking human and model perception of ai-generated images

    Zeyu Lu, Di Huang, Jingjing Qu, Chengyue Wu, and Wanli Ouyang. Benchmarking human and model perception of ai-generated images. InAdvances in Neural Information Processing Systems, volume 36, 2023

  41. [41]

    The llama 4 herd: The beginning of a new era of natively multimodal intelli- gence.https://ai.meta.com/blog/llama-4-multimodal-intelligence/, April 2025

    Meta AI. The llama 4 herd: The beginning of a new era of natively multimodal intelli- gence.https://ai.meta.com/blog/llama-4-multimodal-intelligence/, April 2025

  42. [42]

    The creation and detection of deepfakes: A survey, September 2020

    Yisroel Mirsky and Wenke Lee. The creation and detection of deepfakes: A survey, September 2020

  43. [43]

    Pixtral 12B

    Mistral AI Team. Pixtral 12b.arXiv preprint arXiv:2410.07073, 2024. URLhttps:// arxiv.org/abs/2410.07073

  44. [44]

    Thanh Thi Nguyen, Quoc Viet Hung Nguyen, Dung Tien Nguyen, Duc Thanh Nguyen, Thien Huynh-The, Saeid Nahavandi, Thanh Tam Nguyen, Quoc-Viet Pham, and Cuong M. Nguyen. Deep learning for deepfakes creation and detection: A survey, August 2022

  45. [45]

    Ai-synthesized faces are indistinguishable from real faces and more trustworthy.Proceedings of the National Academy of Sciences, 119(8): e2120481119, 2022

    Sophie J Nightingale and Hany Farid. Ai-synthesized faces are indistinguishable from real faces and more trustworthy.Proceedings of the National Academy of Sciences, 119(8): e2120481119, 2022

  46. [46]

    Towards universal fake image detectors that gen- eralize across generative models

    Utkarsh Ojha, Yuheng Li, and Yong Jae Lee. Towards universal fake image detectors that gen- eralize across generative models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 24480–24489, June 2023

  47. [47]

    Gpt-4 technical report, 2024

    OpenAI. Gpt-4 technical report, 2024

  48. [48]

    Hello GPT-4o.https://openai.com/index/hello-gpt-4o/, May 2024

    OpenAI. Hello GPT-4o.https://openai.com/index/hello-gpt-4o/, May 2024

  49. [49]

    Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke E

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke E. Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Francis Christiano, Jan Leike, and Ryan J. Lowe. Training language models to follow instructions wit...

  50. [50]

    Styleclip: Text-driven manipulation of stylegan imagery

    Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. Styleclip: Text-driven manipulation of stylegan imagery. In2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 2065–2074, Montreal, QC, Canada, October 2021. IEEE. ISBN 978-1-6654-2812-5. 12 Preprint

  51. [51]

    Sdxl: Improving latent diffusion models for high-resolution image synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas M ¨uller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. InThe Twelfth International Conference on Learning Representations (ICLR), 2024

  52. [52]

    Qwen2.5 Technical Report

    Qwen Team. Qwen2.5 Technical Report.arXiv preprint arXiv:2412.15115, 2024. URL https://arxiv.org/abs/2412.15115

  53. [53]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agar- wal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. InPro- ceedings of the 38th International Conference on Machine Learning (ICML), pp. 8748–8763. PM...

  54. [54]

    Manning, and Chelsea Finn

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model, 2023

  55. [55]

    Deep reinforcement learning- based image captioning with embedding reward

    Zhou Ren, Xiaoyu Wang, Ning Zhang, Xutao Lv, and Li-Jia Li. Deep reinforcement learning- based image captioning with embedding reward. InThe IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1151–1159, 2017

  56. [56]

    Towards the detection of dif- fusion model deepfakes

    Jonas Ricker, Simon Damm, Thorsten Holz, and Asja Fischer. Towards the detection of dif- fusion model deepfakes. InInternational Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP), pp. 446–457, 01 2024

  57. [57]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10684–10695, June 2022

  58. [58]

    FaceForensics++: Learning to Detect Manipulated Facial Images

    Andreas Rossler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Niessner. FaceForensics++: Learning to Detect Manipulated Facial Images . In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 1–11, Los Alami- tos, CA, USA, November 2019. IEEE Computer Society

  59. [59]

    Progressive distillation for fast sampling of diffusion models

    Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. InInternational Conference on Learning Representations, 2022

  60. [60]

    Proximal policy optimization algorithms, 2017

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017

  61. [61]

    De-fake: Detection and attribution of fake images generated by text-to-image generation models

    Zeyang Sha, Zheng Li, Ning Yu, and Yang Zhang. De-fake: Detection and attribution of fake images generated by text-to-image generation models. InProceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security (CCS), pp. 3418–3432, New York, NY , USA, 2023. Association for Computing Machinery

  62. [62]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y .K. Li, Y . Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024. URLhttps://arxiv.org/abs/2402.03300

  63. [63]

    Hybridflow: A flexible and efficient rlhf framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. InPro- ceedings of the 20th European Conference on Computer Systems, EuroSys ’25, pp. 303–319, New York, NY , USA, 2025. Association for Computing Machinery. ISBN 9798400707742

  64. [64]

    Denoising diffusion implicit models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations (ICLR), 2021

  65. [65]

    Score-based generative modeling through stochastic differential equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021. 13 Preprint

  66. [66]

    Frequency-aware deepfake detection: Improving generalizability through frequency space learning, March 2024

    Chuangchuang Tan, Yao Zhao, Shikui Wei, Guanghua Gu, Ping Liu, and Yunchao Wei. Frequency-aware deepfake detection: Improving generalizability through frequency space learning, March 2024

  67. [67]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Tim- oth´ee Lacroix, Baptiste Rozi `ere, Naman Goyal, Eric Hambro, Faisal Azhar, Aur ´elien Ro- driguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023

  68. [68]

    Peiyi Wang, Lei Li, Zhihong Shao, R. X. Xu, Damai Dai, Yifei Li, Deli Chen, Y . Wu, and Zhifang Sui. MATH-SHEPHERD: Verify and reinforce LLMs step-by-step without human annotations. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), pp. 9426–9439. Association for Computational Linguis- tics,...

  69. [69]

    Visionary-r1: Mitigating shortcuts in visual reasoning with reinforcement learning, 2025

    Jiaer Xia, Yuhang Zang, Peng Gao, Yixuan Li, and Kaiyang Zhou. Visionary-r1: Mitigating shortcuts in visual reasoning with reinforcement learning, 2025

  70. [70]

    Fakeshield: Explainable image forgery detection and localization via multi-modal large language models

    Zhipei Xu, Xuanyu Zhang, Runyi Li, Zecheng Tang, Qing Huang, and Jian Zhang. Fakeshield: Explainable image forgery detection and localization via multi-modal large language models. InThe Thirteenth International Conference on Learning Representations, 2025

  71. [71]

    Df40: Toward next-generation deepfake detection

    Zhiyuan Yan, Taiping Yao, Shen Chen, Yandan Zhao, Xinghe Fu, Junwei Zhu, Donghao Luo, Chengjie Wang, Shouhong Ding, Yunsheng Wu, and Li Yuan. Df40: Toward next-generation deepfake detection. InAdvances in Neural Information Processing Systems, 2024

  72. [72]

    Generative adversarial networks for medical image synthesis: A review.Medical Image Analysis, 51:1–18, 2019

    Xun Yi, Esther Walia, and Mohammed Babar. Generative adversarial networks for medical image synthesis: A review.Medical Image Analysis, 51:1–18, 2019

  73. [73]

    Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah D. Goodman. Star: Bootstrapping reasoning with reasoning. InAdvances in Neural Information Processing Systems, volume 35, pp. 39114– 39129, 2022

  74. [74]

    Sc-captioner: Improving image captioning with self-correction by reinforcement learning, 2025

    Lin Zhang, Xianfang Zeng, Kangcong Li, Gang Yu, and Tao Chen. Sc-captioner: Improving image captioning with self-correction by reinforcement learning, 2025

  75. [75]

    Improve vision language model chain-of-thought reasoning

    Ruohong Zhang, Bowen Zhang, Yanghao Li, Haotian Zhang, Zhiqing Sun, Zhe Gan, Yinfei Yang, Ruoming Pang, and Yiming Yang. Improve vision language model chain-of-thought reasoning. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL), 2025

  76. [76]

    Common sense rea- soning for deepfake detection

    Yue Zhang, Ben Colman, Xiao Guo, Ali Shahriyari, and Gaurav Bharaj. Common sense rea- soning for deepfake detection. InEuropean Conference on Computer Vision (ECCV), 2024

  77. [77]

    Learning to reason without external rewards, 2025

    Xuandong Zhao, Zhewei Kang, Aosong Feng, Sergey Levine, and Dawn Song. Learning to reason without external rewards, 2025

  78. [78]

    Ttrl: Test-time reinforcement learning, 2025

    Yuxin Zuo, Kaiyan Zhang, Li Sheng, Shang Qu, Ganqu Cui, Xuekai Zhu, Haozhan Li, Yuchen Zhang, Xinwei Long, Ermo Hua, Biqing Qi, Youbang Sun, Zhiyuan Ma, Lifan Yuan, Ning Ding, and Bowen Zhou. Ttrl: Test-time reinforcement learning, 2025. A APPENDIX A.1 THEUSE OFLARGELANGUAGEMODELS(LLMS) In this work, we acknowledge the use of ChatGPT (GPT-5) for writing a...

  79. [79]

    Combine all{K}x4={4 *K}features across these models into a single unified list

  80. [80]

    Eliminate duplicate or overlapping features to ensure clarity and uniqueness

Showing first 80 references.