arxiv: 2509.26272 · v3 · submitted 2025-09-30 · 💻 cs.CV · cs.LG

PRPO: Paragraph-level Policy Optimization for Vision-Language Deepfake Detection

Tuan Nguyen , Naseem Khan , Khang Tran , NHatHai Phan , Issa Khalil This is my paper

Pith reviewed 2026-05-18 12:30 UTC · model grok-4.3

classification 💻 cs.CV cs.LG

keywords deepfake detectionvision-language modelsreinforcement learningpolicy optimizationmultimodal reasoningsynthetic mediaLLM alignment

0 comments

The pith

Paragraph-level policy optimization aligns LLM reasoning with visual evidence to improve deepfake detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Paragraph-level Relative Policy Optimization to train vision-language models on deepfake detection. It builds a dataset with reasoning annotations and applies reinforcement learning so that model explanations receive rewards when they match image content at the paragraph level. A sympathetic reader would care because current multimodal models often produce explanations that ignore or contradict the visual evidence, which undermines their usefulness for verifying synthetic media. If the central claim holds, the result is detection systems that are both more accurate and more interpretable in practice.

Core claim

The central claim is that optimizing reinforcement learning policies at the paragraph level, using reward signals derived from alignment with visual evidence, produces models whose reasoning stays grounded in the actual image content and therefore detects deepfakes more reliably than earlier approaches.

What carries the argument

Paragraph-level Relative Policy Optimization (PRPO) is the reinforcement learning algorithm that supplies paragraph-level reward signals to align multimodal LLM outputs with image content.

If this is right

Deepfake detection accuracy rises when reasoning is optimized at the paragraph level with visual rewards.
The method outperforms GRPO under test-time conditions.
Explanations become more consistent with image content and less prone to misalignment.
Multimodal models gain the capacity to produce reliable, grounded outputs for safety-critical verification tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same paragraph-level reward structure could be tested on other vision-language tasks that require explanations to match visual details.
Finer-grained alignment during RL training may offer a route to lower hallucination rates in multimodal models more broadly.
Applying the technique to real-time streams of social-media images would reveal whether the gains hold outside laboratory datasets.

Load-bearing premise

Paragraph-level reward signals derived from visual evidence will produce stable generalization to unseen deepfakes without the annotations introducing dataset-specific biases or the RL process amplifying subtle misalignments.

What would settle it

Evaluating the trained model on a fresh deepfake dataset that uses generation methods absent from the training distribution and checking whether detection accuracy and explanation alignment both remain high or drop sharply.

Figures

Figures reproduced from arXiv: 2509.26272 by Issa Khalil, Khang Tran, Naseem Khan, NHatHai Phan, Tuan Nguyen.

**Figure 2.** Figure 2: Three-step pipeline for generating high-quality reasoning annotations in DF-R5. The [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Proposed DX-LLaVA, a LLaVA fine-tuning framework for [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Distribution of deepfake detection features by category in the DF-R5 dataset (total of [PITH_FULL_IMAGE:figures/full_fig_p015_4.png] view at source ↗

**Figure 5.** Figure 5: Prompt for generating a comprehensive set of visual cues to identify deepfake facial [PITH_FULL_IMAGE:figures/full_fig_p017_5.png] view at source ↗

**Figure 6.** Figure 6: Prompt for consolidating 4 × K (e.g., 4 × 50) forensic-relevant features into a unified and non-redundant list, used across GPT-4o, Gemini 2.5, Qwen 2.5, and LLaMA 4. A.4 PROMPT FOR QUALITATIVE EVALUATION We provide the full prompt used to evaluate the quality of reasoning responses from different models in [PITH_FULL_IMAGE:figures/full_fig_p018_6.png] view at source ↗

**Figure 7.** Figure 7: Prompt for systematic scoring of k = 74 forensic-relevant characteristics in deepfake images, requiring per-feature evaluation and a final overall decision. A.5 DETAILED COMPUTATION OF PREDICTION CONSISTENCY REWARD (PCR) The Prediction Consistency Reward is computed through paragraph-level evidence scoring. Each paragraph is analyzed using dictionaries of real terms R, fake terms F, and negation terms N to… view at source ↗

**Figure 8.** Figure 8: Prompt for grouping real-indicative features into logical categories with reasoning. Each [PITH_FULL_IMAGE:figures/full_fig_p021_8.png] view at source ↗

**Figure 9.** Figure 9: Prompt provided to evaluators for scoring deepfake detection responses on a 0-5 scale [PITH_FULL_IMAGE:figures/full_fig_p022_9.png] view at source ↗

read the original abstract

The rapid rise of synthetic media has made deepfake detection a critical challenge for online safety and trust. Progress remains constrained by the scarcity of large, high-quality datasets. Although multimodal large language models (LLMs) exhibit strong reasoning capabilities, their performance on deepfake detection is poor, often producing explanations that are misaligned with visual evidence or hallucinatory. To address this limitation, we introduce a reasoning-annotated dataset for deepfake detection and propose Paragraph-level Relative Policy Optimization (PRPO), a reinforcement learning algorithm that aligns LLM reasoning with image content at the paragraph level. Experiments show that PRPO improves detection accuracy by a wide margin and achieves the highest reasoning score of 4.55/5.0. Ablation studies further demonstrate that PRPO significantly outperforms GRPO under test-time conditions. These results underscore the importance of grounding multimodal reasoning in visual evidence to enable more reliable and interpretable deepfake detection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PRPO adds paragraph-level RL to ground multimodal reasoning on deepfakes but the abstract gives too little data to judge whether the reported gains are solid.

read the letter

The main point is that the authors created a reasoning-annotated dataset for deepfake detection and trained with Paragraph-level Relative Policy Optimization to force LLM explanations to stay consistent with the image evidence at the paragraph scale. They claim this lifts detection accuracy by a wide margin, reaches a 4.55/5 reasoning score, and beats GRPO at test time. That is the core new piece: moving the reward signal from tokens or sentences up to whole paragraphs so longer chains of reasoning stay grounded rather than drifting into hallucination. The dataset itself is also new, built specifically to tie each reasoning paragraph to visible artifacts instead of generic captions. Both moves address a real gap in current multimodal LLM work on this task. The ablation that shows PRPO outperforming GRPO under test-time conditions is the part that actually tests whether the paragraph granularity helps. If the full paper has clean controls and reproducible numbers, that comparison could be useful for anyone adapting RL to explanatory vision-language tasks. The soft spots sit in the missing details. The abstract mentions positive outcomes and ablations but supplies no dataset sizes, baseline scores, statistical tests, or implementation choices, so the size of the accuracy jump cannot be checked. The stress-test concern about annotation biases holds up on what is shown: if the paragraph annotations contain subtle image-text mismatches or generator-specific patterns, the RL objective can amplify them instead of correcting them. No inter-annotator numbers, no cross-generator hold-out, and no reward-variance analysis appear in the provided text, which leaves the generalization claim unsupported. This paper is for people building detection tools for platforms or studying reward design for grounded reasoning in vision-language models. A reader who wants concrete ideas for paragraph-scale alignment might extract something usable even if the experiments need tightening. I would send it for peer review. The application is important and the method direction is clear enough that referees can push for the missing quantitative checks rather than desk-rejecting outright.

Referee Report

3 major / 2 minor

Summary. The paper introduces a reasoning-annotated dataset for deepfake detection and proposes Paragraph-level Relative Policy Optimization (PRPO), a reinforcement learning algorithm that aligns multimodal LLM reasoning with image content at the paragraph level. It claims that PRPO yields substantial gains in detection accuracy, achieves the highest reasoning score of 4.55/5.0, and outperforms GRPO in ablation studies under test-time conditions.

Significance. If the empirical claims hold after proper validation, the work could advance interpretable deepfake detection by grounding LLM outputs in visual evidence, addressing common issues of misalignment and hallucination in vision-language models for safety-critical applications.

major comments (3)

[Abstract] Abstract: the central claims of 'wide margin' accuracy improvement and a 4.55/5.0 reasoning score are presented without any baseline numbers, dataset sizes, statistical tests, or implementation details, preventing verification of the reported gains.
[Ablation studies] Ablation studies: the assertion that PRPO significantly outperforms GRPO under test-time conditions lacks reported variance, cross-dataset hold-out results, or analysis of reward variance across paragraph boundaries, leaving generalization unsupported.
[Method] Method: paragraph-level reward signals are derived from visual evidence and reasoning annotations, yet no quantitative check on annotation fidelity or potential dataset-specific biases is provided, raising the risk that the RL objective amplifies rather than corrects subtle misalignments.

minor comments (2)

[Method] Notation for the PRPO objective and the paragraph-level reward function should be defined more explicitly with equations to improve reproducibility.
[Experiments] Figure captions and table headers would benefit from clearer labeling of metrics (e.g., accuracy vs. reasoning score) and experimental conditions.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each major comment point by point below, indicating planned revisions to improve clarity, verifiability, and rigor where appropriate.

read point-by-point responses

Referee: [Abstract] Abstract: the central claims of 'wide margin' accuracy improvement and a 4.55/5.0 reasoning score are presented without any baseline numbers, dataset sizes, statistical tests, or implementation details, preventing verification of the reported gains.

Authors: We agree that the abstract would benefit from greater specificity to support immediate verification. The full manuscript provides baseline comparisons, the scale of the reasoning-annotated dataset, and details on the evaluation protocol including repeated runs in Sections 4 and 5. We will revise the abstract to incorporate key quantitative highlights and references to the experimental setup while preserving brevity. revision: yes
Referee: [Ablation studies] Ablation studies: the assertion that PRPO significantly outperforms GRPO under test-time conditions lacks reported variance, cross-dataset hold-out results, or analysis of reward variance across paragraph boundaries, leaving generalization unsupported.

Authors: We acknowledge that additional statistical and generalization details would strengthen the ablation claims. The current results report mean improvements; we will add standard deviations from multiple runs, cross-dataset hold-out evaluations, and an analysis of reward variance across paragraph boundaries in the revised ablation studies section. revision: yes
Referee: [Method] Method: paragraph-level reward signals are derived from visual evidence and reasoning annotations, yet no quantitative check on annotation fidelity or potential dataset-specific biases is provided, raising the risk that the RL objective amplifies rather than corrects subtle misalignments.

Authors: This concern about annotation quality and potential biases is well-taken. The manuscript describes the derivation of paragraph-level rewards from visual evidence and annotations in Section 3. To directly address fidelity and bias risks, we will add quantitative checks such as inter-annotator agreement metrics and alignment verification in the revised method section. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical RL training procedure with independent experimental validation

full rationale

The paper introduces a new reasoning-annotated dataset and applies a standard reinforcement learning algorithm (PRPO) to align multimodal LLM outputs with visual evidence for deepfake detection. All performance claims (accuracy gains, 4.55/5 reasoning score, outperformance vs GRPO) are presented as results of training and ablation experiments rather than as derivations or predictions that reduce to the method's own fitted quantities by construction. No equations, uniqueness theorems, or self-citations are used to justify the core improvement; the work is self-contained as an empirical procedure whose validity rests on held-out test performance rather than internal redefinition of inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The proposal rests on standard reinforcement-learning assumptions about reward design and policy optimization; no free parameters, ad-hoc axioms, or new physical entities are described in the abstract.

axioms (1)

domain assumption Reinforcement learning with paragraph-level rewards can align model-generated reasoning to visual evidence without introducing new hallucinations.
Implicit foundation for claiming PRPO improves both accuracy and reasoning quality.

pith-pipeline@v0.9.0 · 5695 in / 1184 out tokens · 52690 ms · 2026-05-18T12:30:27.223050+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

PRPO maximizes the log-probabilities of paragraphs weighted by their own relative advantage... L_PRPO(θ) = E... min(π_θ/π_old A, clip...)
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean J_uniquely_calibrated_via_higher_derivative unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Visual Consistency Reward (VCR) ... R_VCR(p) = 1/2 [sim(s, CLIP(x)) + 1]

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

81 extracted references · 81 canonical work pages · 5 internal anchors

[1]

URLhttps://huggingface.co/laion/CLIP-convnext_large_d_ 320.laion2B-s29B-b131K-ft-soup

Clip convnext-large-d 320: Laion-2b s29b b131k fine-tuned (soup) – hugging face, Septem- ber 2023. URLhttps://huggingface.co/laion/CLIP-convnext_large_d_ 320.laion2B-s29B-b131K-ft-soup

work page 2023
[2]

InIEEE Symposium on Security and Privacy, SP 2024, San Francisco, CA, USA, May 19-23, 2024

Sifat Muhammad Abdullah, Aravind Cheruvu, Shravya Kanchi, Taejoong Chung, Peng Gao, Murtuza Jadliwala, and Bimal Viswanath. An analysis of recent advances in deepfake image detection in an evolving threat landscape. In2024 IEEE Symposium on Security and Privacy (SP), pp. 91–109, 2024. doi: 10.1109/SP54263.2024.00194

work page doi:10.1109/sp54263.2024.00194 2024
[3]

The claude 3 model family: Opus, sonnet, haiku.https://www.anthropic

Anthropic. The claude 3 model family: Opus, sonnet, haiku.https://www.anthropic. com/claude-3-model-card, 2024. Model announced on March 4, 2024

work page 2024
[4]

SiT: Self-supervised vision transformer

Sara Atito, Muhammad Awais, and Josef Kittler. SiT: Self-supervised vision transformer. arXiv preprint arXiv:2104.03602, 2021

work page arXiv 2021
[5]

Improving image generation with better captions, 2023

James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, Wesam Manassra, Prafulla Dhariwal, Casey Chu, Yunxin Jiao, and Aditya Ramesh. Improving image generation with better captions, 2023

work page 2023
[6]

Y AKE!: Keyword extraction from single documents using multiple local features

Ricardo Campos, V ´ıtor Mangaravite, Arian Pasquali, Al´ıpio Jorge, C´elia Nunes, and Adam Jatowt. Y AKE!: Keyword extraction from single documents using multiple local features. Information Sciences, 509:257–289, 2020

work page 2020
[7]

Pixart-$\alpha$: Fast training of diffusion transformer for photorealistic text-to-image synthesis

Junsong Chen, Jincheng YU, Chongjian GE, Lewei Yao, Enze Xie, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-$\alpha$: Fast training of diffusion transformer for photorealistic text-to-image synthesis. InThe Twelfth International Conference on Learning Representations (ICLR), 2024

work page 2024
[8]

X2-DFD: A framework for explainable and extendable deepfake detection.arXiv preprint arXiv:2410.06126, 2024

Yize Chen, Zhiyuan Yan, Siwei Lyu, and Baoyuan Wu. X2-DFD: A framework for explainable and extendable deepfake detection.arXiv preprint arXiv:2410.06126, 2024

work page arXiv 2024
[9]

Gonzalez, Ion Stoica, and Eric P

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality.https://lmsys. org/blog/2023-03-30-vicuna/, March 2023

work page 2023
[10]

Deep reinforcement learning from human preferences

Paul F Christiano, Jan Leike, Tom B Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. InAdvances in Neural Information Processing Systems, volume 30, pp. 4299–4307, 2017

work page 2017
[11]

Process Reinforcement through Implicit Rewards

Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, Jiarui Yuan, Huayu Chen, Kaiyan Zhang, Xingtai Lv, Shuo Wang, Yuan Yao, Xu Han, Hao Peng, Yu Cheng, Zhiyuan Liu, Maosong Sun, Bowen Zhou, and Ning Ding. PROCESS REINFORCEMENT THROUGH IMPLICIT REW ARDS.arXiv preprint arXiv:2502.01456, 2025. ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[12]

An image is worth 16x16 words: Transformers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. InInternational Conference on Learning Representations (ICLR), 2021

work page 2021
[13]

Gemma 3 technical report, 2025

Gemma Team et al. Gemma 3 technical report, 2025

work page 2025
[14]

A hitchhiker’s guide to fine- grained face forgery detection using common sense reasoning

Niki M Foteinopoulou, Enjie Ghorbel, and Djamila Aouada. A hitchhiker’s guide to fine- grained face forgery detection using common sense reasoning. InAdvances in Neural Infor- mation Processing Systems, volume 37, pp. 2943–2976, 2025

work page 2025
[15]

Leveraging frequency analysis for deep fake image recognition

Joel Frank, Thorsten Eisenhofer, Lea Sch¨onherr, Asja Fischer, Dorothea Kolossa, and Thorsten Holz. Leveraging frequency analysis for deep fake image recognition. InProceedings of the 37th International Conference on Machine Learning, pp. 3247–3258. PMLR, November 2020

work page 2020
[16]

Gemini: A family of highly capable multimodal models, 2023

Gemini Team. Gemini: A family of highly capable multimodal models, 2023. 10 Preprint

work page 2023
[17]

Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio

Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks, June 2014

work page 2014
[18]

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025

work page 2025
[19]

Ffaa: Multimodal large language model based explainable open-world face forgery analysis assistant.arXiv preprint arXiv:2408.10072, 2024

Haomian He, Xuelin Zhao, Yuhang Gao, Zhengchao Huang, and Bin Xia. Ffaa: Multimodal large language model based explainable open-world face forgery analysis assistant.arXiv preprint arXiv:2408.10072, 2024

work page arXiv 2024
[20]

Classifier-free diffusion guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. InNeurIPS 2021 Workshop on Deep Generative Models and Causal Reasoning, 2022

work page 2021
[21]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, volume 33, pp. 6840–6851, 2020

work page 2020
[22]

Video diffusion models.arXiv preprint arXiv:2204.03682, 2022

Jonathan Ho, Niki Kalchbrenner, Robert Weichwald, Andreas Weiskopf, Prafulla Dhariwal, Ajay Jain, Christian K ¨uttler, and Tim Salimans. Video diffusion models.arXiv preprint arXiv:2204.03682, 2022

work page arXiv 2022
[23]

SIDA: social media image deepfake detection, local- ization and explanation with large multimodal model

Zhenglin Huang, Jinwei Hu, Xiangtai Li, Yiwei He, Xingyu Zhao, Bei Peng, Baoyuan Wu, Xiaowei Huang, and Guangliang Cheng. SIDA: social media image deepfake detection, local- ization and explanation with large multimodal model. InIEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR) 2025, 2025

work page 2025
[24]

Image-to-image translation with conditional adversarial networks

Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. InProceedings of the IEEE conference on computer vision and pattern recognition, pp. 112–120, 2017

work page 2017
[25]

Focal frequency loss for image reconstruction and synthesis

Liming Jiang, Bo Dai, Wayne Wu, and Chen Change Loy. Focal frequency loss for image reconstruction and synthesis. In2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 13899–13909, Montreal, QC, Canada, October 2021. IEEE. ISBN 978-1-6654- 2812-5

work page 2021
[26]

A style-based generator architecture for generative adversarial networks, March 2019

Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks, March 2019

work page 2019
[27]

Alias-Free Generative Adversarial Networks

Tero Karras, Miika Aittala, Samuli Laine, Erik H ¨ark¨onen, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Alias-Free Generative Adversarial Networks. InAdvances in Neural Informa- tion Processing Systems, volume 34, pp. 852–863, 2021

work page 2021
[28]

Lee, Ian P

Jan Kietzmann, Linda W. Lee, Ian P. McCarthy, and Tim C. Kietzmann. Deepfakes: Trick or treat?Business Horizons, 63(2):135–146, March 2020

work page 2020
[29]

Gwanhyeong Koo, Sunjae Yoon, Ji Woo Hong, and Chang D. Yoo. Flexiedit: Frequency-aware latent refinement for enhanced non-rigid editing, July 2024

work page 2024
[30]

Deepfakes: a new threat to face recognition? assess- ment and detection, December 2018

Pavel Korshunov and Sebastien Marcel. Deepfakes: a new threat to face recognition? assess- ment and detection, December 2018

work page 2018
[31]

Freqblender: Enhancing deepfake detection by blending frequency knowledge, May 2024

Hanzhe Li, Yuezun Li, Jiaran Zhou, Bin Li, and Junyu Dong. Freqblender: Enhancing deepfake detection by blending frequency knowledge, May 2024

work page 2024
[32]

Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. InProceedings of the 39th International Conference on Machine Learning (ICML), pp. 12888–12900. PMLR, June 2022

work page 2022
[33]

BLIP-2: Bootstrapping language- image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language- image pre-training with frozen image encoders and large language models. InProceedings of the 40th International Conference on Machine Learning (ICML), volume 202 ofProceedings of Machine Learning Research, pp. 19730–19742. PMLR, 23–29 Jul 2023. 11 Preprint

work page 2023
[34]

Fakebench: Probing ex- plainable fake image detection via large multimodal models.arXiv preprint arXiv:2404.13306, 2024

Yixuan Li, Xuelin Liu, Xiaoyang Wang, Shiqi Wang, and Weisi Lin. Fakebench: Probing ex- plainable fake image detection via large multimodal models.arXiv preprint arXiv:2404.13306, 2024

work page arXiv 2024
[35]

Celeb-df: A large-scale challeng- ing dataset for deepfake forensics

Yuezun Li, Xin Yang, Pu Sun, Honggang Qi, and Siwei Lyu. Celeb-df: A large-scale challeng- ing dataset for deepfake forensics. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020

work page 2020
[36]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In Advances in Neural Information Processing Systems, volume 36, 2023

work page 2023
[37]

Improved baselines with visual instruction tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 286–295, 2024

work page 2024
[38]

A convnet for the 2020s

Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11976–11986, 2022

work page 2022
[39]

Decoupled weight decay regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InInternational Conference on Learning Representations (ICLR), 2019. URLhttps://openreview. net/forum?id=Bkg6RiCqY7

work page 2019
[40]

Benchmarking human and model perception of ai-generated images

Zeyu Lu, Di Huang, Jingjing Qu, Chengyue Wu, and Wanli Ouyang. Benchmarking human and model perception of ai-generated images. InAdvances in Neural Information Processing Systems, volume 36, 2023

work page 2023
[41]

The llama 4 herd: The beginning of a new era of natively multimodal intelli- gence.https://ai.meta.com/blog/llama-4-multimodal-intelligence/, April 2025

Meta AI. The llama 4 herd: The beginning of a new era of natively multimodal intelli- gence.https://ai.meta.com/blog/llama-4-multimodal-intelligence/, April 2025

work page 2025
[42]

The creation and detection of deepfakes: A survey, September 2020

Yisroel Mirsky and Wenke Lee. The creation and detection of deepfakes: A survey, September 2020

work page 2020
[43]

Pixtral 12B

Mistral AI Team. Pixtral 12b.arXiv preprint arXiv:2410.07073, 2024. URLhttps:// arxiv.org/abs/2410.07073

work page internal anchor Pith review Pith/arXiv arXiv 2024
[44]

Thanh Thi Nguyen, Quoc Viet Hung Nguyen, Dung Tien Nguyen, Duc Thanh Nguyen, Thien Huynh-The, Saeid Nahavandi, Thanh Tam Nguyen, Quoc-Viet Pham, and Cuong M. Nguyen. Deep learning for deepfakes creation and detection: A survey, August 2022

work page 2022
[45]

Ai-synthesized faces are indistinguishable from real faces and more trustworthy.Proceedings of the National Academy of Sciences, 119(8): e2120481119, 2022

Sophie J Nightingale and Hany Farid. Ai-synthesized faces are indistinguishable from real faces and more trustworthy.Proceedings of the National Academy of Sciences, 119(8): e2120481119, 2022

work page 2022
[46]

Towards universal fake image detectors that gen- eralize across generative models

Utkarsh Ojha, Yuheng Li, and Yong Jae Lee. Towards universal fake image detectors that gen- eralize across generative models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 24480–24489, June 2023

work page 2023
[47]

Gpt-4 technical report, 2024

OpenAI. Gpt-4 technical report, 2024

work page 2024
[48]

Hello GPT-4o.https://openai.com/index/hello-gpt-4o/, May 2024

OpenAI. Hello GPT-4o.https://openai.com/index/hello-gpt-4o/, May 2024

work page 2024
[49]

Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke E

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke E. Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Francis Christiano, Jan Leike, and Ryan J. Lowe. Training language models to follow instructions wit...

work page 2022
[50]

Styleclip: Text-driven manipulation of stylegan imagery

Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. Styleclip: Text-driven manipulation of stylegan imagery. In2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 2065–2074, Montreal, QC, Canada, October 2021. IEEE. ISBN 978-1-6654-2812-5. 12 Preprint

work page 2065
[51]

Sdxl: Improving latent diffusion models for high-resolution image synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas M ¨uller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. InThe Twelfth International Conference on Learning Representations (ICLR), 2024

work page 2024
[52]

Qwen2.5 Technical Report

Qwen Team. Qwen2.5 Technical Report.arXiv preprint arXiv:2412.15115, 2024. URL https://arxiv.org/abs/2412.15115

work page internal anchor Pith review Pith/arXiv arXiv 2024
[53]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agar- wal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. InPro- ceedings of the 38th International Conference on Machine Learning (ICML), pp. 8748–8763. PM...

work page 2021
[54]

Manning, and Chelsea Finn

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model, 2023

work page 2023
[55]

Deep reinforcement learning- based image captioning with embedding reward

Zhou Ren, Xiaoyu Wang, Ning Zhang, Xutao Lv, and Li-Jia Li. Deep reinforcement learning- based image captioning with embedding reward. InThe IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1151–1159, 2017

work page 2017
[56]

Towards the detection of dif- fusion model deepfakes

Jonas Ricker, Simon Damm, Thorsten Holz, and Asja Fischer. Towards the detection of dif- fusion model deepfakes. InInternational Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP), pp. 446–457, 01 2024

work page 2024
[57]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10684–10695, June 2022

work page 2022
[58]

FaceForensics++: Learning to Detect Manipulated Facial Images

Andreas Rossler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Niessner. FaceForensics++: Learning to Detect Manipulated Facial Images . In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 1–11, Los Alami- tos, CA, USA, November 2019. IEEE Computer Society

work page 2019
[59]

Progressive distillation for fast sampling of diffusion models

Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. InInternational Conference on Learning Representations, 2022

work page 2022
[60]

Proximal policy optimization algorithms, 2017

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017

work page 2017
[61]

De-fake: Detection and attribution of fake images generated by text-to-image generation models

Zeyang Sha, Zheng Li, Ning Yu, and Yang Zhang. De-fake: Detection and attribution of fake images generated by text-to-image generation models. InProceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security (CCS), pp. 3418–3432, New York, NY , USA, 2023. Association for Computing Machinery

work page 2023
[62]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y .K. Li, Y . Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024. URLhttps://arxiv.org/abs/2402.03300

work page internal anchor Pith review Pith/arXiv arXiv 2024
[63]

Hybridflow: A flexible and efficient rlhf framework

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. InPro- ceedings of the 20th European Conference on Computer Systems, EuroSys ’25, pp. 303–319, New York, NY , USA, 2025. Association for Computing Machinery. ISBN 9798400707742

work page 2025
[64]

Denoising diffusion implicit models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations (ICLR), 2021

work page 2021
[65]

Score-based generative modeling through stochastic differential equations

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021. 13 Preprint

work page 2021
[66]

Frequency-aware deepfake detection: Improving generalizability through frequency space learning, March 2024

Chuangchuang Tan, Yao Zhao, Shikui Wei, Guanghua Gu, Ping Liu, and Yunchao Wei. Frequency-aware deepfake detection: Improving generalizability through frequency space learning, March 2024

work page 2024
[67]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Tim- oth´ee Lacroix, Baptiste Rozi `ere, Naman Goyal, Eric Hambro, Faisal Azhar, Aur ´elien Ro- driguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[68]

Peiyi Wang, Lei Li, Zhihong Shao, R. X. Xu, Damai Dai, Yifei Li, Deli Chen, Y . Wu, and Zhifang Sui. MATH-SHEPHERD: Verify and reinforce LLMs step-by-step without human annotations. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), pp. 9426–9439. Association for Computational Linguis- tics,...

work page 2024
[69]

Visionary-r1: Mitigating shortcuts in visual reasoning with reinforcement learning, 2025

Jiaer Xia, Yuhang Zang, Peng Gao, Yixuan Li, and Kaiyang Zhou. Visionary-r1: Mitigating shortcuts in visual reasoning with reinforcement learning, 2025

work page 2025
[70]

Fakeshield: Explainable image forgery detection and localization via multi-modal large language models

Zhipei Xu, Xuanyu Zhang, Runyi Li, Zecheng Tang, Qing Huang, and Jian Zhang. Fakeshield: Explainable image forgery detection and localization via multi-modal large language models. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025
[71]

Df40: Toward next-generation deepfake detection

Zhiyuan Yan, Taiping Yao, Shen Chen, Yandan Zhao, Xinghe Fu, Junwei Zhu, Donghao Luo, Chengjie Wang, Shouhong Ding, Yunsheng Wu, and Li Yuan. Df40: Toward next-generation deepfake detection. InAdvances in Neural Information Processing Systems, 2024

work page 2024
[72]

Generative adversarial networks for medical image synthesis: A review.Medical Image Analysis, 51:1–18, 2019

Xun Yi, Esther Walia, and Mohammed Babar. Generative adversarial networks for medical image synthesis: A review.Medical Image Analysis, 51:1–18, 2019

work page 2019
[73]

Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah D. Goodman. Star: Bootstrapping reasoning with reasoning. InAdvances in Neural Information Processing Systems, volume 35, pp. 39114– 39129, 2022

work page 2022
[74]

Sc-captioner: Improving image captioning with self-correction by reinforcement learning, 2025

Lin Zhang, Xianfang Zeng, Kangcong Li, Gang Yu, and Tao Chen. Sc-captioner: Improving image captioning with self-correction by reinforcement learning, 2025

work page 2025
[75]

Improve vision language model chain-of-thought reasoning

Ruohong Zhang, Bowen Zhang, Yanghao Li, Haotian Zhang, Zhiqing Sun, Zhe Gan, Yinfei Yang, Ruoming Pang, and Yiming Yang. Improve vision language model chain-of-thought reasoning. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL), 2025

work page 2025
[76]

Common sense rea- soning for deepfake detection

Yue Zhang, Ben Colman, Xiao Guo, Ali Shahriyari, and Gaurav Bharaj. Common sense rea- soning for deepfake detection. InEuropean Conference on Computer Vision (ECCV), 2024

work page 2024
[77]

Learning to reason without external rewards, 2025

Xuandong Zhao, Zhewei Kang, Aosong Feng, Sergey Levine, and Dawn Song. Learning to reason without external rewards, 2025

work page 2025
[78]

Ttrl: Test-time reinforcement learning, 2025

Yuxin Zuo, Kaiyan Zhang, Li Sheng, Shang Qu, Ganqu Cui, Xuekai Zhu, Haozhan Li, Yuchen Zhang, Xinwei Long, Ermo Hua, Biqing Qi, Youbang Sun, Zhiyuan Ma, Lifan Yuan, Ning Ding, and Bowen Zhou. Ttrl: Test-time reinforcement learning, 2025. A APPENDIX A.1 THEUSE OFLARGELANGUAGEMODELS(LLMS) In this work, we acknowledge the use of ChatGPT (GPT-5) for writing a...

work page 2025
[79]

Combine all{K}x4={4 *K}features across these models into a single unified list

work page
[80]

Eliminate duplicate or overlapping features to ensure clarity and uniqueness

work page

Showing first 80 references.