PRPO: Paragraph-level Policy Optimization for Vision-Language Deepfake Detection
Pith reviewed 2026-05-18 12:30 UTC · model grok-4.3
The pith
Paragraph-level policy optimization aligns LLM reasoning with visual evidence to improve deepfake detection.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that optimizing reinforcement learning policies at the paragraph level, using reward signals derived from alignment with visual evidence, produces models whose reasoning stays grounded in the actual image content and therefore detects deepfakes more reliably than earlier approaches.
What carries the argument
Paragraph-level Relative Policy Optimization (PRPO) is the reinforcement learning algorithm that supplies paragraph-level reward signals to align multimodal LLM outputs with image content.
If this is right
- Deepfake detection accuracy rises when reasoning is optimized at the paragraph level with visual rewards.
- The method outperforms GRPO under test-time conditions.
- Explanations become more consistent with image content and less prone to misalignment.
- Multimodal models gain the capacity to produce reliable, grounded outputs for safety-critical verification tasks.
Where Pith is reading between the lines
- The same paragraph-level reward structure could be tested on other vision-language tasks that require explanations to match visual details.
- Finer-grained alignment during RL training may offer a route to lower hallucination rates in multimodal models more broadly.
- Applying the technique to real-time streams of social-media images would reveal whether the gains hold outside laboratory datasets.
Load-bearing premise
Paragraph-level reward signals derived from visual evidence will produce stable generalization to unseen deepfakes without the annotations introducing dataset-specific biases or the RL process amplifying subtle misalignments.
What would settle it
Evaluating the trained model on a fresh deepfake dataset that uses generation methods absent from the training distribution and checking whether detection accuracy and explanation alignment both remain high or drop sharply.
Figures
read the original abstract
The rapid rise of synthetic media has made deepfake detection a critical challenge for online safety and trust. Progress remains constrained by the scarcity of large, high-quality datasets. Although multimodal large language models (LLMs) exhibit strong reasoning capabilities, their performance on deepfake detection is poor, often producing explanations that are misaligned with visual evidence or hallucinatory. To address this limitation, we introduce a reasoning-annotated dataset for deepfake detection and propose Paragraph-level Relative Policy Optimization (PRPO), a reinforcement learning algorithm that aligns LLM reasoning with image content at the paragraph level. Experiments show that PRPO improves detection accuracy by a wide margin and achieves the highest reasoning score of 4.55/5.0. Ablation studies further demonstrate that PRPO significantly outperforms GRPO under test-time conditions. These results underscore the importance of grounding multimodal reasoning in visual evidence to enable more reliable and interpretable deepfake detection.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a reasoning-annotated dataset for deepfake detection and proposes Paragraph-level Relative Policy Optimization (PRPO), a reinforcement learning algorithm that aligns multimodal LLM reasoning with image content at the paragraph level. It claims that PRPO yields substantial gains in detection accuracy, achieves the highest reasoning score of 4.55/5.0, and outperforms GRPO in ablation studies under test-time conditions.
Significance. If the empirical claims hold after proper validation, the work could advance interpretable deepfake detection by grounding LLM outputs in visual evidence, addressing common issues of misalignment and hallucination in vision-language models for safety-critical applications.
major comments (3)
- [Abstract] Abstract: the central claims of 'wide margin' accuracy improvement and a 4.55/5.0 reasoning score are presented without any baseline numbers, dataset sizes, statistical tests, or implementation details, preventing verification of the reported gains.
- [Ablation studies] Ablation studies: the assertion that PRPO significantly outperforms GRPO under test-time conditions lacks reported variance, cross-dataset hold-out results, or analysis of reward variance across paragraph boundaries, leaving generalization unsupported.
- [Method] Method: paragraph-level reward signals are derived from visual evidence and reasoning annotations, yet no quantitative check on annotation fidelity or potential dataset-specific biases is provided, raising the risk that the RL objective amplifies rather than corrects subtle misalignments.
minor comments (2)
- [Method] Notation for the PRPO objective and the paragraph-level reward function should be defined more explicitly with equations to improve reproducibility.
- [Experiments] Figure captions and table headers would benefit from clearer labeling of metrics (e.g., accuracy vs. reasoning score) and experimental conditions.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We address each major comment point by point below, indicating planned revisions to improve clarity, verifiability, and rigor where appropriate.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claims of 'wide margin' accuracy improvement and a 4.55/5.0 reasoning score are presented without any baseline numbers, dataset sizes, statistical tests, or implementation details, preventing verification of the reported gains.
Authors: We agree that the abstract would benefit from greater specificity to support immediate verification. The full manuscript provides baseline comparisons, the scale of the reasoning-annotated dataset, and details on the evaluation protocol including repeated runs in Sections 4 and 5. We will revise the abstract to incorporate key quantitative highlights and references to the experimental setup while preserving brevity. revision: yes
-
Referee: [Ablation studies] Ablation studies: the assertion that PRPO significantly outperforms GRPO under test-time conditions lacks reported variance, cross-dataset hold-out results, or analysis of reward variance across paragraph boundaries, leaving generalization unsupported.
Authors: We acknowledge that additional statistical and generalization details would strengthen the ablation claims. The current results report mean improvements; we will add standard deviations from multiple runs, cross-dataset hold-out evaluations, and an analysis of reward variance across paragraph boundaries in the revised ablation studies section. revision: yes
-
Referee: [Method] Method: paragraph-level reward signals are derived from visual evidence and reasoning annotations, yet no quantitative check on annotation fidelity or potential dataset-specific biases is provided, raising the risk that the RL objective amplifies rather than corrects subtle misalignments.
Authors: This concern about annotation quality and potential biases is well-taken. The manuscript describes the derivation of paragraph-level rewards from visual evidence and annotations in Section 3. To directly address fidelity and bias risks, we will add quantitative checks such as inter-annotator agreement metrics and alignment verification in the revised method section. revision: yes
Circularity Check
No circularity: empirical RL training procedure with independent experimental validation
full rationale
The paper introduces a new reasoning-annotated dataset and applies a standard reinforcement learning algorithm (PRPO) to align multimodal LLM outputs with visual evidence for deepfake detection. All performance claims (accuracy gains, 4.55/5 reasoning score, outperformance vs GRPO) are presented as results of training and ablation experiments rather than as derivations or predictions that reduce to the method's own fitted quantities by construction. No equations, uniqueness theorems, or self-citations are used to justify the core improvement; the work is self-contained as an empirical procedure whose validity rests on held-out test performance rather than internal redefinition of inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Reinforcement learning with paragraph-level rewards can align model-generated reasoning to visual evidence without introducing new hallucinations.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
PRPO maximizes the log-probabilities of paragraphs weighted by their own relative advantage... L_PRPO(θ) = E... min(π_θ/π_old A, clip...)
-
IndisputableMonolith/Foundation/AlphaCoordinateFixation.leanJ_uniquely_calibrated_via_higher_derivative unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Visual Consistency Reward (VCR) ... R_VCR(p) = 1/2 [sim(s, CLIP(x)) + 1]
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
URLhttps://huggingface.co/laion/CLIP-convnext_large_d_ 320.laion2B-s29B-b131K-ft-soup
Clip convnext-large-d 320: Laion-2b s29b b131k fine-tuned (soup) – hugging face, Septem- ber 2023. URLhttps://huggingface.co/laion/CLIP-convnext_large_d_ 320.laion2B-s29B-b131K-ft-soup
work page 2023
-
[2]
InIEEE Symposium on Security and Privacy, SP 2024, San Francisco, CA, USA, May 19-23, 2024
Sifat Muhammad Abdullah, Aravind Cheruvu, Shravya Kanchi, Taejoong Chung, Peng Gao, Murtuza Jadliwala, and Bimal Viswanath. An analysis of recent advances in deepfake image detection in an evolving threat landscape. In2024 IEEE Symposium on Security and Privacy (SP), pp. 91–109, 2024. doi: 10.1109/SP54263.2024.00194
-
[3]
The claude 3 model family: Opus, sonnet, haiku.https://www.anthropic
Anthropic. The claude 3 model family: Opus, sonnet, haiku.https://www.anthropic. com/claude-3-model-card, 2024. Model announced on March 4, 2024
work page 2024
-
[4]
SiT: Self-supervised vision transformer
Sara Atito, Muhammad Awais, and Josef Kittler. SiT: Self-supervised vision transformer. arXiv preprint arXiv:2104.03602, 2021
-
[5]
Improving image generation with better captions, 2023
James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, Wesam Manassra, Prafulla Dhariwal, Casey Chu, Yunxin Jiao, and Aditya Ramesh. Improving image generation with better captions, 2023
work page 2023
-
[6]
Y AKE!: Keyword extraction from single documents using multiple local features
Ricardo Campos, V ´ıtor Mangaravite, Arian Pasquali, Al´ıpio Jorge, C´elia Nunes, and Adam Jatowt. Y AKE!: Keyword extraction from single documents using multiple local features. Information Sciences, 509:257–289, 2020
work page 2020
-
[7]
Pixart-$\alpha$: Fast training of diffusion transformer for photorealistic text-to-image synthesis
Junsong Chen, Jincheng YU, Chongjian GE, Lewei Yao, Enze Xie, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-$\alpha$: Fast training of diffusion transformer for photorealistic text-to-image synthesis. InThe Twelfth International Conference on Learning Representations (ICLR), 2024
work page 2024
-
[8]
Yize Chen, Zhiyuan Yan, Siwei Lyu, and Baoyuan Wu. X2-DFD: A framework for explainable and extendable deepfake detection.arXiv preprint arXiv:2410.06126, 2024
-
[9]
Gonzalez, Ion Stoica, and Eric P
Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality.https://lmsys. org/blog/2023-03-30-vicuna/, March 2023
work page 2023
-
[10]
Deep reinforcement learning from human preferences
Paul F Christiano, Jan Leike, Tom B Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. InAdvances in Neural Information Processing Systems, volume 30, pp. 4299–4307, 2017
work page 2017
-
[11]
Process Reinforcement through Implicit Rewards
Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, Jiarui Yuan, Huayu Chen, Kaiyan Zhang, Xingtai Lv, Shuo Wang, Yuan Yao, Xu Han, Hao Peng, Yu Cheng, Zhiyuan Liu, Maosong Sun, Bowen Zhou, and Ning Ding. PROCESS REINFORCEMENT THROUGH IMPLICIT REW ARDS.arXiv preprint arXiv:2502.01456, 2025. ...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[12]
An image is worth 16x16 words: Transformers for image recognition at scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. InInternational Conference on Learning Representations (ICLR), 2021
work page 2021
- [13]
-
[14]
A hitchhiker’s guide to fine- grained face forgery detection using common sense reasoning
Niki M Foteinopoulou, Enjie Ghorbel, and Djamila Aouada. A hitchhiker’s guide to fine- grained face forgery detection using common sense reasoning. InAdvances in Neural Infor- mation Processing Systems, volume 37, pp. 2943–2976, 2025
work page 2025
-
[15]
Leveraging frequency analysis for deep fake image recognition
Joel Frank, Thorsten Eisenhofer, Lea Sch¨onherr, Asja Fischer, Dorothea Kolossa, and Thorsten Holz. Leveraging frequency analysis for deep fake image recognition. InProceedings of the 37th International Conference on Machine Learning, pp. 3247–3258. PMLR, November 2020
work page 2020
-
[16]
Gemini: A family of highly capable multimodal models, 2023
Gemini Team. Gemini: A family of highly capable multimodal models, 2023. 10 Preprint
work page 2023
-
[17]
Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks, June 2014
work page 2014
-
[18]
Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025
work page 2025
-
[19]
Haomian He, Xuelin Zhao, Yuhang Gao, Zhengchao Huang, and Bin Xia. Ffaa: Multimodal large language model based explainable open-world face forgery analysis assistant.arXiv preprint arXiv:2408.10072, 2024
-
[20]
Classifier-free diffusion guidance
Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. InNeurIPS 2021 Workshop on Deep Generative Models and Causal Reasoning, 2022
work page 2021
-
[21]
Denoising diffusion probabilistic models
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, volume 33, pp. 6840–6851, 2020
work page 2020
-
[22]
Video diffusion models.arXiv preprint arXiv:2204.03682, 2022
Jonathan Ho, Niki Kalchbrenner, Robert Weichwald, Andreas Weiskopf, Prafulla Dhariwal, Ajay Jain, Christian K ¨uttler, and Tim Salimans. Video diffusion models.arXiv preprint arXiv:2204.03682, 2022
-
[23]
Zhenglin Huang, Jinwei Hu, Xiangtai Li, Yiwei He, Xingyu Zhao, Bei Peng, Baoyuan Wu, Xiaowei Huang, and Guangliang Cheng. SIDA: social media image deepfake detection, local- ization and explanation with large multimodal model. InIEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR) 2025, 2025
work page 2025
-
[24]
Image-to-image translation with conditional adversarial networks
Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. InProceedings of the IEEE conference on computer vision and pattern recognition, pp. 112–120, 2017
work page 2017
-
[25]
Focal frequency loss for image reconstruction and synthesis
Liming Jiang, Bo Dai, Wayne Wu, and Chen Change Loy. Focal frequency loss for image reconstruction and synthesis. In2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 13899–13909, Montreal, QC, Canada, October 2021. IEEE. ISBN 978-1-6654- 2812-5
work page 2021
-
[26]
A style-based generator architecture for generative adversarial networks, March 2019
Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks, March 2019
work page 2019
-
[27]
Alias-Free Generative Adversarial Networks
Tero Karras, Miika Aittala, Samuli Laine, Erik H ¨ark¨onen, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Alias-Free Generative Adversarial Networks. InAdvances in Neural Informa- tion Processing Systems, volume 34, pp. 852–863, 2021
work page 2021
-
[28]
Jan Kietzmann, Linda W. Lee, Ian P. McCarthy, and Tim C. Kietzmann. Deepfakes: Trick or treat?Business Horizons, 63(2):135–146, March 2020
work page 2020
-
[29]
Gwanhyeong Koo, Sunjae Yoon, Ji Woo Hong, and Chang D. Yoo. Flexiedit: Frequency-aware latent refinement for enhanced non-rigid editing, July 2024
work page 2024
-
[30]
Deepfakes: a new threat to face recognition? assess- ment and detection, December 2018
Pavel Korshunov and Sebastien Marcel. Deepfakes: a new threat to face recognition? assess- ment and detection, December 2018
work page 2018
-
[31]
Freqblender: Enhancing deepfake detection by blending frequency knowledge, May 2024
Hanzhe Li, Yuezun Li, Jiaran Zhou, Bin Li, and Junyu Dong. Freqblender: Enhancing deepfake detection by blending frequency knowledge, May 2024
work page 2024
-
[32]
Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. InProceedings of the 39th International Conference on Machine Learning (ICML), pp. 12888–12900. PMLR, June 2022
work page 2022
-
[33]
Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language- image pre-training with frozen image encoders and large language models. InProceedings of the 40th International Conference on Machine Learning (ICML), volume 202 ofProceedings of Machine Learning Research, pp. 19730–19742. PMLR, 23–29 Jul 2023. 11 Preprint
work page 2023
-
[34]
Yixuan Li, Xuelin Liu, Xiaoyang Wang, Shiqi Wang, and Weisi Lin. Fakebench: Probing ex- plainable fake image detection via large multimodal models.arXiv preprint arXiv:2404.13306, 2024
-
[35]
Celeb-df: A large-scale challeng- ing dataset for deepfake forensics
Yuezun Li, Xin Yang, Pu Sun, Honggang Qi, and Siwei Lyu. Celeb-df: A large-scale challeng- ing dataset for deepfake forensics. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020
work page 2020
-
[36]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In Advances in Neural Information Processing Systems, volume 36, 2023
work page 2023
-
[37]
Improved baselines with visual instruction tuning
Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 286–295, 2024
work page 2024
-
[38]
Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11976–11986, 2022
work page 2022
-
[39]
Decoupled weight decay regularization
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InInternational Conference on Learning Representations (ICLR), 2019. URLhttps://openreview. net/forum?id=Bkg6RiCqY7
work page 2019
-
[40]
Benchmarking human and model perception of ai-generated images
Zeyu Lu, Di Huang, Jingjing Qu, Chengyue Wu, and Wanli Ouyang. Benchmarking human and model perception of ai-generated images. InAdvances in Neural Information Processing Systems, volume 36, 2023
work page 2023
-
[41]
Meta AI. The llama 4 herd: The beginning of a new era of natively multimodal intelli- gence.https://ai.meta.com/blog/llama-4-multimodal-intelligence/, April 2025
work page 2025
-
[42]
The creation and detection of deepfakes: A survey, September 2020
Yisroel Mirsky and Wenke Lee. The creation and detection of deepfakes: A survey, September 2020
work page 2020
-
[43]
Mistral AI Team. Pixtral 12b.arXiv preprint arXiv:2410.07073, 2024. URLhttps:// arxiv.org/abs/2410.07073
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[44]
Thanh Thi Nguyen, Quoc Viet Hung Nguyen, Dung Tien Nguyen, Duc Thanh Nguyen, Thien Huynh-The, Saeid Nahavandi, Thanh Tam Nguyen, Quoc-Viet Pham, and Cuong M. Nguyen. Deep learning for deepfakes creation and detection: A survey, August 2022
work page 2022
-
[45]
Sophie J Nightingale and Hany Farid. Ai-synthesized faces are indistinguishable from real faces and more trustworthy.Proceedings of the National Academy of Sciences, 119(8): e2120481119, 2022
work page 2022
-
[46]
Towards universal fake image detectors that gen- eralize across generative models
Utkarsh Ojha, Yuheng Li, and Yong Jae Lee. Towards universal fake image detectors that gen- eralize across generative models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 24480–24489, June 2023
work page 2023
- [47]
-
[48]
Hello GPT-4o.https://openai.com/index/hello-gpt-4o/, May 2024
OpenAI. Hello GPT-4o.https://openai.com/index/hello-gpt-4o/, May 2024
work page 2024
-
[49]
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke E. Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Francis Christiano, Jan Leike, and Ryan J. Lowe. Training language models to follow instructions wit...
work page 2022
-
[50]
Styleclip: Text-driven manipulation of stylegan imagery
Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. Styleclip: Text-driven manipulation of stylegan imagery. In2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 2065–2074, Montreal, QC, Canada, October 2021. IEEE. ISBN 978-1-6654-2812-5. 12 Preprint
work page 2065
-
[51]
Sdxl: Improving latent diffusion models for high-resolution image synthesis
Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas M ¨uller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. InThe Twelfth International Conference on Learning Representations (ICLR), 2024
work page 2024
-
[52]
Qwen Team. Qwen2.5 Technical Report.arXiv preprint arXiv:2412.15115, 2024. URL https://arxiv.org/abs/2412.15115
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[53]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agar- wal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. InPro- ceedings of the 38th International Conference on Machine Learning (ICML), pp. 8748–8763. PM...
work page 2021
-
[54]
Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model, 2023
work page 2023
-
[55]
Deep reinforcement learning- based image captioning with embedding reward
Zhou Ren, Xiaoyu Wang, Ning Zhang, Xutao Lv, and Li-Jia Li. Deep reinforcement learning- based image captioning with embedding reward. InThe IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1151–1159, 2017
work page 2017
-
[56]
Towards the detection of dif- fusion model deepfakes
Jonas Ricker, Simon Damm, Thorsten Holz, and Asja Fischer. Towards the detection of dif- fusion model deepfakes. InInternational Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP), pp. 446–457, 01 2024
work page 2024
-
[57]
High-resolution image synthesis with latent diffusion models
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10684–10695, June 2022
work page 2022
-
[58]
FaceForensics++: Learning to Detect Manipulated Facial Images
Andreas Rossler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Niessner. FaceForensics++: Learning to Detect Manipulated Facial Images . In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 1–11, Los Alami- tos, CA, USA, November 2019. IEEE Computer Society
work page 2019
-
[59]
Progressive distillation for fast sampling of diffusion models
Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. InInternational Conference on Learning Representations, 2022
work page 2022
-
[60]
Proximal policy optimization algorithms, 2017
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017
work page 2017
-
[61]
De-fake: Detection and attribution of fake images generated by text-to-image generation models
Zeyang Sha, Zheng Li, Ning Yu, and Yang Zhang. De-fake: Detection and attribution of fake images generated by text-to-image generation models. InProceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security (CCS), pp. 3418–3432, New York, NY , USA, 2023. Association for Computing Machinery
work page 2023
-
[62]
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y .K. Li, Y . Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024. URLhttps://arxiv.org/abs/2402.03300
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[63]
Hybridflow: A flexible and efficient rlhf framework
Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. InPro- ceedings of the 20th European Conference on Computer Systems, EuroSys ’25, pp. 303–319, New York, NY , USA, 2025. Association for Computing Machinery. ISBN 9798400707742
work page 2025
-
[64]
Denoising diffusion implicit models
Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations (ICLR), 2021
work page 2021
-
[65]
Score-based generative modeling through stochastic differential equations
Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021. 13 Preprint
work page 2021
-
[66]
Chuangchuang Tan, Yao Zhao, Shikui Wei, Guanghua Gu, Ping Liu, and Yunchao Wei. Frequency-aware deepfake detection: Improving generalizability through frequency space learning, March 2024
work page 2024
-
[67]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Tim- oth´ee Lacroix, Baptiste Rozi `ere, Naman Goyal, Eric Hambro, Faisal Azhar, Aur ´elien Ro- driguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[68]
Peiyi Wang, Lei Li, Zhihong Shao, R. X. Xu, Damai Dai, Yifei Li, Deli Chen, Y . Wu, and Zhifang Sui. MATH-SHEPHERD: Verify and reinforce LLMs step-by-step without human annotations. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), pp. 9426–9439. Association for Computational Linguis- tics,...
work page 2024
-
[69]
Visionary-r1: Mitigating shortcuts in visual reasoning with reinforcement learning, 2025
Jiaer Xia, Yuhang Zang, Peng Gao, Yixuan Li, and Kaiyang Zhou. Visionary-r1: Mitigating shortcuts in visual reasoning with reinforcement learning, 2025
work page 2025
-
[70]
Zhipei Xu, Xuanyu Zhang, Runyi Li, Zecheng Tang, Qing Huang, and Jian Zhang. Fakeshield: Explainable image forgery detection and localization via multi-modal large language models. InThe Thirteenth International Conference on Learning Representations, 2025
work page 2025
-
[71]
Df40: Toward next-generation deepfake detection
Zhiyuan Yan, Taiping Yao, Shen Chen, Yandan Zhao, Xinghe Fu, Junwei Zhu, Donghao Luo, Chengjie Wang, Shouhong Ding, Yunsheng Wu, and Li Yuan. Df40: Toward next-generation deepfake detection. InAdvances in Neural Information Processing Systems, 2024
work page 2024
-
[72]
Xun Yi, Esther Walia, and Mohammed Babar. Generative adversarial networks for medical image synthesis: A review.Medical Image Analysis, 51:1–18, 2019
work page 2019
-
[73]
Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah D. Goodman. Star: Bootstrapping reasoning with reasoning. InAdvances in Neural Information Processing Systems, volume 35, pp. 39114– 39129, 2022
work page 2022
-
[74]
Sc-captioner: Improving image captioning with self-correction by reinforcement learning, 2025
Lin Zhang, Xianfang Zeng, Kangcong Li, Gang Yu, and Tao Chen. Sc-captioner: Improving image captioning with self-correction by reinforcement learning, 2025
work page 2025
-
[75]
Improve vision language model chain-of-thought reasoning
Ruohong Zhang, Bowen Zhang, Yanghao Li, Haotian Zhang, Zhiqing Sun, Zhe Gan, Yinfei Yang, Ruoming Pang, and Yiming Yang. Improve vision language model chain-of-thought reasoning. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL), 2025
work page 2025
-
[76]
Common sense rea- soning for deepfake detection
Yue Zhang, Ben Colman, Xiao Guo, Ali Shahriyari, and Gaurav Bharaj. Common sense rea- soning for deepfake detection. InEuropean Conference on Computer Vision (ECCV), 2024
work page 2024
-
[77]
Learning to reason without external rewards, 2025
Xuandong Zhao, Zhewei Kang, Aosong Feng, Sergey Levine, and Dawn Song. Learning to reason without external rewards, 2025
work page 2025
-
[78]
Ttrl: Test-time reinforcement learning, 2025
Yuxin Zuo, Kaiyan Zhang, Li Sheng, Shang Qu, Ganqu Cui, Xuekai Zhu, Haozhan Li, Yuchen Zhang, Xinwei Long, Ermo Hua, Biqing Qi, Youbang Sun, Zhiyuan Ma, Lifan Yuan, Ning Ding, and Bowen Zhou. Ttrl: Test-time reinforcement learning, 2025. A APPENDIX A.1 THEUSE OFLARGELANGUAGEMODELS(LLMS) In this work, we acknowledge the use of ChatGPT (GPT-5) for writing a...
work page 2025
-
[79]
Combine all{K}x4={4 *K}features across these models into a single unified list
-
[80]
Eliminate duplicate or overlapping features to ensure clarity and uniqueness
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.