What Is Preference Optimization Doing, How and Why?
Pith reviewed 2026-05-21 18:33 UTC · model grok-4.3
The pith
DPO follows stable targets while PPO balances exploration and exploitation during preference optimization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Examining the target directions of gradient-based updates reveals that DPO follows stable targets, whereas PPO balances exploration and exploitation. In DPO, positive and negative learning jointly shape the targets while loss reweighting acts more as a regularizer to mitigate overfitting. In PPO, negative learning primarily supports exploration and loss reweighting indicates distinct roles of token groups in updating targets. Carefully designed ablation studies examine how controlling these dynamics impacts optimization efficiency and practical performance.
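For orientation, the widely cited gradient of the DPO loss from Rafailov et al. [5] is reproduced below. This is the standard form from the DPO paper and not necessarily the notation or decomposition used in the reviewed paper. Reading the bracketed difference as positive learning (on the chosen response y_w) minus negative learning (on the rejected response y_l), and the sigmoid factor as the loss reweighting term, is an interpretive assumption made here.

```latex
\nabla_\theta \mathcal{L}_{\mathrm{DPO}}(\theta)
= -\beta\, \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
\Big[
  \sigma\big(\hat r_\theta(x, y_l) - \hat r_\theta(x, y_w)\big)
  \big(
    \nabla_\theta \log \pi_\theta(y_w \mid x)
    - \nabla_\theta \log \pi_\theta(y_l \mid x)
  \big)
\Big],
\qquad
\hat r_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}.
```

Under this reading, the sigmoid factor shrinks toward zero once a pair is already well separated, which is at least compatible with the claim that reweighting behaves like a regularizer against overfitting.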
What carries the argument
Gradient-based update target directions and the differentiated roles of positive learning, negative learning, and loss reweighting in shaping optimization behavior.
If this is right
- These dynamics explain the distinct behaviors of DPO and PPO in practice.
- Ablation studies demonstrate that adjusting these components affects how efficiently models optimize and perform on alignment tasks.
- The findings provide a basis for developing improved preference optimization techniques for LLMs.
Where Pith is reading between the lines
- Designers of new alignment algorithms could selectively incorporate stable targeting from DPO with targeted exploration from PPO.
- The component analysis might apply to other preference optimization variants not examined here.
- Empirical tests could measure how these roles manifest in training trajectories of real models.
Load-bearing premise
Differences in gradient update targets and the specific roles played by positive and negative learning plus loss reweighting are the main drivers of DPO versus PPO behaviors, isolated cleanly by the ablation studies without major confounding from other factors.
What would settle it
Observing that altering loss reweighting in DPO does not affect overfitting rates as expected, or that PPO negative learning does not enhance exploration in isolation, would indicate the roles are not as described.
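One way such a test could be set up is sketched below. This is a minimal illustration under stated assumptions, not the paper's ablation code: the function name `dpo_pair_terms`, the `reweight` flag, and the default `beta` are hypothetical, and inputs are assumed to be sequence-level log-probabilities. With `reweight=False` the sigmoid-shaped loss is swapped for a linear margin loss, so the coefficient on the gradient difference stays constant.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def dpo_pair_terms(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1, reweight=True):
    """Return (loss, weight) for one preference pair.

    logp_* are sequence log-probabilities under the current policy and
    ref_logp_* under the frozen reference policy (assumed inputs).
    `weight` is the coefficient multiplying
    [grad log pi(y_w) - grad log pi(y_l)] in the descent direction.
    """
    # Implicit reward margin between the chosen (w) and rejected (l) response.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    if reweight:
        # Standard DPO: -log sigmoid(margin), whose gradient carries sigmoid(-margin).
        loss = -np.log(sigmoid(margin))
        weight = beta * sigmoid(-margin)
    else:
        # Hypothetical ablation: linear margin loss, constant gradient coefficient.
        loss = -margin
        weight = beta
    return loss, weight


if __name__ == "__main__":
    # Illustrative numbers only: an undecided pair and an already separated pair.
    for lw, ll in [(0.0, 0.0), (10.0, -10.0)]:
        print(dpo_pair_terms(lw, ll, 0.0, 0.0), dpo_pair_terms(lw, ll, 0.0, 0.0, reweight=False))
```

Tracking held-out preference accuracy for both settings over training would then show whether removing the weight changes overfitting behavior, which is the observation the paragraph above describes.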
Original abstract
Preference optimization (PO) is indispensable for large language models (LLMs), with methods such as direct preference optimization (DPO) and proximal policy optimization (PPO) achieving great success. A common belief is that DPO is supervised learning while PPO is reinforcement learning, yet deeper analyses for the reasons underlying these differences remain lacking. To fill this gap, we analyze their optimization dynamics, revealing distinct algorithmic behaviors and comprehending their underlying causes. First, we examine the target directions of gradient-based updates and find that DPO follows stable targets, whereas PPO balances exploration and exploitation, validating the common belief yet from this new perspective. Second, we examine the roles of positive learning, negative learning, and loss reweighting, which are three key yet seldom discussed components within PO methods. Our analyses reveal that these components play fairly different roles. In DPO, positive and negative learning jointly shape the targets. However, loss reweighting in DPO acts less as a reward signal but more as a regularizer to mitigate overfitting. In PPO, negative learning primarily supports exploration rather than determining the targets. Meanwhile, loss reweighting, related to the absolute advantages, indicates the distinct roles of token groups in updating targets. Given these findings, we conduct carefully designed ablation studies to further examine how controlling these dynamics impacts optimization efficiency and practical performance. The insights gained from our analyses not only deepen the understanding of PO methods but also inspire the development of more preference-aligned LLMs.
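The PPO side of the abstract can be made concrete with the standard clipped surrogate of Schulman et al. [4], split by the sign of the advantage. The sketch below is an illustration under assumptions, not the paper's analysis code: the function name `ppo_token_terms` and the returned keys are hypothetical, and treating positive-advantage tokens as positive learning, negative-advantage tokens as negative learning, and the absolute advantage as the reweighting term is one reading of the abstract.

```python
import numpy as np


def ppo_token_terms(logp_new, logp_old, advantages, clip_eps=0.2):
    """Per-token PPO clipped surrogate, split by advantage sign.

    logp_new / logp_old: token log-probabilities under the current and the
    behavior (old) policy; advantages: per-token advantage estimates.
    """
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Standard clipped surrogate: elementwise minimum of the two candidates.
    surrogate = np.minimum(ratio * advantages, clipped * advantages)

    pos = advantages > 0  # tokens whose probability the update pushes up
    neg = advantages < 0  # tokens whose probability the update pushes down
    n = advantages.size
    return {
        "loss": -surrogate.mean(),
        "positive_part": -surrogate[pos].sum() / n,
        "negative_part": -surrogate[neg].sum() / n,
        "abs_advantage": np.abs(advantages),
    }


if __name__ == "__main__":
    # Illustrative numbers only.
    adv = np.array([1.0, -2.0, 0.5, 0.0])
    logp = np.log(np.array([0.2, 0.1, 0.4, 0.3]))
    print(ppo_token_terms(logp, logp, adv))
```

Because zero-advantage tokens contribute nothing, the positive and negative parts sum to the full loss, so either part can be dropped or rescaled independently in an ablation.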
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper analyzes the optimization dynamics of preference optimization methods, focusing on DPO and PPO for LLMs. It claims that examining gradient update targets reveals DPO follows stable targets while PPO balances exploration and exploitation. It further dissects the roles of positive learning, negative learning, and loss reweighting, arguing these components have distinct functions in each method (joint target shaping and regularization in DPO; exploration support and token-group indication in PPO). These insights are validated through carefully designed ablation studies examining impacts on optimization efficiency and practical performance.
Significance. If the gradient-direction analyses and ablation results hold under controlled conditions, the work offers a mechanistic explanation for behavioral differences between DPO and PPO beyond the standard supervised-vs-reinforcement-learning framing. The explicit component-role breakdown and ablation validation provide concrete, testable distinctions that could guide hybrid or improved preference optimization algorithms. The direct inspection of update rules and the ablation studies constitute a strength in grounding the claims.
major comments (2)
- §5 (Ablation studies): The paper states that ablations are 'carefully designed' to examine how controlling the dynamics of positive/negative learning and loss reweighting impacts performance, yet does not report explicit matching of effective learning rates, sampling distributions, or KL coefficients across DPO and PPO variants. This leaves open the possibility that observed differences arise from unmatched implementation details rather than the claimed distinct roles of the components.
- §3 (Gradient target analysis): The distinction that DPO follows stable targets while PPO balances exploration/exploitation rests on inspection of update directions; however, the analysis does not quantify sensitivity to reference-policy strength or batch-size variations, which could alter the stability claim when the reference model is not fixed.
minor comments (2)
- Notation for loss reweighting terms is introduced without a consolidated table of symbols, making it difficult to track how absolute advantages map to token groups across equations.
- Figure captions for ablation results could more explicitly state which hyperparameters were held constant versus varied in each condition.
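The request to state which hyperparameters were held constant can also be checked mechanically. The sketch below is a hypothetical illustration: the `RunConfig` fields, the `ABLATED_FIELDS` set, and all numeric values are assumptions chosen for the example and are not taken from the paper. It reports any setting that differs between two runs other than the components being ablated.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunConfig:
    # All field names and defaults are hypothetical, for illustration only.
    method: str
    learning_rate: float
    kl_coef: float
    batch_size: int
    positive_learning: bool = True
    negative_learning: bool = True
    loss_reweighting: bool = True


# Components an ablation is allowed to change.
ABLATED_FIELDS = {"positive_learning", "negative_learning", "loss_reweighting"}


def unmatched_fields(base, variant):
    """List settings that differ between two runs but are not ablation targets."""
    a, b = asdict(base), asdict(variant)
    return sorted(k for k in a if a[k] != b[k] and k not in ABLATED_FIELDS)


if __name__ == "__main__":
    base = RunConfig("dpo", 5e-7, 0.1, 64)
    clean = RunConfig("dpo", 5e-7, 0.1, 64, loss_reweighting=False)
    confounded = RunConfig("dpo", 1e-6, 0.1, 64, negative_learning=False)
    print(unmatched_fields(base, clean), unmatched_fields(base, confounded))
```

An empty list means the variant differs from the baseline only in the ablated components; anything else is a candidate confounder of the kind raised in the major comment on §5.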
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications and indicate the revisions we will make to improve the manuscript.
Point-by-point responses
- Referee: §5 (Ablation studies): The paper states that ablations are 'carefully designed' to examine how controlling the dynamics of positive/negative learning and loss reweighting impacts performance, yet does not report explicit matching of effective learning rates, sampling distributions, or KL coefficients across DPO and PPO variants. This leaves open the possibility that observed differences arise from unmatched implementation details rather than the claimed distinct roles of the components.
  Authors: We acknowledge the validity of this point. The ablations were performed by varying only the targeted components (positive/negative learning and reweighting) while holding other settings to the standard values reported in the original DPO and PPO implementations. However, we did not provide an explicit cross-method matching of effective learning rates, sampling distributions, or KL coefficients, as the methods differ in their core formulations. To strengthen the presentation, we will revise §5 and add an appendix with a comprehensive hyperparameter table for all variants, along with a discussion of why the observed distinctions align with the component roles rather than implementation mismatches. We will also include a limited set of additional runs with adjusted learning rates to confirm the robustness of the findings.
  Revision: yes
- Referee: §3 (Gradient target analysis): The distinction that DPO follows stable targets while PPO balances exploration/exploitation rests on inspection of update directions; however, the analysis does not quantify sensitivity to reference-policy strength or batch-size variations, which could alter the stability claim when the reference model is not fixed.
  Authors: The gradient analysis in §3 is derived from the closed-form expressions of the update targets, which mathematically establish DPO's fixed target direction (dependent on the preference ratio and fixed reference policy) versus PPO's dynamic dependence on the current policy. We agree that explicit quantification of sensitivity to reference-policy strength and batch-size variations would strengthen the stability claim. We will add a dedicated paragraph in §3 discussing the theoretical dependence on the reference model coefficient and include a small-scale empirical study varying this coefficient and batch size to show that the core distinction in target stability persists.
  Revision: partial
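The dependence on the current policy that the authors mention is visible in the standard PPO clipped objective of Schulman et al. [4], reproduced below for reference. This is the textbook form, not the reviewed paper's closed-form target expression, which is not shown on this page.

```latex
L^{\mathrm{CLIP}}(\theta)
= \hat{\mathbb{E}}_t \Big[
  \min\big(
    r_t(\theta)\, \hat A_t,\;
    \operatorname{clip}\big(r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon\big)\, \hat A_t
  \big)
\Big],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}.
```

Here the samples are drawn from the old policy and the advantage estimates are recomputed as that policy changes, so the quantities driving the update move during training. DPO, by contrast, trains on a fixed set of preference pairs against a frozen reference policy.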
Circularity Check
No significant circularity; analysis rests on direct inspection of objectives and ablations
The paper derives its claims by inspecting the gradient update targets and component roles directly from the standard DPO and PPO loss formulations and objectives. It then validates interpretations via ablation studies that control the identified dynamics. No step reduces a claimed result or prediction to a fitted parameter from the same data, a self-definitional loop, or a load-bearing self-citation whose content is itself unverified. The central mechanistic distinctions are obtained from explicit decomposition of the update rules rather than by construction from the inputs being explained.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Gradient directions computed from the loss functions accurately reflect the optimization targets and component contributions in DPO and PPO.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Paper passage: "We examine the target directions of gradient-based updates and find that DPO follows stable targets, whereas PPO balances exploration and exploitation"
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Paper passage: "positive and negative learning jointly shape the targets... loss reweighting acts more as a regularizer"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction. MIT press, 1998
- [2] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023
- [3] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024
- [4] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017
- [5] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In NeurIPS, 2024
- [6] Qizhou Wang, Bo Han, Puning Yang, Jianing Zhu, Tongliang Liu, and Masashi Sugiyama. Towards effective evaluations and comparison for llm unlearning methods. In ICLR, 2025
- [7] Runtian Zhai, Chen Dan, Zico Kolter, and Pradeep Ravikumar. Understanding why generalized reweighting does not improve over erm. arXiv preprint arXiv:2201.12293, 2022
- [8] Rulin Shao, Shuyue Stella Li, Rui Xin, Scott Geng, Yiping Wang, Sewoong Oh, Simon Shaolei Du, Nathan Lambert, Sewon Min, Ranjay Krishna, et al. Spurious rewards: Rethinking training signals in rlvr. arXiv preprint arXiv:2506.10947, 2025
- [9] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning. MIT press, 2018
- [10] Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. In ICML, 2017
- [11] Yi Ren and Danica J Sutherland. Learning dynamics of llm finetuning. arXiv preprint arXiv:2407.10490, 2024
- [12] Haotian Ye, Chuanlong Xie, Tianle Cai, Ruichen Li, Zhenguo Li, and Liwei Wang. Towards a theoretical framework of out-of-distribution generalization. NeurIPS, 2021
- [13] Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In ICML, 2023
- [14] Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Zhi Zheng, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. Enhancing chat language models by scaling high-quality instructional conversations. arXiv preprint arXiv:2305.14233, 2023
- [15] Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, and Maosong Sun. Ultrafeedback: Boosting language models with high-quality feedback. In ICML, 2024
- [16] Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022
- [17] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025
- [18] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv e-prints, pages arXiv–2407, 2024
- [19] Jan Peters and Stefan Schaal. Reinforcement learning by reward-weighted regression for operational space control. In ICML, 2007
- [20] David R Hunter. Mm algorithms for generalized bradley-terry models. The annals of statistics, 32(1):384–406, 2004
- [21] Yi Ren and Danica J. Sutherland. Learning dynamics of LLM finetuning. In ICLR, 2025
- [22] M. Sugiyama and M. Kawanabe. Machine Learning in Non-Stationary Environments: Introduction to Covariate Shift Adaptation. MIT Press, 2012
- [23] T. Lodkaew, T. Fang, T. Ishida, and M. Sugiyama. Importance weighting for aligning language models under deployment distribution shift. Transactions on Machine Learning Research, page 25 pages, 2025
- [24] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3):229–256, 1992
- [25] Ruiqi Zhang, Licong Lin, Yu Bai, and Song Mei. Negative preference optimization: From catastrophic collapse to effective unlearning. arXiv preprint arXiv:2404.05868, 2024
- [26] Xinyu Zhu, Mengzhou Xia, Zhepei Wei, Wei-Lin Chen, Danqi Chen, and Yu Meng. The surprising effectiveness of negative reinforcement in llm reasoning. arXiv preprint arXiv:2506.01347, 2025
- [27] Qizhou Wang, Jin Peng Zhou, Zhanke Zhou, Saebyeol Shin, Bo Han, and Kilian Q Weinberger. Rethinking llm unlearning objectives: A gradient perspective and go beyond. In ICLR, 2025
- [28] Tongliang Liu and Dacheng Tao. Classification with noisy labels by importance reweighting. IEEE Transactions on pattern analysis and machine intelligence, 38(3):447–461, 2015
- [29] Mengye Ren, Wenyuan Zeng, Bin Yang, and Raquel Urtasun. Learning to reweight examples for robust deep learning. In ICML, 2018
- [30] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014
- [31] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017
- [32] Yu Meng, Mengzhou Xia, and Danqi Chen. Simpo: Simple preference optimization with a reference-free reward. In NeurIPS, 2024
- [33] Amir Saeidi, Shivanshu Verma, Md Nayem Uddin, and Chitta Baral. Insights into alignment: Evaluating dpo and its variants across multiple tasks. arXiv preprint arXiv:2404.14723, 2024
- [34] Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. Kto: Model alignment as prospect theoretic optimization. arXiv preprint arXiv:2402.01306, 2024
- [35] Haoran Xu, Amr Sharaf, Yunmo Chen, Weiting Tan, Lingfeng Shen, Benjamin Van Durme, Kenton Murray, and Young Jin Kim. Contrastive preference optimization: Pushing the boundaries of llm performance in machine translation. arXiv preprint arXiv:2401.08417, 2024
- [36] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024
- [37] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025
- [38] Minae Kwon, Sang Michael Xie, Kalesha Bullard, and Dorsa Sadigh. Reward design with language models. arXiv preprint arXiv:2303.00001, 2023
- [39] Garima Pruthi, Frederick Liu, Satyen Kale, and Mukund Sundararajan. Estimating training data influence by tracing gradient descent. NeurIPS, 2020
- [40] Anvith Thudi, Gabriel Deza, Varun Chandrasekaran, and Nicolas Papernot. Unrolling sgd: Understanding factors influencing machine unlearning. In EuroS&P, 2022
- [41] Lewis Tunstall, Edward Beeching, Nathan Lambert, Nazneen Rajani, Kashif Rasul, Younes Belkada, Shengyi Huang, Leandro von Werra, Clémentine Fourrier, Nathan Habib, et al. Zephyr: Direct distillation of lm alignment. arXiv preprint arXiv:2310.16944, 2023
- [42] Haoxiang Wang, Wei Xiong, Tengyang Xie, Han Zhao, and Tong Zhang. Interpretable preferences via multi-objective reward modeling and mixture-of-experts. In EMNLP, 2024
- [43] Chris Yuhao Liu, Liang Zeng, Jiacai Liu, Rui Yan, Jujie He, Chaojie Wang, Shuicheng Yan, Yang Liu, and Yahui Zhou. Skywork-reward: Bag of tricks for reward modeling in llms. arXiv preprint arXiv:2410.18451, 2024
- [44] Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B Hashimoto. Length-controlled alpacaeval: A simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475, 2024
A Limitations and Further Discussions
Here, we acknowledge our limitations. First, Adam [30] or its variants are commo...