
arxiv: 2512.00778 · v2 · pith:CZF5FQS5new · submitted 2025-11-30 · 💻 cs.LG

What Is Preference Optimization Doing, and Why?

Pith reviewed 2026-05-21 18:33 UTC · model grok-4.3

classification 💻 cs.LG
keywords: preference optimization · DPO · PPO · gradient updates · positive learning · negative learning · loss reweighting · LLM alignment

The pith

DPO follows stable targets while PPO balances exploration and exploitation during preference optimization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper analyzes the optimization dynamics of direct preference optimization (DPO) and proximal policy optimization (PPO), two methods used to align large language models with preferences. It shows that DPO adheres to stable gradient update targets, whereas PPO balances exploration and exploitation. The study breaks down the effects of positive learning, negative learning, and loss reweighting: in DPO, the first two jointly shape the targets and reweighting regularizes against overfitting, whereas in PPO negative learning aids exploration and reweighting differentiates token groups. These distinctions clarify why the methods behave differently and suggest ways to improve their use for model alignment.

Core claim

Examining the target directions of gradient-based updates reveals that DPO follows stable targets, whereas PPO balances exploration and exploitation. In DPO, positive and negative learning jointly shape the targets while loss reweighting acts more as a regularizer to mitigate overfitting. In PPO, negative learning primarily supports exploration and loss reweighting indicates distinct roles of token groups in updating targets. Carefully designed ablation studies examine how controlling these dynamics impacts optimization efficiency and practical performance.

What carries the argument

Gradient-based update target directions and the differentiated roles of positive learning, negative learning, and loss reweighting in shaping optimization behavior.
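For reference, the update direction that this decomposition operates on can be written out for DPO. The block below is the standard DPO objective and gradient from Rafailov et al. [5], not the paper's own notation. Reading the three factors as "loss reweighting", "positive learning", and "negative learning" is an editorial assumption; the paper's exact definitions of its components may differ.

```latex
\mathcal{L}_{\mathrm{dpo}}(\theta)
  = -\,\mathbb{E}_{(x,y_w,y_l)}\Big[\log\sigma\big(\hat r_\theta(x,y_w)-\hat r_\theta(x,y_l)\big)\Big],
\qquad
\hat r_\theta(x,y)=\beta\log\frac{\pi_\theta(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)}

\nabla_\theta\mathcal{L}_{\mathrm{dpo}}(\theta)
  = -\,\beta\,\mathbb{E}_{(x,y_w,y_l)}\Big[
      \underbrace{\sigma\big(\hat r_\theta(x,y_l)-\hat r_\theta(x,y_w)\big)}_{\text{weight}}
      \Big(\underbrace{\nabla_\theta\log\pi_\theta(y_w\mid x)}_{\text{positive}}
      -\underbrace{\nabla_\theta\log\pi_\theta(y_l\mid x)}_{\text{negative}}\Big)\Big]
```

Here $y_w$ is the preferred response and $y_l$ the rejected one. The weight is large when the implicit reward ranks the pair incorrectly and shrinks toward zero once the pair is well separated.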

If this is right

  • These dynamics explain the distinct behaviors of DPO and PPO in practice.
  • Ablation studies demonstrate that adjusting these components affects how efficiently models optimize and perform on alignment tasks.
  • The findings provide a basis for developing improved preference optimization techniques for LLMs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Designers of new alignment algorithms could selectively incorporate stable targeting from DPO with targeted exploration from PPO.
  • The component analysis might apply to other preference optimization variants not examined here.
  • Empirical tests could measure how these roles manifest in training trajectories of real models.

Load-bearing premise

Differences in gradient update targets and the specific roles played by positive and negative learning plus loss reweighting are the main drivers of DPO versus PPO behaviors, isolated cleanly by the ablation studies without major confounding from other factors.

What would settle it

Observing that altering loss reweighting in DPO does not affect overfitting rates as expected, or that PPO negative learning does not enhance exploration in isolation, would indicate the roles are not as described.

Figures

Figures reproduced from arXiv: 2512.00778 by Bo Han, Gang Niu, Masashi Sugiyama, Qizhou Wang, Yue Wang, Zizhuo Zhang.

Figure 1: DPO Learning Dynamics. For the Pythia-2.8B model trained on UltraFeedback and tested on HH-RLHF-helpfulness, we show the dynamics of $G$ measured per 1000 training steps: (a) the overall objective $L_{\mathrm{dpo}}$ (TOT); (b) the positive $L^{+}_{\mathrm{dpo}}$ (POS) and negative $L^{-}_{\mathrm{dpo}}$ (NEG) components; and (c) the weighted top $L^{\uparrow}_{\mathrm{dpo}}$ (TOP), middle $L^{\rightarrow}_{\mathrm{dpo}}$ (MID), and bottom $L^{\downarrow}_{\mathrm{dpo}}$ (BOT) components. The log scale is used for $G$ due to…
Figure 2: PPO Learning Dynamics. For the Pythia-2.8B model trained on UltraFeedback and tested on HH-RLHF-helpfulness, we show the dynamics of $G$ measured per 400 training steps: (a) the overall objective $L_{\mathrm{ppo}}$ (TOT); (b) the positive $L^{+}_{\mathrm{ppo}}$ (POS) and negative $L^{-}_{\mathrm{ppo}}$ (NEG) components; and (c) the weighted top $L^{\uparrow}_{\mathrm{ppo}}$ (TOP), middle $L^{\rightarrow}_{\mathrm{ppo}}$ (MID), and bottom $L^{\downarrow}_{\mathrm{ppo}}$ (BOT) components. The log scale is used for $G$ to alig…
Figure 3: Average (Raw) Advantages during PPO for top (TOP), middle (MID), and bottom (BOT) weighted data. When it comes to loss reweighting, as shown in …
Figure 4: Performance under Ablation. For the Pythia-2.8B model trained on UltraFeedback and tested on HH-RLHF-helpfulness, we show performance measured by Win Rate under: (a) DPO ablations removing negative learning (w/o NEG) before 3000 steps and positive learning (w/o POS) after 6000 steps, (b) PPO ablations removing top (w/o TOP) and middle (w/o MID) weighted data, (c) cDPO that emphasizes positive learning earl…
Figure 5: Gradient Magnitudes. For the Pythia-2.8B model trained on UltraFeedback and tested on HH-RLHF-helpfulness, we illustrate the distributions of gradient magnitudes computed with respect to mini-batches for DPO and PPO, across training steps. Normal data points are colored in blue, while outliers detected by IQR are colored in red. Moreover, to align with the common practice of gradient clipping, we excluded …
Figure 6: Gradient Dynamics With and Without Outlier Filtering. For the Pythia-2.8B model trained on UltraFeedback and tested on HH-RLHF-helpfulness, we present the learning dynamics of $G$ for (a) DPO and (b) PPO without excluding batches with extreme gradient magnitudes, in contrast to the main results with outlier filtering shown in (c) and (d), as reported in the main text.
Figure 7: DPO Gradient Dynamics. For the Pythia-2.8B model trained on UltraFeedback and tested on HH-RLHF-helpfulness, we report (a) gradient magnitudes and (b) gradient alignments for DPO. Here, $D'$ is built from the training dataset, so we focus on in-distribution rather than out-of-distribution responses as in …
Figure 8: DPO Gradient Dynamics. For the Pythia-2.8B model trained on UltraFeedback and tested on HH-RLHF-helpfulness, we compare DPO with and without SFT, reporting gradient magnitudes (a) with SFT, (b) without SFT, and (c) gradient dynamics without SFT.
Figure 9: Layer-wise Learning Dynamics. For the Pythia-2.8B model trained on UltraFeedback and tested on HH-RLHF-helpfulness, we report the layer-wise dynamics for (a) DPO and (b) PPO. We show the corresponding results in …
Figure 10: Qwen Learning Dynamics. For the Qwen3-1.7B model trained on UltraFeedback and tested on HH-RLHF-helpfulness, we show the learning dynamics of $G$ for DPO and PPO, covering the overall objectives ((a) for DPO and (d) for PPO), the positive and negative components ((b) for DPO and (e) for PPO), and each weighted component ((c) for DPO and (f) for PPO), respectively. Pythia-2.8B. As observed, Qwen3-1.7B exhibi…
Figure 11: Performance under Ablation. For the Qwen3-1.7B model trained on UltraFeedback and tested on HH-RLHF-helpfulness, we show performance measured by Win Rate under: (a) DPO ablations removing negative learning (w/o NEG) and positive learning (w/o POS), (b) PPO ablations removing top (w/o TOP) and middle (w/o MID) weighted data, (c) cDPO that emphasizes positive learning early and negative learning later, wher…
Figure 12: DPO vs. cDPO. For the Pythia-2.8B model trained on UltraFeedback and tested on HH-RLHF-helpfulness, we compare DPO with cDPO across hyper-parameters. Finally, we summarize the results of cPPO, cDPO, and hPPO with standard deviations. For each method and its corresponding baseline, we used the best-performing hyperparameters and conducted five runs with different random seeds. The improvements are both not…
Figure 13: DPO vs. cDPO. For the Qwen3-1.7B model trained on UltraFeedback and tested on HH-RLHF-helpfulness, we compare DPO with cDPO across hyper-parameters.
Figure 14: PPO vs. cPPO. For the Pythia-2.8B model trained on UltraFeedback and tested on HH-RLHF-helpfulness, we compare PPO with cPPO across hyper-parameters. Both variants of cPPO are considered: one controlling the top weighted data (TOP) and the other controlling the middle weighted data (MID).
Figure 15: PPO vs. cPPO. For the Qwen3-1.7B model trained on UltraFeedback and tested on HH-RLHF-helpfulness, we compare PPO with cPPO across hyper-parameters. Both variants of cPPO are considered: one controlling the top weighted data (TOP) and the other controlling the middle weighted data (MID).
Figure 16: PPO vs. cPPO. For the Llama3-8B model trained on UltraFeedback and tested on HH-RLHF-helpfulness, we compare PPO with cPPO across hyper-parameters. Both variants of cPPO are considered: one controlling the top weighted data (TOP) and the other controlling the middle weighted data (MID).
Figure 17: PPO vs. hPPO. For the Pythia-2.8B model trained on UltraFeedback and tested on HH-RLHF-helpfulness, we compare PPO with hPPO across hyper-parameters.
Figure 18: PPO vs. hPPO. For the Qwen3-1.7B model trained on UltraFeedback and tested on HH-RLHF-helpfulness, we compare PPO with hPPO across hyper-parameters.
Figure 19: PPO vs. hPPO. For the Llama3-8B model trained on UltraFeedback and tested on HH-RLHF-helpfulness, we compare PPO with hPPO across hyper-parameters.
Original abstract

Preference optimization (PO) is indispensable for large language models (LLMs), with methods such as direct preference optimization (DPO) and proximal policy optimization (PPO) achieving great success. A common belief is that DPO is supervised learning while PPO is reinforcement learning, yet deeper analyses for the reasons underlying these differences remain lacking. To fill this gap, we analyze their optimization dynamics, revealing distinct algorithmic behaviors and comprehending their underlying causes. First, we examine the target directions of gradient-based updates and find that DPO follows stable targets, whereas PPO balances exploration and exploitation, validating the common belief yet from this new perspective. Second, we examine the roles of positive learning, negative learning, and loss reweighting, which are three key yet seldom discussed components within PO methods. Our analyses reveal that these components play fairly different roles. In DPO, positive and negative learning jointly shape the targets. However, loss reweighting in DPO acts less as a reward signal but more as a regularizer to mitigate overfitting. In PPO, negative learning primarily supports exploration rather than determining the targets. Meanwhile, loss reweighting, related to the absolute advantages, indicates the distinct roles of token groups in updating targets. Given these findings, we conduct carefully designed ablation studies to further examine how controlling these dynamics impacts optimization efficiency and practical performance. The insights gained from our analyses not only deepen the understanding of PO methods but also inspire the development of more preference-aligned LLMs.
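To make the PPO side of the abstract concrete, the sketch below splits a clipped PPO surrogate into positive-advantage and negative-advantage parts and ranks tokens by absolute advantage. It is a minimal illustration, not the authors' code: the function names, the sign-based positive/negative split, and the equal-thirds TOP/MID/BOT grouping are all assumptions made here for illustration.

```python
import math


def ppo_token_losses(logp_new, logp_old, advantages, clip_eps=0.2):
    """Per-token clipped PPO surrogate, written as a loss to minimize."""
    losses = []
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # importance ratio pi_new / pi_old
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Standard PPO-clip: take the pessimistic (smaller) objective value.
        losses.append(-min(ratio * adv, clipped * adv))
    return losses


def split_by_sign(losses, advantages):
    """Sum losses over positive-advantage and non-positive-advantage tokens.

    Assumption: "positive learning" is read as tokens with advantage > 0,
    "negative learning" as the rest.
    """
    pos = sum(l for l, a in zip(losses, advantages) if a > 0)
    neg = sum(l for l, a in zip(losses, advantages) if a <= 0)
    return pos, neg


def group_by_abs_advantage(advantages):
    """Rank token indices by |advantage| and cut into three groups.

    Assumption: equal-sized thirds; the paper's actual thresholds are not
    given in this summary.
    """
    order = sorted(range(len(advantages)),
                   key=lambda i: abs(advantages[i]), reverse=True)
    k = len(order) // 3
    return {
        "TOP": order[:k],
        "MID": order[k:len(order) - k],
        "BOT": order[len(order) - k:],
    }
```

With identical old and new log-probabilities the ratio is 1, so each token's loss reduces to the negated advantage; this makes the positive and negative sums easy to check by hand before moving to real rollouts.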

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper analyzes the optimization dynamics of preference optimization methods, focusing on DPO and PPO for LLMs. It claims that examining gradient update targets reveals DPO follows stable targets while PPO balances exploration and exploitation. It further dissects the roles of positive learning, negative learning, and loss reweighting, arguing these components have distinct functions in each method (joint target shaping and regularization in DPO; exploration support and token-group indication in PPO). These insights are validated through carefully designed ablation studies examining impacts on optimization efficiency and practical performance.

Significance. If the gradient-direction analyses and ablation results hold under controlled conditions, the work offers a mechanistic explanation for behavioral differences between DPO and PPO beyond the standard supervised-vs-reinforcement-learning framing. The explicit component-role breakdown and ablation validation provide concrete, testable distinctions that could guide hybrid or improved preference optimization algorithms. The direct inspection of update rules and the ablation studies constitute a strength in grounding the claims.

major comments (2)
  1. [§5] §5 (Ablation studies): The paper states that ablations are 'carefully designed' to examine how controlling the dynamics of positive/negative learning and loss reweighting impacts performance, yet does not report explicit matching of effective learning rates, sampling distributions, or KL coefficients across DPO and PPO variants. This leaves open the possibility that observed differences arise from unmatched implementation details rather than the claimed distinct roles of the components.
  2. [§3] §3 (Gradient target analysis): The distinction that DPO follows stable targets while PPO balances exploration/exploitation rests on inspection of update directions; however, the analysis does not quantify sensitivity to reference-policy strength or batch-size variations, which could alter the stability claim when the reference model is not fixed.
minor comments (2)
  1. Notation for loss reweighting terms is introduced without a consolidated table of symbols, making it difficult to track how absolute advantages map to token groups across equations.
  2. Figure captions for ablation results could more explicitly state which hyperparameters were held constant versus varied in each condition.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications and indicate the revisions we will make to improve the manuscript.

Point-by-point responses
  1. Referee: [§5] §5 (Ablation studies): The paper states that ablations are 'carefully designed' to examine how controlling the dynamics of positive/negative learning and loss reweighting impacts performance, yet does not report explicit matching of effective learning rates, sampling distributions, or KL coefficients across DPO and PPO variants. This leaves open the possibility that observed differences arise from unmatched implementation details rather than the claimed distinct roles of the components.

    Authors: We acknowledge the validity of this point. The ablations were performed by varying only the targeted components (positive/negative learning and reweighting) while holding other settings to the standard values reported in the original DPO and PPO implementations. However, we did not provide an explicit cross-method matching of effective learning rates, sampling distributions, or KL coefficients, as the methods differ in their core formulations. To strengthen the presentation, we will revise §5 and add an appendix with a comprehensive hyperparameter table for all variants, along with a discussion of why the observed distinctions align with the component roles rather than implementation mismatches. We will also include a limited set of additional runs with adjusted learning rates to confirm the robustness of the findings. revision: yes

  2. Referee: [§3] §3 (Gradient target analysis): The distinction that DPO follows stable targets while PPO balances exploration/exploitation rests on inspection of update directions; however, the analysis does not quantify sensitivity to reference-policy strength or batch-size variations, which could alter the stability claim when the reference model is not fixed.

    Authors: The gradient analysis in §3 is derived from the closed-form expressions of the update targets, which mathematically establish DPO's fixed target direction (dependent on the preference ratio and fixed reference policy) versus PPO's dynamic dependence on the current policy. We agree that explicit quantification of sensitivity to reference-policy strength and batch-size variations would strengthen the stability claim. We will add a dedicated paragraph in §3 discussing the theoretical dependence on the reference model coefficient and include a small-scale empirical study varying this coefficient and batch size to show that the core distinction in target stability persists. revision: partial
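The sensitivity question raised in the second exchange can be probed numerically. The sketch below computes the per-pair DPO loss and the scalar weight that multiplies the log-probability gradients, using the standard DPO loss from [5]. The function name `dpo_terms` and the scalar sequence-level inputs are assumptions for illustration; this is not the authors' analysis code.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dpo_terms(logp_w, logp_l, ref_logp_w, ref_logp_l, beta):
    """Return (loss, weight) for one preference pair.

    logp_w / logp_l: policy log-probabilities of the preferred / rejected response.
    ref_logp_w / ref_logp_l: the same quantities under the frozen reference policy.
    weight is the magnitude of d(loss)/d(logp): the loss decreases when logp_w
    rises (derivative = -weight) and when logp_l falls (derivative = +weight).
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    loss = -math.log(sigmoid(margin))
    weight = beta * sigmoid(-margin)
    return loss, weight
```

Sweeping `beta` or shifting the reference log-probabilities in this function shows how the weight changes with the reference policy's influence: when the policy equals the reference, the margin is zero and the weight is exactly `beta / 2`, independent of the individual log-probabilities.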

Circularity Check

0 steps flagged

No significant circularity; analysis rests on direct inspection of objectives and ablations

Full rationale

The paper derives its claims by inspecting the gradient update targets and component roles directly from the standard DPO and PPO loss formulations and objectives. It then validates interpretations via ablation studies that control the identified dynamics. No step reduces a claimed result or prediction to a fitted parameter from the same data, a self-definitional loop, or a load-bearing self-citation whose content is itself unverified. The central mechanistic distinctions are obtained from explicit decomposition of the update rules rather than by construction from the inputs being explained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on standard optimization and RL assumptions rather than introducing new fitted parameters or entities.

axioms (1)
  • domain assumption: Gradient directions computed from the loss functions accurately reflect the optimization targets and component contributions in DPO and PPO.
    The central analyses begin by examining these gradient targets and roles.

pith-pipeline@v0.9.0 · 5804 in / 1284 out tokens · 38061 ms · 2026-05-21T18:33:22.627665+00:00 · methodology



Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 16 internal anchors

  1. [1]

    Reinforcement learning: An introduction

    Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction. MIT press, 1998

  2. [2]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  3. [3]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024

  4. [4]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

  5. [5]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In NeurIPS, 2024

  6. [6]

    Towards effective evaluations and comparison for llm unlearning methods

    Qizhou Wang, Bo Han, Puning Yang, Jianing Zhu, Tongliang Liu, and Masashi Sugiyama. Towards effective evaluations and comparison for llm unlearning methods. In ICLR, 2025

  7. [7]

    Understanding why generalized reweighting does not improve over erm

    Runtian Zhai, Chen Dan, Zico Kolter, and Pradeep Ravikumar. Understanding why generalized reweighting does not improve over erm. arXiv preprint arXiv:2201.12293, 2022

  8. [8]

    Spurious Rewards: Rethinking Training Signals in RLVR

    Rulin Shao, Shuyue Stella Li, Rui Xin, Scott Geng, Yiping Wang, Sewoong Oh, Simon Shaolei Du, Nathan Lambert, Sewon Min, Ranjay Krishna, et al. Spurious rewards: Rethinking training signals in rlvr. arXiv preprint arXiv:2506.10947, 2025

  9. [9]

    Foundations of machine learning

    Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning. MIT press, 2018

  10. [10]

    Understanding black-box predictions via influence functions

    Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. In ICML, 2017

  11. [11]

    Learning dynamics of llm finetuning

    Yi Ren and Danica J Sutherland. Learning dynamics of llm finetuning. arXiv preprint arXiv:2407.10490, 2024

  12. [12]

    Towards a theoretical framework of out-of-distribution generalization

    Haotian Ye, Chuanlong Xie, Tianle Cai, Ruichen Li, Zhenguo Li, and Liwei Wang. Towards a theoretical framework of out-of-distribution generalization. NeurIPS, 2021

  13. [13]

    Pythia: A suite for analyzing large language models across training and scaling

    Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In ICML, 2023

  14. [14]

    Enhancing Chat Language Models by Scaling High-quality Instructional Conversations

    Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Zhi Zheng, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. Enhancing chat language models by scaling high-quality instructional conversations. arXiv preprint arXiv:2305.14233, 2023.

  15. [15]

    Ultrafeedback: Boosting language models with high-quality feedback

    Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, and Maosong Sun. Ultrafeedback: Boosting language models with high-quality feedback. In ICML, 2024

  16. [16]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022

  17. [17]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

  18. [18]

    The llama 3 herd of models

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv e-prints, pages arXiv–2407, 2024

  19. [19]

    Reinforcement learning by reward-weighted regression for operational space control

    Jan Peters and Stefan Schaal. Reinforcement learning by reward-weighted regression for operational space control. In ICML, 2007

  20. [20]

    Mm algorithms for generalized bradley-terry models

    David R Hunter. Mm algorithms for generalized bradley-terry models. The annals of statistics, 32(1):384–406, 2004

  21. [21]

    Learning dynamics of LLM finetuning

    Yi Ren and Danica J. Sutherland. Learning dynamics of LLM finetuning. In ICLR, 2025

  22. [22]

    Machine Learning in Non-Stationary Environments: Introduction to Covariate Shift Adaptation

    M. Sugiyama and M. Kawanabe. Machine Learning in Non-Stationary Environments: Introduction to Covariate Shift Adaptation. MIT Press, 2012

  23. [23]

    Importance weighting for aligning language models under deployment distribution shift

    T. Lodkaew, T. Fang, T. Ishida, and M. Sugiyama. Importance weighting for aligning language models under deployment distribution shift. Transactions on Machine Learning Research, 25 pages, 2025

  24. [24]

    Simple statistical gradient-following algorithms for connectionist reinforcement learning

    Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3):229–256, 1992

  25. [25]

    Negative Preference Optimization: From Catastrophic Collapse to Effective Unlearning

    Ruiqi Zhang, Licong Lin, Yu Bai, and Song Mei. Negative preference optimization: From catastrophic collapse to effective unlearning. arXiv preprint arXiv:2404.05868, 2024

  26. [26]

    The surprising effectiveness of negative reinforcement in llm reasoning

    Xinyu Zhu, Mengzhou Xia, Zhepei Wei, Wei-Lin Chen, Danqi Chen, and Yu Meng. The surprising effectiveness of negative reinforcement in llm reasoning. arXiv preprint arXiv:2506.01347, 2025

  27. [27]

    Rethinking llm unlearning objectives: A gradient perspective and go beyond

    Qizhou Wang, Jin Peng Zhou, Zhanke Zhou, Saebyeol Shin, Bo Han, and Kilian Q Weinberger. Rethinking llm unlearning objectives: A gradient perspective and go beyond. In ICLR, 2025

  28. [28]

    Classification with noisy labels by importance reweighting

    Tongliang Liu and Dacheng Tao. Classification with noisy labels by importance reweighting. IEEE Transactions on pattern analysis and machine intelligence, 38(3):447–461, 2015

  29. [29]

    Learning to reweight examples for robust deep learning

    Mengye Ren, Wenyuan Zeng, Bin Yang, and Raquel Urtasun. Learning to reweight examples for robust deep learning. In ICML, 2018

  30. [30]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

  31. [31]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

  32. [32]

    Simpo: Simple preference optimization with a reference-free reward

    Yu Meng, Mengzhou Xia, and Danqi Chen. Simpo: Simple preference optimization with a reference-free reward. In NeurIPS, 2024

  33. [33]

    Insights into alignment: Evaluating dpo and its variants across multiple tasks

    Amir Saeidi, Shivanshu Verma, Md Nayem Uddin, and Chitta Baral. Insights into alignment: Evaluating dpo and its variants across multiple tasks. arXiv preprint arXiv:2404.14723, 2024

  34. [34]

    KTO: Model Alignment as Prospect Theoretic Optimization

    Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. Kto: Model alignment as prospect theoretic optimization. arXiv preprint arXiv:2402.01306, 2024

  35. [35]

    Contrastive preference optimization: Pushing the boundaries of llm performance in machine translation

    Haoran Xu, Amr Sharaf, Yunmo Chen, Weiting Tan, Lingfeng Shen, Benjamin Van Durme, Kenton Murray, and Young Jin Kim. Contrastive preference optimization: Pushing the boundaries of llm performance in machine translation. arXiv preprint arXiv:2401.08417, 2024.

  36. [36]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  37. [37]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025

  38. [38]

    Reward design with language models

    Minae Kwon, Sang Michael Xie, Kalesha Bullard, and Dorsa Sadigh. Reward design with language models. arXiv preprint arXiv:2303.00001, 2023

  39. [39]

    Estimating training data influence by tracing gradient descent

    Garima Pruthi, Frederick Liu, Satyen Kale, and Mukund Sundararajan. Estimating training data influence by tracing gradient descent. NeurIPS, 2020

  40. [40]

    Unrolling sgd: Understanding factors influencing machine unlearning

    Anvith Thudi, Gabriel Deza, Varun Chandrasekaran, and Nicolas Papernot. Unrolling sgd: Understanding factors influencing machine unlearning. In EuroS&P, 2022

  41. [41]

    Zephyr: Direct Distillation of LM Alignment

    Lewis Tunstall, Edward Beeching, Nathan Lambert, Nazneen Rajani, Kashif Rasul, Younes Belkada, Shengyi Huang, Leandro von Werra, Clémentine Fourrier, Nathan Habib, et al. Zephyr: Direct distillation of lm alignment. arXiv preprint arXiv:2310.16944, 2023

  42. [42]

    Interpretable preferences via multi-objective reward modeling and mixture-of-experts

    Haoxiang Wang, Wei Xiong, Tengyang Xie, Han Zhao, and Tong Zhang. Interpretable preferences via multi-objective reward modeling and mixture-of-experts. In EMNLP, 2024

  43. [43]

    Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs

    Chris Yuhao Liu, Liang Zeng, Jiacai Liu, Rui Yan, Jujie He, Chaojie Wang, Shuicheng Yan, Yang Liu, and Yahui Zhou. Skywork-reward: Bag of tricks for reward modeling in llms. arXiv preprint arXiv:2410.18451, 2024

  44. [44]

    Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators

    Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B Hashimoto. Length-controlled alpacaeval: A simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475, 2024.

A Limitations and Further Discussions

Here, we acknowledge our limitations. First, Adam [30] or its variants are commo...