What Is Preference Optimization Doing, and Why?

Bo Han; Gang Niu; Masashi Sugiyama; Qizhou Wang; Yue Wang; Zizhuo Zhang

arxiv: 2512.00778 · v2 · pith:CZF5FQS5new · submitted 2025-11-30 · 💻 cs.LG

Yue Wang, Qizhou Wang, Zizhuo Zhang, Gang Niu, Bo Han, Masashi Sugiyama

Pith reviewed 2026-05-21 18:33 UTC · model grok-4.3

classification 💻 cs.LG

keywords preference optimization · DPO · PPO · gradient updates · positive learning · negative learning · loss reweighting · LLM alignment

The pith

DPO follows stable targets while PPO balances exploration and exploitation during preference optimization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper analyzes the optimization dynamics of direct preference optimization (DPO) and proximal policy optimization (PPO), two methods used to align large language models with preferences. It shows that DPO adheres to stable gradient update targets, whereas PPO balances exploration and exploitation. The study breaks down the effects of positive learning, negative learning, and loss reweighting: in DPO, positive and negative learning jointly shape the targets and reweighting regularizes against overfitting, whereas in PPO negative learning aids exploration and reweighting differentiates token groups. These distinctions clarify why the methods behave differently and suggest ways to improve their use for model alignment.

Core claim

Examining the target directions of gradient-based updates reveals that DPO follows stable targets, whereas PPO balances exploration and exploitation. In DPO, positive and negative learning jointly shape the targets while loss reweighting acts more as a regularizer to mitigate overfitting. In PPO, negative learning primarily supports exploration and loss reweighting indicates distinct roles of token groups in updating targets. Carefully designed ablation studies examine how controlling these dynamics impacts optimization efficiency and practical performance.
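The split into positive learning, negative learning, and loss reweighting can be made concrete on the standard DPO loss. The sketch below is illustrative only. It assumes that the positive and negative terms correspond to the chosen and rejected log-probabilities, and that the reweighting factor is the sigmoid weight that appears in the standard DPO gradient. The function name `dpo_terms`, its arguments, the default `beta`, and the example numbers are hypothetical and are not taken from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dpo_terms(logp_w, logp_l, ref_w, ref_l, beta=0.1):
    """Per-pair DPO loss and the pieces of its gradient (illustrative sketch).

    logp_w / logp_l: policy log-probs of the chosen / rejected responses.
    ref_w / ref_l:   reference-model log-probs of the same responses.
    """
    # Implicit reward margin between chosen and rejected responses.
    margin = (logp_w - ref_w) - (logp_l - ref_l)
    # log(1 + exp(-beta * margin)) equals -log(sigmoid(beta * margin)).
    loss = np.logaddexp(0.0, -beta * margin)
    # Reweighting factor: close to 1 when the pair is ranked wrongly,
    # close to 0 once the margin is already large.
    weight = sigmoid(-beta * margin)
    # Derivative of the loss w.r.t. each log-prob (assumed POS / NEG parts).
    grad_pos = -beta * weight  # descent raises the chosen log-prob
    grad_neg = beta * weight   # descent lowers the rejected log-prob
    return loss, weight, grad_pos, grad_neg

# Hypothetical log-probabilities for two preference pairs.
loss, weight, g_pos, g_neg = dpo_terms(
    logp_w=np.array([-10.0, -8.0]),
    logp_l=np.array([-12.0, -30.0]),
    ref_w=np.array([-11.0, -11.0]),
    ref_l=np.array([-11.5, -11.5]),
    beta=0.5,
)
print(loss, weight, g_pos, g_neg)
```

In this form the positive and negative parts always share one weight per pair, which is one way to read the claim that they jointly shape the target while the weight only scales the step.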

What carries the argument

Gradient-based update target directions and the differentiated roles of positive learning, negative learning, and loss reweighting in shaping optimization behavior.

If this is right

These dynamics explain the distinct behaviors of DPO and PPO in practice.
Ablation studies demonstrate that adjusting these components affects how efficiently models optimize and perform on alignment tasks.
The findings provide a basis for developing improved preference optimization techniques for LLMs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Designers of new alignment algorithms could selectively incorporate stable targeting from DPO with targeted exploration from PPO.
The component analysis might apply to other preference optimization variants not examined here.
Empirical tests could measure how these roles manifest in training trajectories of real models.

Load-bearing premise

Differences in gradient update targets and the specific roles played by positive and negative learning plus loss reweighting are the main drivers of DPO versus PPO behaviors, isolated cleanly by the ablation studies without major confounding from other factors.

What would settle it

Observing that altering loss reweighting in DPO does not affect overfitting rates as expected, or that PPO negative learning does not enhance exploration in isolation, would indicate the roles are not as described.

Figures

Figures reproduced from arXiv: 2512.00778 by Bo Han, Gang Niu, Masashi Sugiyama, Qizhou Wang, Yue Wang, Zizhuo Zhang.

**Figure 1.** DPO Learning Dynamics. For the Pythia-2.8B model trained on UltraFeedback and tested on HH-RLHF-helpfulness, we show the dynamics of $G$ measured per 1000 training steps: (a) the overall objective $L_{\mathrm{dpo}}$ (TOT); (b) the positive $L^{+}_{\mathrm{dpo}}$ (POS) and negative $L^{-}_{\mathrm{dpo}}$ (NEG) components; and (c) the weighted top $L^{\uparrow}_{\mathrm{dpo}}$ (TOP), middle $L^{\rightarrow}_{\mathrm{dpo}}$ (MID), and bottom $L^{\downarrow}_{\mathrm{dpo}}$ (BOT) components. The log scale is used for $G$ due to…

**Figure 2.** PPO Learning Dynamics. For the Pythia-2.8B model trained on UltraFeedback and tested on HH-RLHF-helpfulness, we show the dynamics of $G$ measured per 400 training steps: (a) the overall objective $L_{\mathrm{ppo}}$ (TOT); (b) the positive $L^{+}_{\mathrm{ppo}}$ (POS) and negative $L^{-}_{\mathrm{ppo}}$ (NEG) components; and (c) the weighted top $L^{\uparrow}_{\mathrm{ppo}}$ (TOP), middle $L^{\rightarrow}_{\mathrm{ppo}}$ (MID), and bottom $L^{\downarrow}_{\mathrm{ppo}}$ (BOT) components. The log scale is used for $G$ to alig…

**Figure 3.** Average (Raw) Advantages during PPO for top (TOP), middle (MID), and bottom (BOT) weighted data.

**Figure 4.** Performance under Ablation. For the Pythia-2.8B model trained on UltraFeedback and tested on HH-RLHF-helpfulness, we show performance measured by Win Rate under: (a) DPO ablations removing negative learning (w/o NEG) before 3000 steps and positive learning (w/o POS) after 6000 steps, (b) PPO ablations removing top (w/o TOP) and middle (w/o MID) weighted data, (c) cDPO that emphasizes positive learning earl…

**Figure 5.** Gradient Magnitudes. For the Pythia-2.8B model trained on UltraFeedback and tested on HH-RLHF-helpfulness, we illustrate the distributions of gradient magnitudes computed with respect to mini-batches for DPO and PPO, across training steps. Normal data points are colored in blue, while outliers detected by IQR are colored in red. Moreover, to align with the common practice of gradient clipping, we excluded …

**Figure 6.** Gradient Dynamics With and Without Outlier Filtering. For the Pythia-2.8B model trained on UltraFeedback and tested on HH-RLHF-helpfulness, we present the learning dynamics of $G$ for (a) DPO and (b) PPO without excluding batches with extreme gradient magnitudes, in contrast to the main results with outlier filtering shown in (c) and (d), as reported in the main text.

**Figure 7.** DPO Gradient Dynamics. For the Pythia-2.8B model trained on UltraFeedback and tested on HH-RLHF-helpfulness, we report (a) gradient magnitudes and (b) gradient alignments for DPO. Here, $D'$ is built from the training dataset, so we focus on in-distribution rather than out-of-distribution responses as in…

**Figure 8.** DPO Gradient Dynamics. For the Pythia-2.8B model trained on UltraFeedback and tested on HH-RLHF-helpfulness, we compare DPO with and without SFT, reporting gradient magnitudes (a) with SFT, (b) without SFT, and (c) gradient dynamics without SFT.

**Figure 9.** Layer-wise Learning Dynamics. For the Pythia-2.8B model trained on UltraFeedback and tested on HH-RLHF-helpfulness, we report the layer-wise dynamics for (a) DPO and (b) PPO.

**Figure 10.** Qwen Learning Dynamics. For the Qwen3-1.7B model trained on UltraFeedback and tested on HH-RLHF-helpfulness, we show the learning dynamics of $G$ for DPO and PPO, covering the overall objectives ((a) for DPO and (d) for PPO), the positive and negative components ((b) for DPO and (e) for PPO), and each weighted component ((c) for DPO and (f) for PPO), respectively.

**Figure 11.** Performance under Ablation. For the Qwen3-1.7B model trained on UltraFeedback and tested on HH-RLHF-helpfulness, we show performance measured by Win Rate under: (a) DPO ablations removing negative learning (w/o NEG) and positive learning (w/o POS), (b) PPO ablations removing top (w/o TOP) and middle (w/o MID) weighted data, (c) cDPO that emphasizes positive learning early and negative learning later, wher…

**Figure 12.** DPO vs. cDPO. For the Pythia-2.8B model trained on UltraFeedback and tested on HH-RLHF-helpfulness, we compare DPO with cDPO across hyper-parameters. Finally, we summarize the results of cPPO, cDPO, and hPPO with standard deviations. For each method and its corresponding baseline, we used the best-performing hyperparameters and conducted five runs with different random seeds. The improvements are both not…

**Figure 13.** DPO vs. cDPO. For the Qwen3-1.7B model trained on UltraFeedback and tested on HH-RLHF-helpfulness, we compare DPO with cDPO across hyper-parameters.

**Figure 14.** PPO vs. cPPO. For the Pythia-2.8B model trained on UltraFeedback and tested on HH-RLHF-helpfulness, we compare PPO with cPPO across hyper-parameters. Both variants of cPPO are considered: one controlling the top weighted data (TOP) and the other controlling the middle weighted data (MID).

**Figure 15.** PPO vs. cPPO. For the Qwen3-1.7B model trained on UltraFeedback and tested on HH-RLHF-helpfulness, we compare PPO with cPPO across hyper-parameters. Both variants of cPPO are considered: one controlling the top weighted data (TOP) and the other controlling the middle weighted data (MID).

**Figure 16.** PPO vs. cPPO. For the Llama3-8B model trained on UltraFeedback and tested on HH-RLHF-helpfulness, we compare PPO with cPPO across hyper-parameters. Both variants of cPPO are considered: one controlling the top weighted data (TOP) and the other controlling the middle weighted data (MID).

**Figure 17.** PPO vs. hPPO. For the Pythia-2.8B model trained on UltraFeedback and tested on HH-RLHF-helpfulness, we compare PPO with hPPO across hyper-parameters.

**Figure 18.** PPO vs. hPPO. For the Qwen3-1.7B model trained on UltraFeedback and tested on HH-RLHF-helpfulness, we compare PPO with hPPO across hyper-parameters.

**Figure 19.** PPO vs. hPPO. For the Llama3-8B model trained on UltraFeedback and tested on HH-RLHF-helpfulness, we compare PPO with hPPO across hyper-parameters.
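The Figure 5 caption says that outlier batches are detected by IQR. A minimal sketch of that kind of filter follows. It assumes the conventional fence of 1.5 times the interquartile range, which the caption does not state; the function name `iqr_outlier_mask`, the parameter `k`, and the sample gradient norms are hypothetical.

```python
import numpy as np

def iqr_outlier_mask(grad_norms, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR].

    k = 1.5 is the conventional choice and an assumption here,
    not a value reported in the paper.
    """
    norms = np.asarray(grad_norms, dtype=float)
    q1, q3 = np.percentile(norms, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (norms < lower) | (norms > upper)

# Hypothetical per-batch gradient magnitudes with one extreme batch.
norms = [1.0, 1.2, 0.9, 1.1, 1.05, 25.0]
mask = iqr_outlier_mask(norms)
kept = [n for n, is_outlier in zip(norms, mask) if not is_outlier]
print(mask, kept)
```

Filtering on a robust statistic such as the IQR keeps a single extreme batch from setting the threshold, which a mean-and-standard-deviation rule would not guarantee.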

Original abstract

Preference optimization (PO) is indispensable for large language models (LLMs), with methods such as direct preference optimization (DPO) and proximal policy optimization (PPO) achieving great success. A common belief is that DPO is supervised learning while PPO is reinforcement learning, yet deeper analyses for the reasons underlying these differences remain lacking. To fill this gap, we analyze their optimization dynamics, revealing distinct algorithmic behaviors and comprehending their underlying causes. First, we examine the target directions of gradient-based updates and find that DPO follows stable targets, whereas PPO balances exploration and exploitation, validating the common belief yet from this new perspective. Second, we examine the roles of positive learning, negative learning, and loss reweighting, which are three key yet seldom discussed components within PO methods. Our analyses reveal that these components play fairly different roles. In DPO, positive and negative learning jointly shape the targets. However, loss reweighting in DPO acts less as a reward signal but more as a regularizer to mitigate overfitting. In PPO, negative learning primarily supports exploration rather than determining the targets. Meanwhile, loss reweighting, related to the absolute advantages, indicates the distinct roles of token groups in updating targets. Given these findings, we conduct carefully designed ablation studies to further examine how controlling these dynamics impacts optimization efficiency and practical performance. The insights gained from our analyses not only deepen the understanding of PO methods but also inspire the development of more preference-aligned LLMs.
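For the PPO side, the abstract ties loss reweighting to absolute advantages and the figures speak of top, middle, and bottom weighted data. The sketch below shows one way such groups could be formed on the standard clipped PPO surrogate. It is an illustration under assumptions: splitting tokens by advantage sign into positive and negative learning, and using terciles of the absolute advantage for TOP, MID, and BOT, are guesses at the grouping rather than the paper's definitions. The names `ppo_token_losses`, `split_tokens`, and `clip_eps` are hypothetical.

```python
import numpy as np

def ppo_token_losses(logp_new, logp_old, adv, clip_eps=0.2):
    """Per-token clipped PPO surrogate, negated so it can be minimized."""
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    return -np.minimum(unclipped, clipped)

def split_tokens(adv):
    """Boolean masks for assumed token groups (not the paper's exact rule)."""
    adv = np.asarray(adv, dtype=float)
    abs_adv = np.abs(adv)
    lo, hi = np.quantile(abs_adv, [1 / 3, 2 / 3])  # tercile cut points
    return {
        "POS": adv > 0,                            # assumed positive learning
        "NEG": adv < 0,                            # assumed negative learning
        "TOP": abs_adv > hi,                       # largest |advantage|
        "MID": (abs_adv > lo) & (abs_adv <= hi),
        "BOT": abs_adv <= lo,                      # smallest |advantage|
    }

# Hypothetical advantages and log-probabilities for six tokens.
adv = np.array([2.0, -0.1, 0.5, -3.0, 0.05, 1.0])
logp_old = np.full(6, -2.0)
logp_new = logp_old + np.array([0.1, -0.1, 0.0, 0.3, 0.0, -0.2])
losses = ppo_token_losses(logp_new, logp_old, adv)
groups = split_tokens(adv)
for name, m in groups.items():
    print(name, losses[m].sum())
```

Summing the loss within each mask gives per-group objectives that could be ablated or rescaled separately, in the spirit of the w/o TOP and w/o MID ablations described in the figure captions.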

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper decomposes DPO and PPO through their gradient targets and the separate jobs of positive learning, negative learning, and loss reweighting, but the ablation controls need close checking.

The main thing here is a component-level look at why DPO and PPO end up behaving differently. The authors inspect the direction of the gradient updates and argue that DPO follows relatively stable targets while PPO mixes exploration and exploitation. They then assign roles: in DPO, positive and negative learning together set the targets and loss reweighting mostly acts as a regularizer against overfitting; in PPO, negative learning mainly helps exploration and reweighting flags different token groups for target updates. The ablations are presented as tests of how controlling these pieces changes optimization speed and final results.

Referee Report

2 major / 2 minor

Summary. The paper analyzes the optimization dynamics of preference optimization methods, focusing on DPO and PPO for LLMs. It claims that examining gradient update targets reveals DPO follows stable targets while PPO balances exploration and exploitation. It further dissects the roles of positive learning, negative learning, and loss reweighting, arguing these components have distinct functions in each method (joint target shaping and regularization in DPO; exploration support and token-group indication in PPO). These insights are validated through carefully designed ablation studies examining impacts on optimization efficiency and practical performance.

Significance. If the gradient-direction analyses and ablation results hold under controlled conditions, the work offers a mechanistic explanation for behavioral differences between DPO and PPO beyond the standard supervised-vs-reinforcement-learning framing. The explicit component-role breakdown and ablation validation provide concrete, testable distinctions that could guide hybrid or improved preference optimization algorithms. The direct inspection of update rules and the ablation studies constitute a strength in grounding the claims.

major comments (2)

§5 (Ablation studies): The paper states that ablations are 'carefully designed' to examine how controlling the dynamics of positive/negative learning and loss reweighting impacts performance, yet does not report explicit matching of effective learning rates, sampling distributions, or KL coefficients across DPO and PPO variants. This leaves open the possibility that observed differences arise from unmatched implementation details rather than the claimed distinct roles of the components.
§3 (Gradient target analysis): The distinction that DPO follows stable targets while PPO balances exploration/exploitation rests on inspection of update directions; however, the analysis does not quantify sensitivity to reference-policy strength or batch-size variations, which could alter the stability claim when the reference model is not fixed.

minor comments (2)

Notation for loss reweighting terms is introduced without a consolidated table of symbols, making it difficult to track how absolute advantages map to token groups across equations.
Figure captions for ablation results could more explicitly state which hyperparameters were held constant versus varied in each condition.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications and indicate the revisions we will make to improve the manuscript.

Referee: §5 (Ablation studies): The paper states that ablations are 'carefully designed' to examine how controlling the dynamics of positive/negative learning and loss reweighting impacts performance, yet does not report explicit matching of effective learning rates, sampling distributions, or KL coefficients across DPO and PPO variants. This leaves open the possibility that observed differences arise from unmatched implementation details rather than the claimed distinct roles of the components.

Authors: We acknowledge the validity of this point. The ablations were performed by varying only the targeted components (positive/negative learning and reweighting) while holding other settings to the standard values reported in the original DPO and PPO implementations. However, we did not provide an explicit cross-method matching of effective learning rates, sampling distributions, or KL coefficients, as the methods differ in their core formulations. To strengthen the presentation, we will revise §5 and add an appendix with a comprehensive hyperparameter table for all variants, along with a discussion of why the observed distinctions align with the component roles rather than implementation mismatches. We will also include a limited set of additional runs with adjusted learning rates to confirm the robustness of the findings. revision: yes

Referee: §3 (Gradient target analysis): The distinction that DPO follows stable targets while PPO balances exploration/exploitation rests on inspection of update directions; however, the analysis does not quantify sensitivity to reference-policy strength or batch-size variations, which could alter the stability claim when the reference model is not fixed.

Authors: The gradient analysis in §3 is derived from the closed-form expressions of the update targets, which mathematically establish DPO's fixed target direction (dependent on the preference ratio and fixed reference policy) versus PPO's dynamic dependence on the current policy. We agree that explicit quantification of sensitivity to reference-policy strength and batch-size variations would strengthen the stability claim. We will add a dedicated paragraph in §3 discussing the theoretical dependence on the reference model coefficient and include a small-scale empirical study varying this coefficient and batch size to show that the core distinction in target stability persists. revision: partial

Circularity Check

0 steps flagged

No significant circularity; analysis rests on direct inspection of objectives and ablations

The paper derives its claims by inspecting the gradient update targets and component roles directly from the standard DPO and PPO loss formulations and objectives. It then validates interpretations via ablation studies that control the identified dynamics. No step reduces a claimed result or prediction to a fitted parameter from the same data, a self-definitional loop, or a load-bearing self-citation whose content is itself unverified. The central mechanistic distinctions are obtained from explicit decomposition of the update rules rather than by construction from the inputs being explained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on standard optimization and RL assumptions rather than introducing new fitted parameters or entities.

axioms (1)

domain assumption: Gradient directions computed from the loss functions accurately reflect the optimization targets and component contributions in DPO and PPO.
The central analyses begin by examining these gradient targets and roles.

pith-pipeline@v0.9.0 · 5804 in / 1284 out tokens · 38061 ms · 2026-05-21T18:33:22.627665+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "We examine the target directions of gradient-based updates and find that DPO follows stable targets, whereas PPO balances exploration and exploitation"

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear
Paper passage: "positive and negative learning jointly shape the targets... loss reweighting acts more as a regularizer"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 16 internal anchors

[1]

MIT press, 1998

Richard S Sutton, Andrew G Barto, et al.Reinforcement learning: An introduction. MIT press, 1998

work page 1998
[2]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

DeepSeek-V3 Technical Report

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[4]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[5]

Direct preference optimization: Your language model is secretly a reward model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. InNeurIPS, 2024

work page 2024
[6]

Towards effective evaluations and comparison for llm unlearning methods

Qizhou Wang, Bo Han, Puning Yang, Jianing Zhu, Tongliang Liu, and Masashi Sugiyama. Towards effective evaluations and comparison for llm unlearning methods. InICLR, 2025

work page 2025
[7]

Understanding why generalized reweighting does not improve over erm.arXiv preprint arXiv:2201.12293, 2022

Runtian Zhai, Chen Dan, Zico Kolter, and Pradeep Ravikumar. Understanding why generalized reweighting does not improve over erm.arXiv preprint arXiv:2201.12293, 2022

work page arXiv 2022
[8]

Spurious Rewards: Rethinking Training Signals in RLVR

Rulin Shao, Shuyue Stella Li, Rui Xin, Scott Geng, Yiping Wang, Sewoong Oh, Simon Shaolei Du, Nathan Lambert, Sewon Min, Ranjay Krishna, et al. Spurious rewards: Rethinking training signals in rlvr.arXiv preprint arXiv:2506.10947, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

MIT press, 2018

Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar.Foundations of machine learning. MIT press, 2018

work page 2018
[10]

Understanding black-box predictions via influence functions

Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. InICML, 2017

work page 2017
[11]

Learning dynamics of llm finetuning.arXiv preprint arXiv:2407.10490, 2024

Yi Ren and Danica J Sutherland. Learning dynamics of llm finetuning.arXiv preprint arXiv:2407.10490, 2024

work page arXiv 2024
[12]

Towards a theoretical framework of out-of-distribution generalization.NeurIPS, 2021

Haotian Ye, Chuanlong Xie, Tianle Cai, Ruichen Li, Zhenguo Li, and Liwei Wang. Towards a theoretical framework of out-of-distribution generalization.NeurIPS, 2021

work page 2021
[13]

Pythia: A suite for analyzing large language models across training and scaling

Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. InICML, 2023

work page 2023
[14]

Enhancing Chat Language Models by Scaling High-quality Instructional Conversations

Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Zhi Zheng, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. Enhancing chat language models by scaling high-quality instructional conversations.arXiv preprint arXiv:2305.14233, 2023. 10 What Is Preference Optimization Doing, How and Why?

work page internal anchor Pith review Pith/arXiv arXiv 2023
[15]

Ultrafeedback: Boosting language models with high-quality feedback

Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, and Maosong Sun. Ultrafeedback: Boosting language models with high-quality feedback. InICML, 2024

work page 2024
[16]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback.arXiv preprint arXiv:2204.05862, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[17]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[18]

The llama 3 herd of models.arXiv e-prints, pages arXiv–2407, 2024

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models.arXiv e-prints, pages arXiv–2407, 2024

work page 2024
[19]

Reinforcement learning by reward-weighted regression for operational space control

Jan Peters and Stefan Schaal. Reinforcement learning by reward-weighted regression for operational space control. InICML, 2007

work page 2007
[20]

Mm algorithms for generalized bradley-terry models.The annals of statistics, 32(1):384–406, 2004

David R Hunter. Mm algorithms for generalized bradley-terry models.The annals of statistics, 32(1):384–406, 2004

work page 2004
[21]

Sutherland

Yi Ren and Danica J. Sutherland. Learning dynamics of LLM finetuning. InICLR, 2025

work page 2025
[22]

Sugiyama and M

M. Sugiyama and M. Kawanabe.Machine Learning in Non-Stationary Environments: Introduction to Covariate Shift Adaptation. MIT Press, 2012

work page 2012
[23]

Lodkaew, T

T. Lodkaew, T. Fang, T. Ishida, and M. Sugiyama. Importance weighting for aligning language models under deployment distribution shift.Transactions on Machine Learning Research, page 25 pages, 2025

work page 2025
[24]

Simple statistical gradient-following algorithms for connectionist reinforcement learning

Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3):229–256, 1992

work page 1992
[25]

Negative Preference Optimization: From Catastrophic Collapse to Effective Unlearning

Ruiqi Zhang, Licong Lin, Yu Bai, and Song Mei. Negative preference optimization: From catastrophic collapse to effective unlearning.arXiv preprint arXiv:2404.05868, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[26]

The surprising effectiveness of negative reinforcement in llm reasoning.arXiv preprint arXiv:2506.01347,

Xinyu Zhu, Mengzhou Xia, Zhepei Wei, Wei-Lin Chen, Danqi Chen, and Yu Meng. The surprising effectiveness of negative reinforcement in llm reasoning.arXiv preprint arXiv:2506.01347, 2025

work page arXiv 2025
[27]

Rethinking llm unlearning objectives: A gradient perspective and go beyond

Qizhou Wang, Jin Peng Zhou, Zhanke Zhou, Saebyeol Shin, Bo Han, and Kilian Q Weinberger. Rethinking llm unlearning objectives: A gradient perspective and go beyond. InICLR, 2025

work page 2025
[28]

Classification with noisy labels by importance reweighting.IEEE Transactions on pattern analysis and machine intelligence, 38(3):447–461, 2015

Tongliang Liu and Dacheng Tao. Classification with noisy labels by importance reweighting.IEEE Transactions on pattern analysis and machine intelligence, 38(3):447–461, 2015

work page 2015
[29]

Learning to reweight examples for robust deep learning

Mengye Ren, Wenyuan Zeng, Bin Yang, and Raquel Urtasun. Learning to reweight examples for robust deep learning. InICML, 2018

work page 2018
[30]

Adam: A Method for Stochastic Optimization

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[31]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[32]

Simpo: Simple preference optimization with a reference-free reward

Yu Meng, Mengzhou Xia, and Danqi Chen. Simpo: Simple preference optimization with a reference-free reward. InNeurIPS, 2024

work page 2024
[33]

Insights into alignment: Evaluating dpo and its variants across multiple tasks.arXiv preprint arXiv:2404.14723, 2024

Amir Saeidi, Shivanshu Verma, Md Nayem Uddin, and Chitta Baral. Insights into alignment: Evaluating dpo and its variants across multiple tasks.arXiv preprint arXiv:2404.14723, 2024

work page arXiv 2024
[34]

KTO: Model Alignment as Prospect Theoretic Optimization

Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. Kto: Model alignment as prospect theoretic optimization.arXiv preprint arXiv:2402.01306, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[35]

Contrastive preference optimization: Pushing the boundaries of llm performance in machine translation.arXiv preprint arXiv:2401.08417, 2024

Haoran Xu, Amr Sharaf, Yunmo Chen, Weiting Tan, Lingfeng Shen, Benjamin Van Durme, Kenton Murray, and Young Jin Kim. Contrastive preference optimization: Pushing the boundaries of llm performance in machine translation.arXiv preprint arXiv:2401.08417, 2024. 11 What Is Preference Optimization Doing, How and Why?

work page arXiv 2024
[36]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

[37]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025

[38]

Reward design with language models

Minae Kwon, Sang Michael Xie, Kalesha Bullard, and Dorsa Sadigh. Reward design with language models. arXiv preprint arXiv:2303.00001, 2023

[39]

Estimating training data influence by tracing gradient descent

Garima Pruthi, Frederick Liu, Satyen Kale, and Mukund Sundararajan. Estimating training data influence by tracing gradient descent. In NeurIPS, 2020

[40]

Unrolling SGD: Understanding factors influencing machine unlearning

Anvith Thudi, Gabriel Deza, Varun Chandrasekaran, and Nicolas Papernot. Unrolling SGD: Understanding factors influencing machine unlearning. In EuroS&P, 2022

[41]

Zephyr: Direct Distillation of LM Alignment

Lewis Tunstall, Edward Beeching, Nathan Lambert, Nazneen Rajani, Kashif Rasul, Younes Belkada, Shengyi Huang, Leandro von Werra, Clémentine Fourrier, Nathan Habib, et al. Zephyr: Direct distillation of LM alignment. arXiv preprint arXiv:2310.16944, 2023

[42]

Interpretable preferences via multi-objective reward modeling and mixture-of-experts

Haoxiang Wang, Wei Xiong, Tengyang Xie, Han Zhao, and Tong Zhang. Interpretable preferences via multi-objective reward modeling and mixture-of-experts. In EMNLP, 2024

[43]

Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs

Chris Yuhao Liu, Liang Zeng, Jiacai Liu, Rui Yan, Jujie He, Chaojie Wang, Shuicheng Yan, Yang Liu, and Yahui Zhou. Skywork-Reward: Bag of tricks for reward modeling in LLMs. arXiv preprint arXiv:2410.18451, 2024

[44]

Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators

Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B Hashimoto. Length-controlled AlpacaEval: A simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475, 2024

A Limitations and Further Discussions

Here, we acknowledge our limitations. First, Adam [30] or its variants are commo...

[1]

Reinforcement learning: An introduction

Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction. MIT Press, 1998

[2]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023

[3]

DeepSeek-V3 Technical Report

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024

[4]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

[5]

Direct preference optimization: Your language model is secretly a reward model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In NeurIPS, 2024

[6]

Towards effective evaluations and comparison for LLM unlearning methods

Qizhou Wang, Bo Han, Puning Yang, Jianing Zhu, Tongliang Liu, and Masashi Sugiyama. Towards effective evaluations and comparison for LLM unlearning methods. In ICLR, 2025

[7]

Understanding why generalized reweighting does not improve over ERM

Runtian Zhai, Chen Dan, Zico Kolter, and Pradeep Ravikumar. Understanding why generalized reweighting does not improve over ERM. arXiv preprint arXiv:2201.12293, 2022

[8]

Spurious Rewards: Rethinking Training Signals in RLVR

Rulin Shao, Shuyue Stella Li, Rui Xin, Scott Geng, Yiping Wang, Sewoong Oh, Simon Shaolei Du, Nathan Lambert, Sewon Min, Ranjay Krishna, et al. Spurious rewards: Rethinking training signals in RLVR. arXiv preprint arXiv:2506.10947, 2025

[9]

Foundations of machine learning

Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning. MIT Press, 2018

[10]

Understanding black-box predictions via influence functions

Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. In ICML, 2017

[11]

Learning dynamics of LLM finetuning

Yi Ren and Danica J Sutherland. Learning dynamics of LLM finetuning. arXiv preprint arXiv:2407.10490, 2024

[12]

Towards a theoretical framework of out-of-distribution generalization

Haotian Ye, Chuanlong Xie, Tianle Cai, Ruichen Li, Zhenguo Li, and Liwei Wang. Towards a theoretical framework of out-of-distribution generalization. In NeurIPS, 2021

[13]

Pythia: A suite for analyzing large language models across training and scaling

Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In ICML, 2023

[14]

Enhancing Chat Language Models by Scaling High-quality Instructional Conversations

Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Zhi Zheng, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. Enhancing chat language models by scaling high-quality instructional conversations. arXiv preprint arXiv:2305.14233, 2023

[15]

UltraFeedback: Boosting language models with high-quality feedback

Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, and Maosong Sun. UltraFeedback: Boosting language models with high-quality feedback. In ICML, 2024

[16]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022

[17]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

[18]

The Llama 3 herd of models

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The Llama 3 herd of models. arXiv e-prints, pages arXiv–2407, 2024

[19]

Reinforcement learning by reward-weighted regression for operational space control

Jan Peters and Stefan Schaal. Reinforcement learning by reward-weighted regression for operational space control. In ICML, 2007

[20]

MM algorithms for generalized Bradley-Terry models

David R Hunter. MM algorithms for generalized Bradley-Terry models. The Annals of Statistics, 32(1):384–406, 2004

[21]

Learning dynamics of LLM finetuning

Yi Ren and Danica J. Sutherland. Learning dynamics of LLM finetuning. In ICLR, 2025

[22]

Machine Learning in Non-Stationary Environments: Introduction to Covariate Shift Adaptation

M. Sugiyama and M. Kawanabe. Machine Learning in Non-Stationary Environments: Introduction to Covariate Shift Adaptation. MIT Press, 2012

[23]

Importance weighting for aligning language models under deployment distribution shift

T. Lodkaew, T. Fang, T. Ishida, and M. Sugiyama. Importance weighting for aligning language models under deployment distribution shift. Transactions on Machine Learning Research, 25 pages, 2025

[24]

Simple statistical gradient-following algorithms for connectionist reinforcement learning

Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256, 1992

[25]

Negative Preference Optimization: From Catastrophic Collapse to Effective Unlearning

Ruiqi Zhang, Licong Lin, Yu Bai, and Song Mei. Negative preference optimization: From catastrophic collapse to effective unlearning. arXiv preprint arXiv:2404.05868, 2024

[26]

The surprising effectiveness of negative reinforcement in LLM reasoning

Xinyu Zhu, Mengzhou Xia, Zhepei Wei, Wei-Lin Chen, Danqi Chen, and Yu Meng. The surprising effectiveness of negative reinforcement in LLM reasoning. arXiv preprint arXiv:2506.01347, 2025

[27]

Rethinking LLM unlearning objectives: A gradient perspective and go beyond

Qizhou Wang, Jin Peng Zhou, Zhanke Zhou, Saebyeol Shin, Bo Han, and Kilian Q Weinberger. Rethinking LLM unlearning objectives: A gradient perspective and go beyond. In ICLR, 2025

[28]

Classification with noisy labels by importance reweighting

Tongliang Liu and Dacheng Tao. Classification with noisy labels by importance reweighting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(3):447–461, 2015

[29]

Learning to reweight examples for robust deep learning

Mengye Ren, Wenyuan Zeng, Bin Yang, and Raquel Urtasun. Learning to reweight examples for robust deep learning. In ICML, 2018
