Difficulty-Based Preference Data Selection by DPO Implicit Reward Gap

Rongwu Xu; Xuan Qi; Zhijing Jin

arxiv: 2508.04149 · v2 · pith:6RDEEJZ7new · submitted 2025-08-06 · cs.CL · cs.AI · cs.LG

Pith reviewed 2026-05-21 23:42 UTC · model grok-4.3

classification cs.CL · cs.AI · cs.LG

keywords DPO · preference optimization · data selection · reward gap · LLM alignment · data efficiency · difficulty sampling

The pith

Selecting preference data with smaller DPO implicit reward gaps allows superior LLM alignment using only 10 percent of the data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a way to pick which preference examples are worth training on by looking at how small the gap is in the model's implicit rewards between the preferred and dispreferred responses. Smaller gaps point to harder cases that give more learning signal. This selection strategy lets models reach better alignment performance than several other data selection methods, even when using just one tenth of the full preference dataset. A reader would care because building high-quality preference data is expensive, and this approach makes alignment more practical with limited budgets. The method is tested on multiple datasets and tasks to show consistent gains.

Core claim

The central claim is that preference data examples with smaller DPO implicit reward gaps represent more challenging cases, and selecting them for training leads to improved data efficiency and better model alignment compared to using random or other selection methods.

What carries the argument

The DPO implicit reward gap, defined as the difference in the implicit rewards assigned to the preferred and rejected responses under the current policy, which is used to rank and select the most difficult training examples.
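For reference, the quantity described here can be written out using the definition quoted in the referee report on this page. The per-pair loss in the last line is the standard DPO objective; the symbol \hat{r}_\theta for the implicit reward is notation chosen here for illustration and may differ from the paper's.

```latex
\[
\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)},
\qquad
\Delta(x, y_w, y_l) = \hat{r}_\theta(x, y_w) - \hat{r}_\theta(x, y_l)
\]
\[
\ell_{\mathrm{DPO}}(x, y_w, y_l) = -\log \sigma\bigl(\Delta(x, y_w, y_l)\bigr)
\]
```

Because -log σ(Δ) decreases as Δ grows, a pair with a smaller gap carries a larger DPO loss under the model used for scoring. That is consistent with reading small-gap pairs as harder, though it does not by itself rule out the noise explanations raised under "Load-bearing premise".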

If this is right

Models achieve higher alignment scores when trained on the selected subset than on the full dataset or random subsets.
The approach beats five strong baseline selection methods across several alignment benchmarks.
Only 10% of the data is needed to reach superior performance.
Data efficiency improves for DPO and potentially for other preference optimization techniques.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This selection criterion might generalize to other preference optimization methods beyond DPO by adapting the reward gap concept.
Practitioners could apply this to reduce annotation costs when building new preference datasets.
Testing on larger scale models could reveal if the gap remains a reliable difficulty signal as model size increases.

Load-bearing premise

A smaller DPO implicit reward gap actually indicates a more difficult or informative example instead of reflecting noise in the labels or peculiarities of the current model state.

What would settle it

Running the selection method and a random selection baseline on the same dataset and measuring if the performance difference disappears or reverses on a standard alignment benchmark like MT-Bench or AlpacaEval.

Figures

Figures reproduced from arXiv: 2508.04149 by Rongwu Xu, Xuan Qi, Zhijing Jin.

**Figure 2.** Performance scaling effects with different data selection ratios on RewardBench using ...

**Figure 3.** The overlap of selected data among our method and four baselines. The legend indicates selection agreement: ...

Original abstract

Aligning large language models (LLMs) with human preferences is a critical challenge in AI research. While methods like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) are widely used, they often rely on large, costly preference datasets. The current work lacks methods for high-quality data selection specifically for preference data. In this work, we introduce a novel difficulty-based data selection strategy for preference datasets, grounded in the DPO implicit reward mechanism. By selecting preference data examples with smaller DPO implicit reward gaps, which are indicative of more challenging cases, we improve data efficiency and model alignment. Our approach consistently outperforms five strong baselines across multiple datasets and alignment tasks, achieving superior performance with only 10% of the original data. This principled, efficient selection method offers a promising solution for scaling LLM alignment with limited resources.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a simple rule for picking hard preference pairs via the DPO reward gap and reports solid gains at 10% data, but the evidence that the gap actually tracks difficulty is still thin.

The core claim is that examples with smaller DPO implicit reward gaps are harder and more worth keeping, so you can drop to 10% of the data and still beat five baselines on alignment tasks. That is the one practical takeaway worth checking first.

The method is straightforward: compute the gap from the current policy and reference model, sort, and keep the bottom slice. It ties directly to the DPO loss, which is a clean move if the gap really signals learning value rather than just model uncertainty or label issues. The reported wins across multiple datasets and tasks with far less data are the part that could matter for people who actually run these alignments and pay for preference labels. If the gains hold under tighter controls, this is the kind of incremental efficiency trick that gets used in practice.

The experiments appear to be run on standard setups, and the abstract frames the selection rule as new relative to the cited baselines. That is fair credit for the downstream use even if the underlying quantity comes from the original DPO paper.

The soft spot is the interpretation of the gap itself. A small gap can appear for several reasons that have nothing to do with intrinsic difficulty: the model simply has not seen enough signal yet, the chosen response is noisy, or the pair sits outside the reference distribution. The stress-test note flags this, and the abstract gives no sign that the authors tested against perturbed labels or an external difficulty oracle. Without those checks, the consistent gains could be driven by length, topic, or other surface features rather than the intended difficulty signal. Statistical significance and split details are also missing from the summary, which makes it harder to judge how stable the 10% result is.

This is the sort of paper that belongs in a reading group focused on data-efficient alignment. Practitioners who need to stretch limited preference data will find the method easy to try and the reported numbers worth replicating. It is not a foundational result, but the idea is concrete enough that a serious referee should see it. I would send it out for review rather than desk reject, with the expectation that the authors add controls for what the gap actually measures and report variance across runs.
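The "compute the gap, sort, keep the bottom slice" recipe described in the letter can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names, the default `beta`, and the `by_magnitude` switch are all assumptions introduced here, and the inputs are taken to be summed sequence log-probabilities that have already been computed for the policy and the reference model.

```python
from typing import List, Sequence


def implicit_reward_gap(logp_w: float, logp_l: float,
                        ref_logp_w: float, ref_logp_l: float,
                        beta: float = 0.1) -> float:
    """DPO implicit reward gap for one preference pair.

    logp_* are log pi_theta(y|x) and ref_logp_* are log pi_ref(y|x)
    for the preferred (w) and rejected (l) responses.
    beta=0.1 is an illustrative default, not a value from the paper.
    """
    reward_w = beta * (logp_w - ref_logp_w)
    reward_l = beta * (logp_l - ref_logp_l)
    return reward_w - reward_l


def select_smallest_gap(gaps: Sequence[float], fraction: float = 0.1,
                        by_magnitude: bool = False) -> List[int]:
    """Return indices of the `fraction` of pairs with the smallest gap.

    Assumption: the page does not say whether pairs are ranked by the
    signed gap or by its absolute value, so both are offered here.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, int(len(gaps) * fraction))  # keep at least one example
    key = (lambda i: abs(gaps[i])) if by_magnitude else (lambda i: gaps[i])
    order = sorted(range(len(gaps)), key=key)
    return order[:k]
```

Ranking by the signed gap puts pairs where the model currently prefers the rejected response first; ranking by magnitude puts near-ties first. Which of the two the paper uses is not stated on this page, so the choice should be checked against the original before replicating.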

Referee Report

2 major / 1 minor

Summary. The manuscript proposes selecting a subset of preference data for DPO by ranking examples according to the magnitude of their DPO implicit reward gap and retaining the bottom-k (smallest-gap) examples, which the authors interpret as the most difficult or informative cases. Experiments on multiple datasets and alignment tasks report that models trained on only 10% of the data selected this way outperform five baselines.

Significance. If the central claim holds after validation, the method would offer a low-cost, model-internal way to improve data efficiency in preference optimization without requiring additional human annotations or external difficulty oracles. It re-uses a quantity already computed inside DPO, which is a practical strength.

major comments (2)

[Method] Method section, definition of Δ: The selection rule retains examples with the smallest Δ = β log(π_θ(y_w|x)/π_ref(y_w|x)) − β log(π_θ(y_l|x)/π_ref(y_l|x)). Because Δ is computed from the current policy π_θ, a small value can arise from model uncertainty, label noise, or reference-model mismatch rather than intrinsic example difficulty. The headline claim that smaller gaps mark “more challenging cases” therefore requires direct evidence (e.g., label-perturbation experiments or correlation with an independent difficulty measure) that is not supplied.
[Experiments] Results section: The abstract and experimental claims state consistent outperformance with 10% data across five baselines and multiple tasks, yet no statistical significance tests, variance across random seeds, or controls for confounding variables (response length, topic distribution, or label quality) are reported. Without these, the reported gains cannot be confidently attributed to the difficulty-based selection rule.
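The first major comment asks for a correlation between Δ and an independent difficulty measure. One way such a check might be run is a Spearman rank correlation between per-example gaps and an external score. The sketch below is a plain-Python version under that assumption; the choice of Spearman and the function names are made here for illustration, and the report itself does not prescribe a particular statistic or difficulty measure.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0.0 or vy == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)
```

If small gaps do track difficulty, one would expect a clearly negative correlation between Δ and the external difficulty score; a value near zero would support the referee's concern that the gap reflects something else.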

minor comments (1)

[Abstract] The abstract should explicitly name the five baselines and the concrete datasets/tasks so that the scope of the empirical claims is immediately clear.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive feedback on our manuscript. We address each of the major comments below and have made revisions to strengthen the paper where appropriate.

Referee: [Method] Method section, definition of Δ: The selection rule retains examples with the smallest Δ = β log(π_θ(y_w|x)/π_ref(y_w|x)) − β log(π_θ(y_l|x)/π_ref(y_l|x)). Because Δ is computed from the current policy π_θ, a small value can arise from model uncertainty, label noise, or reference-model mismatch rather than intrinsic example difficulty. The headline claim that smaller gaps mark “more challenging cases” therefore requires direct evidence (e.g., label-perturbation experiments or correlation with an independent difficulty measure) that is not supplied.

Authors: We appreciate the referee's observation regarding the potential sources of small Δ values. In the DPO framework, the implicit reward gap Δ quantifies the difference in log-probability ratios between the winning and losing responses. Our selection strategy is motivated by the idea that examples with small gaps are those where the model has not yet strongly differentiated the preferred response, which we posit correspond to more informative or challenging cases for alignment. However, we agree that this interpretation would benefit from additional supporting analysis. In the revised manuscript, we have expanded the Method section to discuss alternative explanations for small Δ and added an empirical analysis showing correlation between low-Δ examples and other difficulty indicators such as higher model entropy on the responses.
revision: partial

Referee: [Experiments] Results section: The abstract and experimental claims state consistent outperformance with 10% data across five baselines and multiple tasks, yet no statistical significance tests, variance across random seeds, or controls for confounding variables (response length, topic distribution, or label quality) are reported. Without these, the reported gains cannot be confidently attributed to the difficulty-based selection rule.

Authors: We acknowledge the importance of statistical rigor in validating our experimental results. The original manuscript reported average performance improvements but did not include variance estimates or formal significance testing. In the revised version, we have re-run the experiments with multiple random seeds (at least 3 per setting) and report mean and standard deviation. We have also performed paired t-tests to assess statistical significance of the improvements over baselines. Additionally, we include controls by reporting results on length-matched subsets and analyzing topic distributions to rule out confounding factors. These additions are presented in the updated Results section and Appendix.
revision: yes
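The seed-level reporting and paired t-test mentioned in this response could look roughly like the following. It is a sketch under stated assumptions: scores are paired by random seed, the function names are hypothetical, and only the t statistic and degrees of freedom are computed (a p-value would need a t-distribution table or a statistics library). It does not reproduce any analysis from the paper.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_and_std(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation across seeds (needs >= 2 seeds)."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_t_statistic(method: Sequence[float],
                       baseline: Sequence[float]) -> Tuple[float, int]:
    """Paired t statistic and degrees of freedom.

    Assumes method[i] and baseline[i] come from the same seed, so the
    test is run on the per-seed differences.
    """
    if len(method) != len(baseline) or len(method) < 2:
        raise ValueError("need two equal-length samples with >= 2 seeds")
    diffs = [m - b for m, b in zip(method, baseline)]
    sd = statistics.stdev(diffs)
    if sd == 0.0:
        raise ValueError("t statistic is undefined when all differences are equal")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1
```

With only three seeds per setting the test has two degrees of freedom, so even large t values correspond to wide uncertainty; more seeds would give a more informative comparison.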

Circularity Check

0 steps flagged

No significant circularity in the data selection heuristic

The paper takes the standard DPO implicit reward gap Δ = β log(π_θ(y_w|x)/π_ref(y_w|x)) − β log(π_θ(y_l|x)/π_ref(y_l|x)) directly from the DPO formulation and proposes its use as a proxy for example difficulty in a downstream selection rule. This is an empirical heuristic rather than a derivation that reduces the claimed result to its inputs by construction. No self-definitional loop, fitted parameter renamed as prediction, or load-bearing self-citation chain is present; the performance gains at 10% data are reported via external baseline comparisons and remain falsifiable. The method is self-contained against the DPO reference without importing uniqueness theorems or ansatzes from the authors' prior work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The method inherits the standard DPO loss and implicit reward definition from prior work; no new free parameters, axioms, or invented entities are introduced in the abstract.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Data Difficulty and the Generalization-Extrapolation Tradeoff in LLM Fine-Tuning
cs.LG · 2026-05 · unverdicted · novelty 6.0

For a fixed data budget in LLM supervised fine-tuning, optimal data difficulty shifts toward harder examples as the budget grows because of the tradeoff between in-distribution generalization gap and extrapolation gap.


Swabha Swayamdipta, Roy Schwartz, Nicholas Lourie, Yizhong Wang, Hannaneh Hajishirzi, Noah A Smith, and Yejin Choi. Dataset cartography: Mapping and diagnosing datasets with training dynamics. arXiv preprint arXiv:2009.10795, 2020

work page arXiv 2009

[12] [12]

Identifying mislabeled data using the area under the margin ranking

Geoff Pleiss, Tianyi Zhang, Ethan Elenberg, and Kilian Q Weinberger. Identifying mislabeled data using the area under the margin ranking. Advances in Neural Information Processing Systems, 33:17044–17056, 2020

work page 2020

[13] [13]

Self-guided curriculum learning for neural machine translation.Transactions of the Association for Computational Linguistics, 11:452–468, 2023

Fei Yuan, Liang Huang, and Qun Liu. Self-guided curriculum learning for neural machine translation.Transactions of the Association for Computational Linguistics, 11:452–468, 2023

work page 2023

[14] [14]

Openbook qa: A new dataset for open book question answering

Todor Agarwal and Mohit Bansal. Openbook qa: A new dataset for open book question answering. Advances in Neural Information Processing Systems, 34:9473–9487, 2021

work page 2021

[15] [15]

Beyond neural scaling laws: beating power law scaling via data pruning.Advances in Neural Information Processing Systems, 35:19523–19536, 2022

Ben Sorscher, Robert Geirhos, Shashank Shekhar, Surya Ganguli, and Ari Morcos. Beyond neural scaling laws: beating power law scaling via data pruning.Advances in Neural Information Processing Systems, 35:19523–19536, 2022

work page 2022

[16] [16]

Data selection for language models via importance resampling

Sang Michael Xie, Shibani Santurkar, Tengyu Ma, and Percy Liang. Data selection for language models via importance resampling. arXiv preprint arXiv:2302.03169, 2023

work page arXiv 2023

[17] [17]

Deep learning on a data diet: Finding important examples early in training

Mansheej Paul, Surya Ganguli, and Gintare Karolina Dziugaite. Deep learning on a data diet: Finding important examples early in training. Advances in Neural Information Processing Systems, 34:20596–20607, 2021

work page 2021

[18] [18]

Prioritized training on points that are learnable, worth learning, and not yet learnt,

Sören Mindermann, Krishnamurthy Dvijotham, Sven Gowal, Robert Stanforth, Balaji Qin, Jonathan Uesato, Pushmeet Arand, Maximilian Mann, and Pushmeet Kohli. Prioritized training on points that are learnable, worth learning, and not yet learnt. arXiv preprint arXiv:2206.07137, 2022

work page arXiv 2022

[19] [19]

Less: Selecting influential data for targeted instruction tuning

Mengzhou Marion, Sang Michael Xie, Shibani Santurkar, and Percy Liang. Less: Selecting influential data for targeted instruction tuning. arXiv preprint arXiv:2402.04333, 2023. 11

work page arXiv 2023

[20] [20]

Prin- cipled data selection for alignment: The hidden risks of difficult examples.arXiv preprint arXiv:2502.09650,

Chengqian Gao, Haonan Li, Liu Liu, Zeke Xie, Peilin Zhao, and Zhiqiang Xu. Principled data selection for alignment: The hidden risks of difficult examples. CoRR, abs/2502.09650, 2025

work page arXiv 2025

[21] [21]

Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs

Chris Yuhao Liu, Liang Zeng, Jiacai Liu, Rui Yan, Jujie He, Chaojie Wang, Shuicheng Yan, Yang Liu, and Yahui Zhou. Skywork-reward: Bag of tricks for reward modeling in llms. CoRR, abs/2410.18451, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[22] [22]

UltraFeedback: Boosting Language Models with Scaled AI Feedback

Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, and Maosong Sun. Ultrafeedback: Boosting language models with high-quality feedback. CoRR, abs/2310.01377, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[23] [23]

Rlhflow/pair_data_v2_80k_wsafety: A dataset of 80k paired user-assistant interactions

RLHFlow. Rlhflow/pair_data_v2_80k_wsafety: A dataset of 80k paired user-assistant interactions. Hugging Face Dataset Repository, 2024. Dataset used to train Qwen/WorldPM-72B-RLHFLow model for preference learning

work page 2024

[24] [25]

GPT-4 Technical Report

OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[25] [26]

Constitutional AI: Harmlessness from AI Feedback

Anthropic. Claude: A next-generation ai assistant based on constitutional ai. arXiv preprint arXiv:2212.08073, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[26] [27]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[27] [28]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[28] [29]

Trust region policy optimiza- tion

John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimiza- tion. International conference on machine learning, pages 1889–1897, 2015

work page 2015

[29] [30]

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, et al. Open problems and fundamental limitations of reinforcement learning from human feedback. arXiv preprint arXiv:2307.15217, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[30] [31]

A general theoretical paradigm to understand learning from human preferences.arXiv preprint arXiv:2310.12036,

Mohammad Gheshlaghi Azar, Mark Rowland, Bilal Piot, Daniel Guo, Daniele Calandriello, Michal Valko, and Rémi Munos. A general theoretical paradigm to understand learning from human preferences. arXiv preprint arXiv:2310.12036, 2024

work page arXiv 2024

[31] [32]

KTO: Model Alignment as Prospect Theoretic Optimization

Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. Kto: Model alignment as prospect theory. arXiv preprint arXiv:2402.01306, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[32] [33]

Simpo: Simple preference optimization with a reference-free reward

Yu Meng, Mengzhou Xie, Yee Whye Teh, and Danqi Chen. Simpo: Simple preference optimization with a reference-free reward. arXiv preprint arXiv:2405.14734, 2024

work page arXiv 2024

[33] [34]

LIMA: Less Is More for Alignment

Chunting Zhou, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, et al. Lima: Less is more for alignment. arXiv preprint arXiv:2305.11206, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[34] [35]

The Flan Collection: Designing Data and Methods for Effective Instruction Tuning

Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V Le, Barret Zoph, Jason Wei, et al. The flan collection: Designing data and methods for effective instruction tuning. arXiv preprint arXiv:2301.13688, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[35] [36]

Self-evolved diverse data sampling for efficient instruction tuning,

Shengguang Wu, Keming Lu, Benfeng Xu, Junyang Lin, Qi Su, and Chang Zhou. Self-evolved diverse data sampling for efficient instruction tuning. arXiv preprint arXiv:2311.08182, 2023

work page arXiv 2023

[36] [37]

Fair data selection for rlhf

Seungone Park, Juyoung Kang, Seungjoon Yoon, Seunghyun Hwang, Dongkeun Kang, and Youngja Yoon. Fair data selection for rlhf. arXiv preprint arXiv:2402.11409, 2024

work page arXiv 2024

[37] [38]

Weak-to-strong preference learning

Liang Chen, Jiali Huang, Tianyu Xie, Nanyun Peng, and Danqi Chen. Weak-to-strong preference learning. arXiv preprint arXiv:2405.19045, 2024

work page arXiv 2024

[38] [39]

Manning, Stefano Ermon, and Chelsea Finn

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference ...

work page 2023

[39] [40]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, ...

work page internal anchor Pith review Pith/arXiv arXiv 2022

[40] [41]

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P

Yongcheng Zeng, Guoqing Liu, Weiyu Ma, Ning Yang, Haifeng Zhang, et al. Token-level direct preference optimization. arXiv preprint arXiv:2404.11999, 2024

work page arXiv 2024

[41] [42]

Disentangling length from quality in direct preference optimization, 2024

Ryan Park, Rafael Rafailov, Stefano Ermon, and Chelsea Finn. Disentangling length from quality in direct preference optimization. arXiv preprint arXiv:2403.19159, 2024

work page arXiv 2024

[42] [43]

Iterative preference learning from human feedback: Bridging theory and practice for rlhf under kl-constraint, 2024

Wei Xiong, Hanze Dong, Chenlu Ye, Ziqi Wang, Han Zhong, Heng Ji, Nan Jiang, and Tong Zhang. Iterative preference learning from human feedback: Bridging theory and practice for rlhf under kl-constraint, 2024

work page 2024

[43] [44]

RLHF Workflow: From Reward Modeling to Online RLHF

Hanze Dong, Wei Xiong, Bo Pang, Haoxiang Wang, Han Zhao, Yingbo Zhou, Nan Jiang, Doyen Sahoo, Caiming Xiong, and Tong Zhang. Rlhf workflow: From reward modeling to online rlhf. arXiv preprint arXiv:2405.07863, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[44] [45]

The llama 3 herd of models

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv e-prints, pages arXiv–2407, 2024

work page 2024

[45] [46]

Gemma Team. Gemma. 2024

work page 2024

[46] [47]

Ralph Allan Bradley and Milton E. Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39:324, 1952

work page 1952

[47] [48]

Tulu 3: Pushing Frontiers in Open Language Model Post-Training

Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, et al. Tulu 3: Pushing frontiers in open language model post-training. arXiv preprint arXiv:2411.15124, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[48] [49]

OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework

Jian Hu, Xibin Wu, Zilin Zhu, Xianyu, Weixun Wang, Dehao Zhang, and Yu Cao. Openrlhf: An easy-to-use, scalable and high-performance rlhf framework. arXiv preprint arXiv:2405.11143, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[49] [50]

Entropy law: The story behind data compression and llm performance

Mingjia Yin, Chuhan Wu, Yufei Wang, Hao Wang, Wei Guo, Yasheng Wang, Yong Liu, Ruiming Tang, Defu Lian, and Enhong Chen. Entropy law: The story behind data compression and llm performance. arXiv preprint arXiv:2407.06645, 2024

work page arXiv 2024

[50] [51]

Smith, and Hanna Hajishirzi

Nathan Lambert, Valentina Pyatkin, Jacob Morrison, LJ Miranda, Bill Yuchen Lin, Khyathi Raghavi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, Noah A. Smith, and Hannaneh Hajishirzi. Rewardbench: Evaluating reward models for language modeling. CoRR, abs/2403.13787, 2024

work page arXiv 2024

[51] [52]

Hashimoto

Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Alpacaeval: An automatic evaluator of instruction-following models. https://github. com/tatsu-lab/alpaca_eval, 5 2023

work page 2023

[52] [53]

Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, Aleksander Madry, Alex Baker-Whitcomb, Alex Beutel, Alex Borzunov, Alex Carney, Alex Chow, Alex Kirillov, Alex Nichol, Alex Paino, Alex Renzin, Alex Tachard Passos, Alexander Kirillov, Alexi Christakis, Alexis Conneau,...

work page internal anchor Pith review Pith/arXiv arXiv 2024