arXiv: 2508.04149 · v2 · pith:6RDEEJZ7new · submitted 2025-08-06 · 💻 cs.CL · cs.AI · cs.LG

Difficulty-Based Preference Data Selection by DPO Implicit Reward Gap

Pith reviewed 2026-05-21 23:42 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords DPO · preference optimization · data selection · reward gap · LLM alignment · data efficiency · difficulty sampling

The pith

Selecting preference data with smaller DPO implicit reward gaps allows superior LLM alignment using only 10 percent of the data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a way to pick which preference examples are worth training on by looking at how small the gap is in the model's implicit rewards between the preferred and dispreferred responses. Smaller gaps point to harder cases that give more learning signal. This selection strategy lets models reach better alignment performance than several other data selection methods, even when using just one tenth of the full preference dataset. A reader would care because building high-quality preference data is expensive, and this approach makes alignment more practical with limited budgets. The method is tested on multiple datasets and tasks to show consistent gains.

Core claim

The central claim is that preference data examples with smaller DPO implicit reward gaps represent more challenging cases, and selecting them for training leads to improved data efficiency and better model alignment compared to using random or other selection methods.

What carries the argument

The DPO implicit reward gap: the difference between the implicit rewards assigned to the preferred and rejected responses under the current policy. Examples are ranked by this gap, and those with the smallest gaps are selected as the most difficult training examples.
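As a rough illustration of the quantity above, the sketch below computes the gap from four sequence-level log-probabilities. The function names `implicit_reward` and `reward_gap`, the value `beta=0.1`, and the log-probabilities are all assumptions made for this example; they are not taken from the paper. One thing the sketch makes visible: if the scoring policy is identical to the reference model, every gap is exactly zero, so the score only becomes informative once the two models differ.

```python
def implicit_reward(logp_policy: float, logp_ref: float, beta: float = 0.1) -> float:
    """DPO implicit reward: beta * log(pi_theta(y|x) / pi_ref(y|x)).

    Inputs are sequence log-probabilities, so the log-ratio is a difference.
    """
    return beta * (logp_policy - logp_ref)


def reward_gap(logp_w_policy: float, logp_w_ref: float,
               logp_l_policy: float, logp_l_ref: float,
               beta: float = 0.1) -> float:
    """Gap between the implicit rewards of the preferred (w) and rejected (l) responses."""
    return (implicit_reward(logp_w_policy, logp_w_ref, beta)
            - implicit_reward(logp_l_policy, logp_l_ref, beta))


# Hypothetical log-probabilities for one preference pair.
gap = reward_gap(logp_w_policy=-12.0, logp_w_ref=-14.0,
                 logp_l_policy=-20.0, logp_l_ref=-19.0)
print(f"gap = {gap:.3f}")

# When the policy and the reference agree on both responses, the gap vanishes.
print(reward_gap(-5.0, -5.0, -7.0, -7.0))
```

In practice the log-probabilities would come from summing token log-probabilities of each response under the two models; that step is omitted here.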

If this is right

  • Models achieve higher alignment scores when trained on the selected subset than on the full dataset or random subsets.
  • The approach beats five strong baseline selection methods across several alignment benchmarks.
  • Only 10% of the data is needed to reach superior performance.
  • Data efficiency improves for DPO and potentially for other preference optimization techniques.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This selection criterion might generalize to other preference optimization methods beyond DPO by adapting the reward gap concept.
  • Practitioners could apply this to reduce annotation costs when building new preference datasets.
  • Testing on larger scale models could reveal if the gap remains a reliable difficulty signal as model size increases.

Load-bearing premise

A smaller DPO implicit reward gap actually indicates a more difficult or informative example instead of reflecting noise in the labels or peculiarities of the current model state.
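The label-noise side of this premise can be made concrete with a toy example. Swapping the preferred and rejected responses negates the gap, so a confidently scored pair with a flipped label lands at the very bottom of a signed ranking, while a ranking by magnitude pushes it to the top. The reward values, the flipped index, and the helper `ranked_indices` below are invented for illustration, and whether the paper ranks by signed value or by magnitude is not settled by this page.

```python
def ranked_indices(gaps, by_magnitude=False):
    """Indices sorted from smallest to largest gap (signed, or by absolute value)."""
    if by_magnitude:
        return sorted(range(len(gaps)), key=lambda i: abs(gaps[i]))
    return sorted(range(len(gaps)), key=lambda i: gaps[i])


# Hypothetical (reward of chosen, reward of rejected) pairs with clean labels.
rewards = [(1.2, 0.1), (0.8, 0.5), (2.0, -0.5), (0.4, 0.3)]

# Simulate label noise: the annotator swapped chosen and rejected for pair 2.
flipped_index = 2
noisy = list(rewards)
r_chosen, r_rejected = noisy[flipped_index]
noisy[flipped_index] = (r_rejected, r_chosen)

gaps = [chosen - rejected for chosen, rejected in noisy]

signed_order = ranked_indices(gaps)
magnitude_order = ranked_indices(gaps, by_magnitude=True)

print("signed order:   ", signed_order)     # flipped pair ranks first
print("magnitude order:", magnitude_order)  # flipped pair ranks last
```

Under a signed smallest-gap rule, the mislabeled pair would be the first example selected, which is the failure mode the premise has to rule out.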

What would settle it

Run the selection method and a random-selection baseline on the same dataset, then check whether the performance difference disappears or reverses on a standard alignment benchmark such as MT-Bench or AlpacaEval.

Figures

Figures reproduced from arXiv: 2508.04149 by Rongwu Xu, Xuan Qi, Zhijing Jin.

Figure 1. Illustration of our preference data selection pipeline.
Figure 2. Performance scaling effects with different data selection ratios on RewardBench using ...
Figure 3. The overlap of selected data among our method and four baselines. The legend indicates selection agreement: ...
Original abstract

Aligning large language models (LLMs) with human preferences is a critical challenge in AI research. While methods like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) are widely used, they often rely on large, costly preference datasets. The current work lacks methods for high-quality data selection specifically for preference data. In this work, we introduce a novel difficulty-based data selection strategy for preference datasets, grounded in the DPO implicit reward mechanism. By selecting preference data examples with smaller DPO implicit reward gaps, which are indicative of more challenging cases, we improve data efficiency and model alignment. Our approach consistently outperforms five strong baselines across multiple datasets and alignment tasks, achieving superior performance with only 10% of the original data. This principled, efficient selection method offers a promising solution for scaling LLM alignment with limited resources.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes selecting a subset of preference data for DPO by ranking examples according to the magnitude of their DPO implicit reward gap and retaining the bottom-k (smallest-gap) examples, which the authors interpret as the most difficult or informative cases. Experiments on multiple datasets and alignment tasks report that models trained on only 10% of the data selected this way outperform five baselines.
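The bottom-k rule described in the summary can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `select_smallest_gap`, the toy gap values, and the choice to rank by signed gap are assumptions.

```python
def select_smallest_gap(gaps, fraction=0.10):
    """Return indices of the `fraction` of examples with the smallest gaps.

    `gaps` holds one precomputed implicit reward gap per preference pair.
    At least one example is always returned for a non-empty input.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, int(len(gaps) * fraction))
    order = sorted(range(len(gaps)), key=lambda i: gaps[i])
    return order[:k]


# Hypothetical gaps for ten preference pairs.
gaps = [0.9, 0.05, 0.4, 1.3, 0.2, 0.7, 0.01, 0.6, 1.1, 0.3]

top_20_percent = select_smallest_gap(gaps, fraction=0.2)
top_10_percent = select_smallest_gap(gaps)

print(top_20_percent)
print(top_10_percent)
```

The returned indices would then be used to subset the preference dataset before running DPO on the reduced set.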

Significance. If the central claim holds after validation, the method would offer a low-cost, model-internal way to improve data efficiency in preference optimization without requiring additional human annotations or external difficulty oracles. It re-uses a quantity already computed inside DPO, which is a practical strength.

major comments (2)
  1. [Method] Method section, definition of Δ: The selection rule retains examples with the smallest Δ = β log(π_θ(y_w|x)/π_ref(y_w|x)) − β log(π_θ(y_l|x)/π_ref(y_l|x)). Because Δ is computed from the current policy π_θ, a small value can arise from model uncertainty, label noise, or reference-model mismatch rather than intrinsic example difficulty. The headline claim that smaller gaps mark “more challenging cases” therefore requires direct evidence (e.g., label-perturbation experiments or correlation with an independent difficulty measure) that is not supplied.
  2. [Experiments] Results section: The abstract and experimental claims state consistent outperformance with 10% data across five baselines and multiple tasks, yet no statistical significance tests, variance across random seeds, or controls for confounding variables (response length, topic distribution, or label quality) are reported. Without these, the reported gains cannot be confidently attributed to the difficulty-based selection rule.
minor comments (1)
  1. [Abstract] The abstract should explicitly name the five baselines and the concrete datasets/tasks so that the scope of the empirical claims is immediately clear.
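The correlation check requested in major comment 1 could look roughly like the following: a Spearman rank correlation between each example's gap and some independent difficulty score. The difficulty scores here are invented (they could stand for annotator disagreement rates, for instance), and the helper names are hypothetical; only the standard library is used.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(a), average_ranks(b))


# Hypothetical data: gaps and an independent difficulty score per example.
gaps = [0.05, 0.2, 0.4, 0.9, 1.3]
difficulty = [0.9, 0.7, 0.6, 0.3, 0.1]

rho = spearman(gaps, difficulty)
print(f"Spearman rho = {rho:.2f}")
```

A strongly negative coefficient on real data would support reading small gaps as difficulty; a coefficient near zero would leave the noise and model-state explanations open.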

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive feedback on our manuscript. We address each of the major comments below and have made revisions to strengthen the paper where appropriate.

  1. Referee: [Method] Method section, definition of Δ: The selection rule retains examples with the smallest Δ = β log(π_θ(y_w|x)/π_ref(y_w|x)) − β log(π_θ(y_l|x)/π_ref(y_l|x)). Because Δ is computed from the current policy π_θ, a small value can arise from model uncertainty, label noise, or reference-model mismatch rather than intrinsic example difficulty. The headline claim that smaller gaps mark “more challenging cases” therefore requires direct evidence (e.g., label-perturbation experiments or correlation with an independent difficulty measure) that is not supplied.

    Authors: We appreciate the referee's observation regarding the potential sources of small Δ values. In the DPO framework, the implicit reward gap Δ quantifies the difference in log-probability ratios between the winning and losing responses. Our selection strategy is motivated by the idea that examples with small gaps are those where the model has not yet strongly differentiated the preferred response, which we posit correspond to more informative or challenging cases for alignment. However, we agree that this interpretation would benefit from additional supporting analysis. In the revised manuscript, we have expanded the Method section to discuss alternative explanations for small Δ and added an empirical analysis showing correlation between low-Δ examples and other difficulty indicators such as higher model entropy on the responses. revision: partial

  2. Referee: [Experiments] Results section: The abstract and experimental claims state consistent outperformance with 10% data across five baselines and multiple tasks, yet no statistical significance tests, variance across random seeds, or controls for confounding variables (response length, topic distribution, or label quality) are reported. Without these, the reported gains cannot be confidently attributed to the difficulty-based selection rule.

    Authors: We acknowledge the importance of statistical rigor in validating our experimental results. The original manuscript reported average performance improvements but did not include variance estimates or formal significance testing. In the revised version, we have re-run the experiments with multiple random seeds (at least 3 per setting) and report mean and standard deviation. We have also performed paired t-tests to assess statistical significance of the improvements over baselines. Additionally, we include controls by reporting results on length-matched subsets and analyzing topic distributions to rule out confounding factors. These additions are presented in the updated Results section and Appendix. revision: yes
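The seed-level comparison described in the second response can be sketched as a paired t statistic over per-seed scores. The scores below are made up for illustration and the function name `paired_t_statistic` is hypothetical; the sketch uses only the standard library and returns the statistic and degrees of freedom rather than a p-value.

```python
import math
import statistics


def paired_t_statistic(method_scores, baseline_scores):
    """Paired t statistic and degrees of freedom for per-seed score pairs."""
    if len(method_scores) != len(baseline_scores):
        raise ValueError("score lists must have the same length")
    diffs = [m - b for m, b in zip(method_scores, baseline_scores)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired runs")
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    if sd == 0.0:
        raise ValueError("differences are constant; t statistic is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1


# Hypothetical benchmark scores from three seeds (same seed order in both lists).
method = [71.0, 72.0, 73.0]
baseline = [70.0, 70.0, 70.0]

t_stat, dof = paired_t_statistic(method, baseline)
print(f"t = {t_stat:.3f}, df = {dof}")
```

With only three seeds there are two degrees of freedom, and the two-sided 5% critical value is about 4.30, so even the consistent gains in this toy example would not reach significance. Where SciPy is available, `scipy.stats.ttest_rel` computes the same statistic together with a p-value.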

Circularity Check

0 steps flagged

No significant circularity in the data selection heuristic

The paper takes the standard DPO implicit reward gap Δ = β log(π_θ(y_w|x)/π_ref(y_w|x)) − β log(π_θ(y_l|x)/π_ref(y_l|x)) directly from the DPO formulation and proposes its use as a proxy for example difficulty in a downstream selection rule. This is an empirical heuristic rather than a derivation that reduces the claimed result to its inputs by construction. No self-definitional loop, fitted parameter renamed as prediction, or load-bearing self-citation chain is present; the performance gains at 10% data are reported via external baseline comparisons and remain falsifiable. The method is self-contained against the DPO reference without importing uniqueness theorems or ansatzes from the authors' prior work.
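For reference, the relationship between Δ and the standard DPO objective can be written out as follows. This is the textbook DPO loss and its gradient expressed in terms of Δ, included as a worked derivation; it is not a reproduction of the paper's own equations.

```latex
\[
\hat r_\theta(x,y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)},
\qquad
\Delta = \hat r_\theta(x,y_w) - \hat r_\theta(x,y_l)
\]
\[
\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,y_w,y_l)}\big[\log \sigma(\Delta)\big],
\qquad
\nabla_\theta \mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,y_w,y_l)}\big[\sigma(-\Delta)\,\nabla_\theta \Delta\big]
\]
```

The per-example weight σ(−Δ) equals 1/2 at Δ = 0 and decays toward 0 as Δ grows, so pairs with small gaps contribute larger gradient updates. This holds when Δ is evaluated with the policy being trained; if the selection score is computed with a different model, the link between a small gap and a large training signal is only approximate.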

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The method inherits the standard DPO loss and implicit reward definition from prior work; no new free parameters, axioms, or invented entities are introduced in the abstract.

pith-pipeline@v0.9.0 · 5673 in / 1077 out tokens · 34554 ms · 2026-05-21T23:42:41.928229+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Data Difficulty and the Generalization--Extrapolation Tradeoff in LLM Fine-Tuning

    cs.LG 2026-05 unverdicted novelty 6.0

    For a fixed data budget in LLM supervised fine-tuning, optimal data difficulty shifts toward harder examples as the budget grows because of the tradeoff between in-distribution generalization gap and extrapolation gap.
