arxiv: 2407.16216 · v4 · pith:NQJTNYMInew · submitted 2024-07-23 · 💻 cs.CL

Reinforcement Learning for LLM Post-Training: A Survey

Pith reviewed 2026-05-23 22:34 UTC · model grok-4.3

classification 💻 cs.CL
keywords LLM post-training · RLHF · RLVR · policy gradient framework · DPO · PPO · survey

The pith

A single policy gradient framework unifies pretraining, SFT, RLHF, and RLVR as special cases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This survey derives a unified policy gradient framework under which pretraining, supervised fine-tuning, reinforcement learning from human feedback, and reinforcement learning with verifiable rewards appear as special cases. It sorts the methods by choices along three axes: prompt sampling, response sampling, and gradient coefficient, while supplying standardized notation. A reader would care because the result organizes recent techniques and enables direct technical comparisons among approaches used to improve LLM alignment and performance on tasks such as math and coding.

Core claim

The paper establishes a single policy gradient framework that unifies pretraining, SFT, RLHF, and RLVR as special cases while also organizing the more recent techniques therein. The framework decomposes methods along the axes of prompt sampling, response sampling, and gradient coefficient, supplies standardized notation for cross-method comparison, and includes detailed analysis of PPO-based, GRPO-based, and DPO approaches together with comparisons of their implementation details and empirical results.
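
As a rough illustration of what such a unification can look like, the general form below is a sketch in assumed notation, not the paper's own equation. The symbols p (prompt distribution), q (response sampling distribution), and c (gradient coefficient) are labels chosen here to mirror the three axes named above.

```latex
% Sketch only: p, q, and c are assumed symbols for the three axes.
\nabla_\theta J(\theta)
  = \mathbb{E}_{x \sim p(x)}\;
    \mathbb{E}_{y \sim q(\cdot \mid x)}
    \bigl[\, c(x, y)\, \nabla_\theta \log \pi_\theta(y \mid x) \,\bigr]

% Two standard special cases:
%   Maximum likelihood (pretraining, SFT): q = data distribution, c(x, y) = 1
%   REINFORCE-style RL:                    q = \pi_\theta,        c(x, y) = r(x, y) - b(x)
```

Under this reading, maximum-likelihood training and reward-driven training differ only in where responses come from and how each log-probability gradient is weighted.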

What carries the argument

The unified policy gradient framework, obtained by varying prompt sampling, response sampling, and gradient coefficient to recover different post-training methods as special cases.
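
The three-axis idea can be sketched on a toy problem. The code below is a minimal illustration under stated assumptions, not the survey's implementation: it uses a tabular softmax policy over three candidate responses for a single prompt, and the names `unified_step`, `sample_prompt`, `sample_response`, `coefficient`, and `train` are hypothetical. Swapping the response sampler and the coefficient is enough to move between an SFT-like update and a REINFORCE-like update.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_prob(logits, y):
    """Gradient of log softmax(logits)[y] with respect to the logits."""
    probs = softmax(logits)
    return [(1.0 if i == y else 0.0) - p for i, p in enumerate(probs)]


def unified_step(theta, sample_prompt, sample_response, coefficient, lr=0.5):
    """One update: theta[x] += lr * c(x, y) * grad log pi(y | x)."""
    x = sample_prompt()                # axis 1: prompt sampling
    y = sample_response(x, theta)      # axis 2: response sampling
    c = coefficient(x, y)              # axis 3: gradient coefficient
    g = grad_log_prob(theta[x], y)
    theta[x] = [t + lr * c * gi for t, gi in zip(theta[x], g)]


def train(sample_response, coefficient, steps=500, seed=0):
    """Run the toy update loop on one prompt (id 0) with three responses."""
    random.seed(seed)
    theta = {0: [0.0, 0.0, 0.0]}
    for _ in range(steps):
        unified_step(theta, lambda: 0, sample_response, coefficient)
    return softmax(theta[0])


TARGET = 2  # hypothetical index of the "good" response

# SFT-like: the response comes from a dataset label, coefficient is 1.
sft_probs = train(lambda x, theta: TARGET, lambda x, y: 1.0)


def sample_from_policy(x, theta):
    """On-policy sampling: draw a response index from the current policy."""
    return random.choices(range(len(theta[x])), weights=softmax(theta[x]))[0]


# RL-like: the response comes from the policy, coefficient is a 0/1 reward.
rl_probs = train(sample_from_policy, lambda x, y: 1.0 if y == TARGET else 0.0)
```

Both runs push probability mass toward the target response; only the sampler and the coefficient differ between them, which is the kind of comparison the decomposition is meant to support.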

If this is right

  • Methods from pretraining through RLVR can be recovered and compared inside one shared notation.
  • Recent PPO, GRPO, and DPO variants fit inside the same three-axis decomposition.
  • Implementation choices and empirical outcomes become directly comparable across approaches.
  • The framework supplies a self-contained foundation for analyzing new post-training variants.
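
As one example of reading a DPO-style method through the gradient-coefficient axis, the sketch below computes the per-response weights that appear in the standard DPO gradient: the chosen response is pushed up and the rejected response pushed down, both scaled by beta times a sigmoid of the implicit-reward gap. The function and argument names are hypothetical, and the survey's own mapping is not reproduced on this page.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dpo_gradient_coefficients(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Return (coefficient for chosen response, coefficient for rejected response).

    logp_* are sequence log-probabilities under the current policy and
    ref_logp_* under the frozen reference policy (all assumed inputs).
    """
    implicit_w = beta * (logp_w - ref_logp_w)  # implicit reward of the chosen response
    implicit_l = beta * (logp_l - ref_logp_l)  # implicit reward of the rejected response
    # The weight is large when the policy currently ranks the pair wrongly.
    weight = beta * sigmoid(implicit_l - implicit_w)
    return weight, -weight
```

When the policy equals the reference, both implicit rewards are zero and the weight is `beta / 2`; as the policy comes to prefer the chosen response, the weight shrinks toward zero.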

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Researchers could systematically generate new methods by selecting untried combinations along the three axes.
  • The decomposition may highlight whether any emerging technique falls outside the current structure.
  • If the axes prove sufficient, the framework could serve as a design space for exploring hybrid training procedures.

Load-bearing premise

The space of post-training methods can be exhaustively captured by the three axes of prompt sampling, response sampling, and gradient coefficient without needing further independent dimensions.

What would settle it

Discovery of a post-training method whose mechanics cannot be expressed by any combination of choices on the three axes of prompt sampling, response sampling, and gradient coefficient.

Figures

Figures reproduced from arXiv: 2407.16216 by Bin Bi, Kiran Ramnath, Na (Claire) Cheng, Shiva Kumar Pentyala, Shubham Mehrotra, Sitaram Asur, Sougata Chaudhuri, Xiang-Bo Mao, Zhichao Wang, Zixu (James) Zhu.

Figure 1: The 13 categorical directions for xPO to align an LLM with human preference
Figure 2: The four subtopics of reward model
Figure 3: The four subtopics of feedback
Figure 4: The two subtopics of optimization
Original abstract

Large language models (LLMs) trained via pretraining and supervised fine-tuning (SFT) can still produce harmful and misaligned outputs, or struggle in domains like math and coding. Reinforcement learning (RL)-based post-training methods, including Reinforcement Learning from Human Feedback (RLHF) methods like Direct Preference Optimization (DPO) and Reinforcement Learning with Verifiable Rewards (RLVR) approaches like PPO and GRPO, have made remarkable gains to alleviate these issues. Yet, no existing work offers a technically detailed comparison of the various methods driving this progress. In order to fill this gap, we present a timely survey that connects foundational components with latest advancements. We derive a single policy gradient framework that unifies pretraining, SFT, RLHF, and RLVR as special cases while also organizing the more recent techniques therein. The main contributions of our survey are as follows: (1) a self-contained introduction to MLE, RLHF, and RLVR foundations and the unified policy gradient framework; (2) detailed technical analysis of PPO- and GRPO-based methods alongside offline and iterative DPO approaches, decomposed along prompt sampling, response sampling, and gradient coefficient axes; (3) standardized notation enabling direct cross-method comparison; and (4) comprehensive comparison of implementation details and empirical results of each method in the appendix. We aim to serve as a technically grounded reference for researchers and practitioners working on LLM post-training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript is a survey on reinforcement learning methods for LLM post-training. It claims to derive a single policy gradient framework that unifies pretraining, SFT, RLHF (including DPO), and RLVR (including PPO and GRPO) as special cases, with more recent techniques organized by varying only along the three axes of prompt sampling, response sampling, and gradient coefficient. Additional contributions include a self-contained introduction to foundations, standardized notation for cross-method comparison, detailed technical analysis of PPO/GRPO and offline/iterative DPO methods, and empirical comparisons in the appendix.

Significance. If the unification holds without omitted structural variations, the survey would provide a valuable technically grounded reference with standardized notation that enables direct comparisons across the rapidly developing set of post-training methods. The decomposition into three axes and the appendix comparisons of implementation details and results would be useful organizing tools for the field.

major comments (1)
  1. [Abstract] Abstract and contribution (2): the central claim that every post-training method reduces to a special case of the unified policy gradient framework by varying only prompt sampling, response sampling, and gradient coefficient must be demonstrated by explicit mappings for all cited algorithms. Methods introducing auxiliary objectives, distinct value estimators, or constraint mechanisms (e.g., explicit KL penalties formulated separately from the gradient term, or multi-turn credit assignment) would require showing that these components are fully absorbed into one of the three axes without remainder; any residual component would falsify the exhaustiveness of the unification.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive feedback on the unification claim. We agree that explicit mappings are necessary to substantiate the framework's exhaustiveness and will revise the manuscript to provide them.

Point-by-point responses
  1. Referee: [Abstract] Abstract and contribution (2): the central claim that every post-training method reduces to a special case of the unified policy gradient framework by varying only prompt sampling, response sampling, and gradient coefficient must be demonstrated by explicit mappings for all cited algorithms. Methods introducing auxiliary objectives, distinct value estimators, or constraint mechanisms (e.g., explicit KL penalties formulated separately from the gradient term, or multi-turn credit assignment) would require showing that these components are fully absorbed into one of the three axes without remainder; any residual component would falsify the exhaustiveness of the unification.

    Authors: We agree that the central claim requires explicit demonstration. In the revised version, we will add a new subsection (under Section 3 on the unified framework) containing a table that provides one-to-one mappings for every algorithm cited in the survey. Each row will specify the exact prompt sampling distribution, response sampling distribution, and gradient coefficient used, showing how the method is recovered as a special case. For auxiliary objectives and constraints: the KL penalty term in PPO/GRPO is absorbed directly into the gradient coefficient (as a subtracted term in the advantage-weighted objective); distinct value estimators in actor-critic variants are folded into the response sampling axis via the baseline subtraction; and any auxiliary losses (e.g., in certain DPO variants) are shown to be equivalent to modified gradient coefficients. Multi-turn credit assignment is outside the scope of the current survey, which focuses on single-turn post-training methods; we will explicitly state this scope limitation and note that multi-turn extensions would require an additional temporal axis. These additions will confirm that no residual components remain for the covered methods.

    Revision: yes
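
    The absorption of a KL penalty into the gradient coefficient can be sketched as follows. This is one common RLHF convention (subtracting a scaled log-ratio against a reference policy from the reward before a baseline is removed), shown here as an assumption rather than as the mapping the authors promise to add. The function name `kl_shaped_coefficients` and its arguments are hypothetical.

```python
def kl_shaped_coefficients(rewards, logp_policy, logp_ref, beta=0.1):
    """Per-response coefficients: (reward - beta * log-ratio) minus the group mean.

    rewards, logp_policy, and logp_ref are equal-length lists for responses
    sampled from the same prompt (assumed inputs for this sketch).
    """
    # Fold the KL penalty into the reward as a subtracted log-ratio term.
    shaped = [
        r - beta * (lp - lr)
        for r, lp, lr in zip(rewards, logp_policy, logp_ref)
    ]
    # Use the group mean as a simple baseline so coefficients sum to zero.
    baseline = sum(shaped) / len(shaped)
    return [s - baseline for s in shaped]
```

    In this form the penalty changes only the scalar that multiplies each log-probability gradient, which is the sense in which it would sit on the gradient-coefficient axis. A KL term added to the loss separately would need its own argument.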

Circularity Check

0 steps flagged

No circularity: survey organizes existing methods via three-axis decomposition without self-referential reduction

Full rationale

This is a survey paper whose central contribution is an organizational framework that places prior algorithms (pretraining, SFT, RLHF, RLVR, PPO, DPO, etc.) into a common policy-gradient template by varying prompt sampling, response sampling, and gradient coefficient. No equations or claims reduce a derived quantity to a parameter fitted from the paper's own data; the unification is an explicit re-expression of published methods rather than a tautological redefinition. No self-citation chain is load-bearing for the framework itself, and the work does not present fitted predictions that are statistically forced by construction. The three-axis decomposition may or may not be exhaustive, but that is a question of coverage, not circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a survey paper the contribution is organizational; no new free parameters, axioms, or invented entities are introduced in the abstract.

pith-pipeline@v0.9.0 · 5829 in / 1028 out tokens · 19468 ms · 2026-05-23T22:34:04.525233+00:00 · methodology


Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ReCrit: Transition-Aware Reinforcement Learning for Scientific Critic Reasoning

    cs.LG 2026-05 unverdicted novelty 7.0

    ReCrit frames critic interaction as a correctness-transition problem and uses quadrant-based RL rewards to improve LLM performance on scientific reasoning benchmarks by rewarding corrections and robustness while penal...

  2. ProjLens: Unveiling the Role of Projectors in Multimodal Model Safety

    cs.CR 2026-04 unverdicted novelty 7.0

    ProjLens shows that backdoor parameters in MLLMs are encoded in low-rank subspaces of the projector and that embeddings shift toward the target direction with magnitude linear in input norm, activating only on poisone...

  3. RACC: Representation-Aware Coverage Criteria for LLM Safety Testing

    cs.SE 2026-02 unverdicted novelty 7.0

    RACC defines six representation-aware coverage criteria that score jailbreak test suites by measuring activation of safety concepts extracted from LLM hidden states on a calibration set.

  4. Token Buncher: Shielding LLMs from Harmful Reinforcement Learning Fine-Tuning

    cs.LG 2025-08 unverdicted novelty 7.0

    TokenBuncher constrains response entropy via entropy-as-reward RL and a Token Noiser to stop harmful RL fine-tuning while keeping benign performance intact.

  5. EyeMulator: Improving Code Language Models by Mimicking Human Visual Attention

    cs.SE 2025-08 unverdicted novelty 7.0

    EyeMulator augments CodeLLM fine-tuning loss with token weights derived from human eye-tracking scan paths, producing large gains on code translation and summarization across StarCoder, Llama-3.2 and DeepSeek-Coder.

  6. Distinguishable Deletion: Unifying Knowledge Erasure and Refusal for Large Language Model Unlearning

    cs.LG 2026-05 unverdicted novelty 6.0

    Distinguishable Deletion unifies knowledge erasure and refusal for LLM unlearning via an energy index that enforces boundaries during training and enables refusal at inference.

  7. UNIPO: Unified Interactive Visual Explanation for RL Fine-Tuning Policy Optimization

    cs.HC 2026-05 unverdicted novelty 6.0

    UNIPO is the first unified interactive visualization tool exposing token-level training dynamics of RL fine-tuning algorithms for LLMs through high-level overviews, step inspectors, and side-by-side comparisons.

  8. Pref-CTRL: Preference Driven LLM Alignment using Representation Editing

    cs.CL 2026-04 unverdicted novelty 6.0

    Pref-CTRL trains a multi-objective value function on preferences to guide representation editing for LLM alignment, outperforming RE-Control on benchmarks with better out-of-domain generalization.

  9. Mobile GUI Agent Privacy Personalization with Trajectory Induced Preference Optimization

    cs.AI 2026-04 unverdicted novelty 6.0

    TIPO applies preference-intensity weighting and padding gating to stabilize preference optimization for privacy personalization in mobile GUI agents, yielding higher alignment and distinction metrics than prior methods.

  10. The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training

    cs.CR 2026-04 unverdicted novelty 6.0

    ORPO is most effective at misaligning LLMs while DPO excels at realigning them, though it reduces utility, revealing an asymmetry between attack and defense methods.

  11. VC-Soup: Value-Consistency Guided Multi-Value Alignment for Large Language Models

    cs.LG 2026-03 unverdicted novelty 6.0

    VC-Soup uses a cosine-similarity consistency metric to filter data, trains value-consistent policies, and applies linear merging with Pareto filtering to improve multi-value LLM alignment trade-offs.

  12. Safety Alignment as Continual Learning: Mitigating the Alignment Tax via Orthogonal Gradient Projection

    cs.LG 2026-02 conditional novelty 6.0

    OGPSA projects safety gradients orthogonal to a low-rank subspace from general capability gradients, improving safety-utility trade-offs in SFT and DPO pipelines on Qwen2.5-7B and Llama3.1-8B.

  13. SCOPE-RL: Stable and Quantitative Control of Policy Entropy in RL Post-Training

    cs.LG 2025-10 unverdicted novelty 6.0

    SCOPE-RL adds a regularization term built from high-temperature positive samples to quantitatively control entropy dynamics and maintain exploration in RL post-training of reasoning LLMs.

  14. The Landscape of Agentic Reinforcement Learning for LLMs: A Survey

    cs.AI 2025-09 accept novelty 6.0

    Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.

  15. Exploring the Secondary Risks of Large Language Models

    cs.LG 2025-06 unverdicted novelty 6.0

    Introduces secondary risks as a new class of LLM failures from benign prompts, defines two primitives, proposes SecLens search framework, and releases SecRiskBench showing risks are widespread across 16 models.

  16. Stable-GFlowNet: Toward Diverse and Robust LLM Red-Teaming via Contrastive Trajectory Balance

    cs.LG 2026-05 unverdicted novelty 5.0

    Stable-GFlowNet improves training stability and attack diversity in LLM red-teaming by eliminating Z estimation via contrastive trajectory balance while preserving GFN optimality.

  17. Generating Place-Based Compromises Between Two Points of View

    cs.CL 2026-04 unverdicted novelty 5.0

    Empathic similarity feedback in prompts generates more acceptable compromises than chain-of-thought, and margin-based training on the resulting data lets smaller models produce them without ongoing empathy estimation.

  18. Query Expansion in the Age of Pre-trained and Large Language Models: A Comprehensive Survey

    cs.IR 2025-09 unverdicted novelty 5.0

    A comprehensive survey that organizes query expansion methods in the PLM/LLM era along four design dimensions, synthesizes application patterns, and outlines future directions.

  19. ReGA: Model-Based Safeguard for LLMs via Representation-Guided Abstraction

    cs.CR 2025-06 unverdicted novelty 5.0

    ReGA uses safety-critical representations to guide abstraction in model-based analysis, enabling scalable detection of harmful LLM inputs with reported AUROC of 0.975 at prompt level.

  20. Agents Should Replace Narrow Predictive AI as the Orchestrator in 6G AI-RAN

    cs.NI 2026-05 unverdicted novelty 4.0

    Position paper proposes replacing fragmented narrow AI models with LLMs as the cognitive orchestrator in the RAN Intelligent Controller for Level 5 autonomous 6G networks.

  21. Rethinking Agentic Reinforcement Learning In Large Language Models

    cs.AI 2026-04 unverdicted novelty 3.0

    The paper reviews conceptual foundations, methodological innovations, effective designs, critical challenges, and future directions for LLM-based Agentic Reinforcement Learning.

  22. Rethinking Agentic Reinforcement Learning In Large Language Models

    cs.AI 2026-04 unverdicted novelty 2.0

    This review synthesizes conceptual foundations, methods, challenges, and future directions for agentic reinforcement learning in large language models.

  23. Rethinking Agentic Reinforcement Learning In Large Language Models

    cs.AI 2026-04 unverdicted novelty 2.0

    The paper surveys the conceptual foundations, methodological innovations, challenges, and future directions of agentic reinforcement learning frameworks that embed cognitive capabilities like meta-reasoning and self-r...

Reference graph

Works this paper leans on

96 extracted references · 96 canonical work pages · cited by 21 Pith papers · 5 internal anchors


    Colin Raffel, Noam M. Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1–140:67, 2019

  51. [51]

    Liu, and Jialu Liu

    Tianqi Liu, Yao Zhao, Rishabh Joshi, Misha Khalman, Mohammad Saleh, Peter J. Liu, and Jialu Liu. Statistical rejection sampling improves preference optimization, 2024

  52. [53]

    Is dpo superior to ppo for llm alignment? a comprehensive study

    Shusheng Xu, Wei Fu, Jiaxuan Gao, Wenjie Ye, Weilin Liu, Zhiyu Mei, Guangju Wang, Chao Yu, and Yi Wu. Is dpo superior to ppo for llm alignment? a comprehensive study. arXiv preprint arXiv:2404.10719, 2024

  53. [54]

    Maas, Raymond E

    Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y . Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Dekang Lin, Yuji Matsumoto, and Rada Mihalcea, editors, Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA,...

  54. [55]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv:1803.05457v1, 2018

  55. [56]

    Hellaswag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics , 2019

    Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics , 2019

  56. [57]

    Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu

    Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T. Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. Metamath: Bootstrap your own mathematical questions for large language models, 2024

  57. [58]

    Xing, Hao Zhang, Joseph E

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023

  58. [59]

    Pythia: A suite for analyzing large language models across training and scaling

    Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pages 2397–2430. PMLR, 2023

  59. [60]

    Solar 10.7b: Scaling large language models with simple yet effective depth up-scaling, 2024

    Dahyun Kim, Chanjun Park, Sanghoon Kim, Wonsung Lee, Wonho Song, Yunsu Kim, Hyeonwoo Kim, Yungi Kim, Hyeonju Lee, Jihoo Kim, Changbae Ahn, Seonghoon Yang, Sukyung Lee, Hyunbyung Park, Gyoungjin Gim, Mikyoung Cha, Hwalsuk Lee, and Sunghun Kim. Solar 10.7b: Scaling large language models with simple yet effective depth up-scaling, 2024

  60. [61]

    Orca: Progressive learning from complex explanation traces of gpt-4, 2023

    Subhabrata Mukherjee, Arindam Mitra, Ganesh Jawahar, Sahaj Agarwal, Hamid Palangi, and Ahmed Awadallah. Orca: Progressive learning from complex explanation traces of gpt-4, 2023

  61. [62]

    Ultrafeedback: Boosting language models with high-quality feedback, 2023

    Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, and Maosong Sun. Ultrafeedback: Boosting language models with high-quality feedback, 2023

  62. [63]

    Measuring massive multitask language understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR), 2021

  63. [64]

    Winogrande: An adversarial winograd schema challenge at scale, 2019

    Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale, 2019

  64. [65]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021. 35 A Comprehensive Survey of LLM Alignment Techniques: RLHF , RLAIF , PPO, DPO and More

  65. [66]

    Generalized preference optimization: A unified approach to offline alignment, 2024

    Yunhao Tang, Zhaohan Daniel Guo, Zeyu Zheng, Daniele Calandriello, Rémi Munos, Mark Rowland, Pierre Har- vey Richemond, Michal Valko, Bernardo Ávila Pires, and Bilal Piot. Generalized preference optimization: A unified approach to offline alignment, 2024

  66. [67]

    Language models are unsupervised multitask learners

    Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019

  67. [68]

    Llama 2: Open foundation and fine-tuned chat models, 2023

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Harts...

  68. [69]

    The cringe loss: Learning what language not to model, 2022

    Leonard Adolphs, Tianyu Gao, Jing Xu, Kurt Shuster, Sainbayar Sukhbaatar, and Jason Weston. The cringe loss: Learning what language not to model, 2022

  69. [70]

    Hashimoto

    Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Alpacafarm: A simulation framework for methods that learn from human feedback, 2024

  70. [71]

    Advances in prospect theory: Cumulative representation of uncertainty

    Amos Tversky and Daniel Kahneman. Advances in prospect theory: Cumulative representation of uncertainty. Journal of Risk and Uncertainty, 5:297–323, 1992

  71. [72]

    Proximal policy optimization algorithms, 2017

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017

  72. [73]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

  73. [74]

    Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

    Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, , and Jason Wei. Challenging big-bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022

  74. [75]

    Phi-2: The surprising power of small language models

    Mojan Javaheripi, Sébastien Bubeck, Marah Abdin, Jyoti Aneja, Sebastien Bubeck, Caio César Teodoro Mendes, Weizhu Chen, Allie Del Giorno, Ronen Eldan, Sivakanth Gopi, et al. Phi-2: The surprising power of small language models. Microsoft Research Blog, 2023

  75. [76]

    Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7b, 2023

  76. [77]

    Instruction-following evaluation for large language models, 2023

    Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models, 2023

  77. [78]

    Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models, 2021

  78. [79]

    Gonzalez, and Ion Stoica

    Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Tianhao Wu, Banghua Zhu, Joseph E. Gonzalez, and Ion Stoica. From crowdsourced data to high-quality benchmarks: Arena-hard and benchbuilder pipeline, 2024

  79. [80]

    Simple statistical gradient-following algorithms for connectionist reinforcement learning

    Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8:229–256, 1992

  80. [81]

    Buy 4 REINFORCE samples, get a baseline for free!, 2019

    Wouter Kool, Herke van Hoof, and Max Welling. Buy 4 REINFORCE samples, get a baseline for free!, 2019. 36 A Comprehensive Survey of LLM Alignment Techniques: RLHF , RLAIF , PPO, DPO and More
