Reinforcement Learning for LLM Post-Training: A Survey

Zhichao Wang, Kiran Ramnath, Bin Bi, Shiva Kumar Pentyala, Sougata Chaudhuri, Shubham Mehrotra, Zixu (James) Zhu, Xiang-Bo Mao, Sitaram Asur, Na (Claire) Cheng

classification 💻 cs.CL

keywords learning, methods, reinforcement, comparison, gradient, like, post-training, rlhf

Large language models (LLMs) trained via pretraining and supervised fine-tuning (SFT) can still produce harmful and misaligned outputs, or struggle in domains like math and coding. Reinforcement learning (RL)-based post-training methods, including Reinforcement Learning from Human Feedback (RLHF) methods like Direct Preference Optimization (DPO) and Reinforcement Learning with Verifiable Rewards (RLVR) approaches like PPO and GRPO, have made remarkable progress in alleviating these issues. Yet no existing work offers a technically detailed comparison of the various methods driving this progress. To fill this gap, we present a timely survey that connects foundational components with the latest advancements. We derive a single policy gradient framework that unifies pretraining, SFT, RLHF, and RLVR as special cases while also organizing the more recent techniques therein. The main contributions of our survey are as follows: (1) a self-contained introduction to MLE, RLHF, and RLVR foundations and the unified policy gradient framework; (2) a detailed technical analysis of PPO- and GRPO-based methods alongside offline and iterative DPO approaches, decomposed along prompt sampling, response sampling, and gradient coefficient axes; (3) standardized notation enabling direct cross-method comparison; and (4) a comprehensive comparison of implementation details and empirical results of each method in the appendix. We aim to serve as a technically grounded reference for researchers and practitioners working on LLM post-training.
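
As a rough illustration of the "gradient coefficient" axis named in the abstract, the Python sketch below treats each training signal as a per-response weight on the log-likelihood gradient. With every weight set to 1 the surrogate reduces to the usual MLE/SFT negative log-likelihood, and with group-standardized rewards it resembles a GRPO-style update. The function names, the toy numbers, and the choice of a population standard deviation are assumptions made for this example; they are not the survey's notation or implementation.

```python
from statistics import mean, pstdev


def mle_coefficients(num_responses):
    """MLE / SFT view (assumed): every demonstrated response gets weight 1."""
    return [1.0] * num_responses


def group_normalized_coefficients(rewards, eps=1e-8):
    """GRPO-style view (assumed): reward standardized within one prompt's group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


def surrogate_loss(log_probs, coefficients):
    """Negative coefficient-weighted mean log-probability.

    Differentiating this with the coefficients held constant gives
    -mean(c_i * grad log pi(y_i | x)), a policy-gradient-style estimate.
    """
    return -mean([c * lp for c, lp in zip(coefficients, log_probs)])


# Hypothetical log-probabilities of four sampled responses to one prompt.
log_probs = [-1.2, -0.7, -2.3, -0.9]
# Hypothetical verifiable rewards (1 = correct answer, 0 = incorrect).
rewards = [1.0, 0.0, 0.0, 1.0]

sft_loss = surrogate_loss(log_probs, mle_coefficients(len(log_probs)))
rl_loss = surrogate_loss(log_probs, group_normalized_coefficients(rewards))
print(sft_loss, rl_loss)
```

In this toy setup only the coefficient function changes between the two losses, while the sampled responses and the log-probabilities stay the same. Prompt sampling and response sampling, the other two axes named in the abstract, are held fixed here and would be separate choices in a fuller example.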

Forward citations

Cited by 11 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ProjLens: Unveiling the Role of Projectors in Multimodal Model Safety
cs.CR 2026-04 unverdicted novelty 7.0

ProjLens shows that backdoor parameters in MLLMs are encoded in low-rank subspaces of the projector and that embeddings shift toward the target direction with magnitude linear in input norm, activating only on poisone...
UNIPO: Unified Interactive Visual Explanation for RL Fine-Tuning Policy Optimization
cs.HC 2026-05 unverdicted novelty 6.0

UNIPO is the first unified interactive visualization tool exposing token-level training dynamics of RL fine-tuning algorithms for LLMs through high-level overviews, step inspectors, and side-by-side comparisons.
Pref-CTRL: Preference Driven LLM Alignment using Representation Editing
cs.CL 2026-04 unverdicted novelty 6.0

Pref-CTRL trains a multi-objective value function on preferences to guide representation editing for LLM alignment, outperforming RE-Control on benchmarks with better out-of-domain generalization.
Mobile GUI Agent Privacy Personalization with Trajectory Induced Preference Optimization
cs.AI 2026-04 unverdicted novelty 6.0

TIPO applies preference-intensity weighting and padding gating to stabilize preference optimization for privacy personalization in mobile GUI agents, yielding higher alignment and distinction metrics than prior methods.
The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training
cs.CR 2026-04 unverdicted novelty 6.0

ORPO is most effective at misaligning LLMs while DPO excels at realigning them, though it reduces utility, revealing an asymmetry between attack and defense methods.
VC-Soup: Value-Consistency Guided Multi-Value Alignment for Large Language Models
cs.LG 2026-03 unverdicted novelty 6.0

VC-Soup uses a cosine-similarity consistency metric to filter data, trains value-consistent policies, and applies linear merging with Pareto filtering to improve multi-value LLM alignment trade-offs.
Stable-GFlowNet: Toward Diverse and Robust LLM Red-Teaming via Contrastive Trajectory Balance
cs.LG 2026-05 unverdicted novelty 5.0

Stable-GFlowNet improves training stability and attack diversity in LLM red-teaming by eliminating Z estimation via contrastive trajectory balance while preserving GFN optimality.
Generating Place-Based Compromises Between Two Points of View
cs.CL 2026-04 unverdicted novelty 5.0

Empathic similarity feedback in prompts generates more acceptable compromises than chain-of-thought, and margin-based training on the resulting data lets smaller models produce them without ongoing empathy estimation.
Agents Should Replace Narrow Predictive AI as the Orchestrator in 6G AI-RAN
cs.NI 2026-05 unverdicted novelty 4.0

Position paper proposes replacing fragmented narrow AI models with LLMs as the cognitive orchestrator in the RAN Intelligent Controller for Level 5 autonomous 6G networks.
A Brief Overview: Agentic Reinforcement Learning In Large Language Models
cs.AI 2026-04 unverdicted novelty 2.0

This review synthesizes conceptual foundations, methods, challenges, and future directions for agentic reinforcement learning in large language models.
A Brief Overview: Agentic Reinforcement Learning In Large Language Models
cs.AI 2026-04 unverdicted novelty 2.0

The paper surveys the conceptual foundations, methodological innovations, challenges, and future directions of agentic reinforcement learning frameworks that embed cognitive capabilities like meta-reasoning and self-r...