Reinforcement learning meets large language models: A survey of advancements and applications across the LLM lifecycle

Keliang Liu, Dingkang Yang, Ziyun Qian, Weijie Yin, Yuchi Wang, Hongsheng Li, Jun Liu, Peng Zhai, Yang Liu, Lihua Zhang · 2025 · cs.CL · arXiv 2509.16679

8 Pith papers cite this work. Polarity classification is still indexing.

abstract

In recent years, training methods centered on Reinforcement Learning (RL) have markedly enhanced the reasoning and alignment performance of Large Language Models (LLMs), particularly in understanding human intents, following user instructions, and bolstering inferential strength. Although existing surveys offer overviews of RL-augmented LLMs, their scope is often limited, failing to provide a comprehensive summary of how RL operates across the full lifecycle of LLMs. We systematically review the theoretical and practical advancements whereby RL empowers LLMs, especially Reinforcement Learning with Verifiable Rewards (RLVR). First, we briefly introduce the basic theory of RL. Second, we thoroughly detail application strategies for RL across various phases of the LLM lifecycle, including pre-training, alignment fine-tuning, and reinforced reasoning. In particular, we emphasize that RL methods in the reinforced reasoning phase serve as a pivotal driving force for advancing model reasoning to its limits. Next, we collate existing datasets and evaluation benchmarks currently used for RL fine-tuning, spanning human-annotated datasets, AI-assisted preference data, and program-verification-style corpora. Subsequently, we review the mainstream open-source tools and training frameworks available, providing clear practical references for subsequent research. Finally, we analyse the future challenges and trends in the field of RL-enhanced LLMs. This survey aims to present researchers and practitioners with the latest developments and frontier trends at the intersection of RL and LLMs, with the goal of fostering the evolution of LLMs that are more intelligent, generalizable, and secure.

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

Beyond Entropy: Learning from Token-Level Distributional Deviations for LLM Reasoning

cs.AI · 2026-06-18 · unverdicted · novelty 7.0

ICT framework applies JS divergence to token logits to select critical tokens for selective RLVR updates, claiming 4.58% average pass@4 gains on Qwen2.5 models across seven reasoning benchmarks.

ThinkDeception: A Progressive Reinforcement Learning Framework for Interpretable Multimodal Deception Detection

cs.AI · 2026-06-17 · unverdicted · novelty 6.0

ThinkDeception introduces MLLMs, a multimodal CoT dataset, and VAC-GRPO progressive RL to convert deception detection into interpretable reasoning and claims new SOTA accuracy plus rationale quality.

TRACER: Turn-level Regret Matching with Inner Reinforcement Credit for Cooperative Multi-LLM Reasoning

cs.AI · 2026-05-27 · unverdicted · novelty 6.0

TRACER combines a controller-regret layer using regret matching for speak/skip decisions with a generation-credit layer using GSPO rewards to enable learned collaboration in multi-LLM reasoning.

Tailoring Teaching to Aptitude: Direction-Adaptive Self-Distillation for LLM Reasoning

cs.LG · 2026-05-21 · unverdicted · novelty 6.0

DASD improves math reasoning in LLMs by adaptively directing self-distillation based on per-token entropy to balance exploration and step accuracy, outperforming prior self-distillation and RLVR baselines on six benchmarks.

StaRPO: Stability-Augmented Reinforcement Policy Optimization

cs.AI · 2026-04-10 · unverdicted · novelty 5.0

StaRPO improves LLM reasoning by adding autocorrelation function and path efficiency stability metrics to RL policy optimization, yielding higher accuracy and fewer logic errors on reasoning benchmarks.

Shattering the Autoregressive Curse: Dynamic Epistemic Entropy Orchestrated Erasable Reinforcement Learning for LLMs

cs.AI · 2026-06-16 · unverdicted · novelty 4.0

E³RL uses dynamic thresholds on epistemic entropy from autoregressive cross-entropy to enable erasable RL in LLM reasoning, reporting 5.349% and 6.514% gains on AIME for 4B and 8B models over prior SOTA.

Reinforcement Learning for LLM-based Multi-Agent Systems through Orchestration Traces

cs.CL · 2026-05-04 · unverdicted · novelty 4.0

This survey organizes RL for LLM multi-agent systems into reward families, credit units, and five orchestration sub-decisions, notes the absence of explicit stopping-decision training in its paper pool, and releases a tagged corpus.

Aligning Perception, Reasoning, Modeling and Interaction: A Survey on Physical AI

cs.AI · 2025-10-06 · unverdicted · novelty 4.0

A survey of physical AI that distinguishes theoretical physics reasoning from applied understanding and synthesizes advances in symbolic reasoning, embodied systems, and generative models to advocate for physics-grounded world models.

citing papers explorer

Showing 8 of 8 citing papers.

Beyond Entropy: Learning from Token-Level Distributional Deviations for LLM Reasoning · cs.AI · 2026-06-18 · unverdicted · none · ref 13 · internal anchor
ThinkDeception: A Progressive Reinforcement Learning Framework for Interpretable Multimodal Deception Detection · cs.AI · 2026-06-17 · unverdicted · none · ref 26 · internal anchor
TRACER: Turn-level Regret Matching with Inner Reinforcement Credit for Cooperative Multi-LLM Reasoning · cs.AI · 2026-05-27 · unverdicted · none · ref 17
Tailoring Teaching to Aptitude: Direction-Adaptive Self-Distillation for LLM Reasoning · cs.LG · 2026-05-21 · unverdicted · none · ref 14
StaRPO: Stability-Augmented Reinforcement Policy Optimization · cs.AI · 2026-04-10 · unverdicted · none · ref 19
Shattering the Autoregressive Curse: Dynamic Epistemic Entropy Orchestrated Erasable Reinforcement Learning for LLMs · cs.AI · 2026-06-16 · unverdicted · none · ref 1 · internal anchor
Reinforcement Learning for LLM-based Multi-Agent Systems through Orchestration Traces · cs.CL · 2026-05-04 · unverdicted · none · ref 35
Aligning Perception, Reasoning, Modeling and Interaction: A Survey on Physical AI · cs.AI · 2025-10-06 · unverdicted · none · ref 23