A technical survey of reinforcement learning techniques for large language models

Saksham Sahai Srivastava, Vaneet Aggarwal · 2025 · arXiv 2507.04136

8 Pith papers cite this work. Polarity classification is still indexing.

8 Pith papers citing it

read on arXiv browse 8 citing papers

citation-role summary

background 3

citation-polarity summary

background 2 unclear 1

representative citing papers

Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

cs.LG · 2026-04-08 · unverdicted · novelty 7.0

This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.

Rubrics to Tokens: Bridging Response-level Rubrics and Token-level Rewards in Instruction Following Tasks

cs.CL · 2026-04-03 · unverdicted · novelty 6.0

RTT bridges response-level rubrics to token-level rewards via a relevance discriminator and intra-sample group normalization, yielding higher instruction and rubric accuracy than baselines.

Understanding and Exploiting Weight Update Sparsity for Communication-Efficient Distributed RL

cs.LG · 2026-02-03 · unverdicted · novelty 6.0

PULSE exploits BF16-invisible sparsity in weight updates to enable over 100x lower communication in distributed RL post-training via compute-visible sparsification.

The Geometric Reasoner: Manifold-Informed Latent Foresight Search for Long-Context Reasoning

cs.LG · 2026-01-25 · unverdicted · novelty 6.0

TGR performs manifold-informed latent foresight search to boost trajectory coverage in long-context reasoning tasks by up to 13 AUC points with minimal overhead.

Auditing Data Membership in Reinforcement Learning With Verifiable Rewards

cs.CR · 2025-11-18 · unverdicted · novelty 6.0

DIBA detects membership of prompts in RLVR training by measuring reward success changes and policy behavioral drift between pre- and post-RLVR model checkpoints.

The Landscape of Agentic Reinforcement Learning for LLMs: A Survey

cs.AI · 2025-09-02 · accept · novelty 6.0

Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.

PEER: Unified Process-Outcome Reinforcement Learning for Structured Empathetic Reasoning

cs.CL · 2025-08-13 · unverdicted · novelty 6.0

PEER applies GRPO reinforcement learning with a unified process-outcome reward model to structured empathetic reasoning steps on the SER dataset, yielding gains in empathy, strategy alignment, and human-likeness.

Rethinking Agentic Reinforcement Learning In Large Language Models

cs.AI · 2026-04-30 · unverdicted · novelty 3.0 · 3 refs

The paper reviews conceptual foundations, methodological innovations, effective designs, critical challenges, and future directions for LLM-based Agentic Reinforcement Learning.

citing papers explorer

Showing 8 of 8 citing papers.

Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning cs.LG · 2026-04-08 · unverdicted · none · ref 104
This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
Rubrics to Tokens: Bridging Response-level Rubrics and Token-level Rewards in Instruction Following Tasks cs.CL · 2026-04-03 · unverdicted · none · ref 15
RTT bridges response-level rubrics to token-level rewards via a relevance discriminator and intra-sample group normalization, yielding higher instruction and rubric accuracy than baselines.
Understanding and Exploiting Weight Update Sparsity for Communication-Efficient Distributed RL cs.LG · 2026-02-03 · unverdicted · none · ref 8
PULSE exploits BF16-invisible sparsity in weight updates to enable over 100x lower communication in distributed RL post-training via compute-visible sparsification.
The Geometric Reasoner: Manifold-Informed Latent Foresight Search for Long-Context Reasoning cs.LG · 2026-01-25 · unverdicted · none · ref 19
TGR performs manifold-informed latent foresight search to boost trajectory coverage in long-context reasoning tasks by up to 13 AUC points with minimal overhead.
Auditing Data Membership in Reinforcement Learning With Verifiable Rewards cs.CR · 2025-11-18 · unverdicted · none · ref 10
DIBA detects membership of prompts in RLVR training by measuring reward success changes and policy behavioral drift between pre- and post-RLVR model checkpoints.
The Landscape of Agentic Reinforcement Learning for LLMs: A Survey cs.AI · 2025-09-02 · accept · none · ref 4
Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.
PEER: Unified Process-Outcome Reinforcement Learning for Structured Empathetic Reasoning cs.CL · 2025-08-13 · unverdicted · none · ref 8
PEER applies GRPO reinforcement learning with a unified process-outcome reward model to structured empathetic reasoning steps on the SER dataset, yielding gains in empathy, strategy alignment, and human-likeness.
Rethinking Agentic Reinforcement Learning In Large Language Models cs.AI · 2026-04-30 · unverdicted · none · ref 81 · 3 links
The paper reviews conceptual foundations, methodological innovations, effective designs, critical challenges, and future directions for LLM-based Agentic Reinforcement Learning.

A technical survey of reinforcement learning techniques for large language models

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer