This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
hub Canonical reference
Self-evolving curriculum for LLM reasoning
Canonical reference. 88% of citing Pith papers cite this work as background.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
GPS trains a small model on optimization history to predict prompt difficulty and select intermediate-difficulty diverse batches, yielding better training efficiency, final performance, and test-time allocation than baselines on reasoning benchmarks.
Failure-prefix conditioning unlocks learning from saturated reasoning problems by conditioning on failure prefixes, improving recovery from misleading early steps and matching gains from new medium-difficulty problems.
CIPO jointly optimizes standard RLVR rewards with correction samples derived from the model's own failed attempts, yielding better reasoning and self-correction on math and code benchmarks.
METIS internalizes curriculum judgment in LLM reinforcement fine-tuning by predicting within-prompt reward variance via in-context learning and jointly optimizing with a self-judgment reward, yielding superior performance and up to 67% faster convergence across math, code, and agent benchmarks.
DARE co-evolves difficulty estimation and policy in RL for LLMs to improve training efficiency, final performance, and inference speed by using tailored strategies for different difficulty levels.
CauSim turns scarce causal reasoning labels into scalable supervised data by having LLMs incrementally construct complex executable structural causal models.
SEIF creates a self-reinforcing loop in which an LLM alternately generates increasingly difficult instructions and learns to follow them better using reinforcement learning signals from its own judgments.
REC RL improves LLM code generation by automatically assessing and optimizing requirement difficulty with adaptive curriculum sampling, yielding 1.23-5.62% Pass@1 gains over baselines.
PIRL maximizes cumulative policy improvement across iterations instead of surrogate rewards and is proven aligned with final performance; PIPO implements it via retrospective verification for stable closed-loop optimization.
SCALER creates adaptive synthetic environments for RL-based LLM reasoning training that outperforms fixed-dataset baselines with more stable long-term progress.
The paper defines a Gradient Gap for RLVR policy gradients and proves a sharp step-size threshold below which training converges and above which it collapses, with predictions for length and success-rate scaling validated in simulations and on Qwen2.5-Math-7B.
Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.
A teacher-driven sampling method selects appropriately difficult questions for student models in GRPO-based RL to improve reasoning performance under fixed compute on OpenMathReasoning.
CurEvo integrates curriculum guidance into self-evolution to structure autonomous improvement of video understanding models, yielding gains on VideoQA benchmarks.
A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since DeepSeek-R1.
citing papers explorer
-
Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning
This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
-
Small Generalizable Prompt Predictive Models Can Steer Efficient RL Post-Training of Large Reasoning Models
GPS trains a small model on optimization history to predict prompt difficulty and select intermediate-difficulty diverse batches, yielding better training efficiency, final performance, and test-time allocation than baselines on reasoning benchmarks.
-
Training Reasoning Models on Saturated Problems via Failure-Prefix Conditioning
Failure-prefix conditioning unlocks learning from saturated reasoning problems by conditioning on failure prefixes, improving recovery from misleading early steps and matching gains from new medium-difficulty problems.
-
Learning from Failures: Correction-Oriented Policy Optimization with Verifiable Rewards
CIPO jointly optimizes standard RLVR rewards with correction samples derived from the model's own failed attempts, yielding better reasoning and self-correction on math and code benchmarks.
-
Internalizing Curriculum Judgment for LLM Reinforcement Fine-Tuning
METIS internalizes curriculum judgment in LLM reinforcement fine-tuning by predicting within-prompt reward variance via in-context learning and jointly optimizing with a self-judgment reward, yielding superior performance and up to 67% faster convergence across math, code, and agent benchmarks.
-
DARE: Difficulty-Adaptive Reinforcement Learning with Co-Evolved Difficulty Estimation
DARE co-evolves difficulty estimation and policy in RL for LLMs to improve training efficiency, final performance, and inference speed by using tailored strategies for different difficulty levels.
-
CauSim: Scaling Causal Reasoning with Increasingly Complex Causal Simulators
CauSim turns scarce causal reasoning labels into scalable supervised data by having LLMs incrementally construct complex executable structural causal models.
-
SEIF: Self-Evolving Reinforcement Learning for Instruction Following
SEIF creates a self-reinforcing loop in which an LLM alternately generates increasingly difficult instructions and learns to follow them better using reinforcement learning signals from its own judgments.
-
Improving LLM Code Generation via Requirement-Aware Curriculum Reinforcement Learning
REC RL improves LLM code generation by automatically assessing and optimizing requirement difficulty with adaptive curriculum sampling, yielding 1.23-5.62% Pass@1 gains over baselines.
-
Policy Improvement Reinforcement Learning
PIRL maximizes cumulative policy improvement across iterations instead of surrogate rewards and is proven aligned with final performance; PIPO implements it via retrospective verification for stable closed-loop optimization.
-
SCALER:Synthetic Scalable Adaptive Learning Environment for Reasoning
SCALER creates adaptive synthetic environments for RL-based LLM reasoning training that outperforms fixed-dataset baselines with more stable long-term progress.
-
On the optimization dynamics of RLVR: Gradient gap and step size thresholds
The paper defines a Gradient Gap for RLVR policy gradients and proves a sharp step-size threshold below which training converges and above which it collapses, with predictions for length and success-rate scaling validated in simulations and on Qwen2.5-Math-7B.
-
The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.
-
Goldilocks RL: Tuning Task Difficulty to Escape Sparse Rewards for Reasoning
A teacher-driven sampling method selects appropriately difficult questions for student models in GRPO-based RL to improve reasoning performance under fixed compute on OpenMathReasoning.
-
CurEvo: Curriculum-Guided Self-Evolution for Video Understanding
CurEvo integrates curriculum guidance into self-evolution to structure autonomous improvement of video understanding models, yielding gains on VideoQA benchmarks.
-
A Survey of Reinforcement Learning for Large Reasoning Models
A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since DeepSeek-R1.