VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks
3 Pith papers cite this work. Polarity classification is still indexing.
Citation summary
Years: 2025 (3)
Roles: baseline (1)
Polarities: baseline (1)
Representative citing papers
A simple PPO-based RL training pipeline on base models scales reasoning performance and response length, outperforming prior work on math and science benchmarks with one-tenth the training steps.
HAPO is a new token-level policy optimization method for LLMs that continuously adapts four optimization stages using entropy, claiming consistent gains over DAPO on math, code, and logic tasks.
citing papers explorer
- ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models
  Prolonged RL training with KL control and reference policy resetting enables LLMs to develop novel reasoning strategies inaccessible to base models even under extensive sampling.
- Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model
  A simple PPO-based RL training pipeline on base models scales reasoning performance and response length, outperforming prior work on math and science benchmarks with one-tenth the training steps.
- Heterogeneous Adaptive Policy Optimization: Tailoring Optimization to Every Token's Nature
  HAPO is a new token-level policy optimization method for LLMs that continuously adapts four optimization stages using entropy, claiming consistent gains over DAPO on math, code, and logic tasks.