Scaling Behaviors of LLM Reinforcement Learning Post-Training: An Empirical Study in Mathematical Reasoning
Abstract
While scaling laws for large language models (LLMs) during pre-training have been extensively studied, their behavior under reinforcement learning (RL) post-training remains largely unexplored. This paper presents a systematic empirical investigation of scaling behaviors in RL-based post-training, with a particular focus on mathematical reasoning. Based on experiments across the full Qwen2.5 dense model series (0.5B to 72B), we characterize how model scale, data volume, and computational budget interact to shape performance. Our analysis leads to four key findings:
1. Larger models consistently exhibit superior learning efficiency in terms of both compute and data.
2. The relationship between test loss, compute, and data can be modeled by a predictive power law that is robust across both base and instruction-tuned models.
3. Although larger models learn more efficiently, the analytical learning-efficiency term k(N) in the power law reveals a latent saturation trend as model size continues to increase.
4. In data-constrained regimes, repeated reuse of high-quality data proves highly effective: final performance is governed primarily by the total number of optimization steps rather than by the uniqueness of samples.
Collectively, these results provide a principled foundation and practical guidelines for efficiently scaling the reasoning capabilities of LLMs through RL post-training.
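The abstract does not state the exact functional form of the power law, so the sketch below is only illustrative: it assumes test loss decays as a power of RL compute, with an efficiency coefficient standing in for the model-size-dependent term k(N), and fits that form to hypothetical (compute, loss) measurements with SciPy. The parameterization L(C) = loss_floor + (k / C)^alpha, the function name test_loss, and the data points are all assumptions, not the paper's actual formulation.

```python
# Minimal sketch of fitting the kind of power law described in the abstract.
# The paper's exact form L(C, N) is not given here; this one is assumed.
import numpy as np
from scipy.optimize import curve_fit

def test_loss(compute, k, alpha, loss_floor):
    """Assumed form: L(C) = loss_floor + (k / C)**alpha.

    k stands in for the learning-efficiency term k(N): at a fixed compute
    budget C, a smaller k means a lower test loss.
    """
    return loss_floor + (k / compute) ** alpha

# Hypothetical (RL compute, test loss) measurements for one model size N.
compute = np.array([1e18, 3e18, 1e19, 3e19, 1e20])
loss = np.array([0.92, 0.71, 0.55, 0.46, 0.41])

(k, alpha, loss_floor), _ = curve_fit(
    test_loss, compute, loss,
    p0=[1e18, 0.5, 0.3],
    bounds=([0.0, 0.0, 0.0], [np.inf, 2.0, 1.0]),
)
print(f"k = {k:.3g}, alpha = {alpha:.3g}, loss floor = {loss_floor:.3g}")

# Repeating the fit for each model size N yields a series of k(N) values;
# the paper's third finding is that k(N) improves with scale but shows a
# latent saturation trend as N grows.
```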
Forward citations
Cited by 6 Pith papers
- Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
  RL training on more expressive logical tasks follows a steeper power-law scaling with reasoning depth and transfers more efficiently to math and reasoning benchmarks.
- Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
  RL training compute for logical reasoning follows a power law in proof depth whose exponent rises with logic expressiveness, and more expressive training yields larger gains on downstream benchmarks.
- Geometry Conflict: Explaining and Controlling Forgetting in LLM Continual Post-Training
  Forgetting in LLM continual post-training is a geometry conflict between task-induced covariance structures and the evolving model state, controlled by gating Wasserstein barycenter merging on measured conflict.
- Agentic Frameworks for Reasoning Tasks: An Empirical Study
  An empirical evaluation of 22 agentic frameworks on BBH, GSM8K, and ARC benchmarks shows stable performance in 12 frameworks but highlights orchestration failures and weaker mathematical reasoning.
- Learning from Less: Measuring the Effectiveness of RLVR in Low Data and Compute Regimes
  Mixed-complexity procedural datasets provide up to 5x sample efficiency for RLVR on small models in low-data regimes, with low-to-high complexity generalization observed across counting, graph, and spatial tasks.
- Continued AI Scaling Requires Repeated Efficiency Doublings
  Continued AI scaling remains feasible only if efficiency doublings recur repeatedly to keep logical compute affordable.