Not all steps are informative: On the linearity of LLMs’ RLVR training.arXiv preprint arXiv:2601.04537

Tianle Wang, Zhongyuan Wu, Shenghao Jin, Hao Xu, Wei Chen, Ning Miao · 2026 · cs.LG · arXiv 2601.04537

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

open full Pith review browse 3 citing papers arXiv PDF

abstract

Reinforcement learning with verifiable rewards (RLVR) has driven significant performance gains in reasoning-oriented large language models (LLMs), yet its internal training dynamics remain largely a black box. In this work, we perform a comprehensive trajectory-level analysis of RLVR and uncover a striking regularity: across various model families, RL algorithms, and training configurations, RLVR consistently enters a robust linear regime, where both parameter weights and output log-probabilities, measured rigorously via teacher-forced evaluation, evolve in a highly linear manner ($R^2 > 0.7$). Through controlled experiments and theoretical analysis, we demonstrate that this linearity is not a coincidence, but stems from the high-variance, noisy nature of RLVR training signals, which act as a low-pass filter to concentrate optimization along a stable, low-dimensional drift. Moreover, we show that this linear structure is not merely descriptive but powerfully predictive and actionable. Specifically, weight-space extrapolation matches the performance of standard RL optimization while achieving a 6.1x training speedup through periodic re-grounding. Meanwhile, output-space extrapolation serves as a lightweight intervention that effectively bypasses late-stage model collapse, consistently outperforming standard RL across mathematical and coding benchmarks, with an average performance improvement of 4.2%. Our code is available at https://github.com/Miaow-Lab/RLVR-Linearity.

representative citing papers

Extrapolative Weight Averaging Reveals Correctness-Efficiency Frontiers in Code RL

cs.LG · 2026-05-27 · conditional · novelty 7.0

Extrapolative weight averaging of RL checkpoints trained under nested unit-test coverage extends a correctness-efficiency frontier and boosts ensemble pass rates in code generation across model scales and inference modes.

You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories

cs.LG · 2026-05-20 · unverdicted · novelty 6.0

RELEX extrapolates LLM checkpoints from short RLVR prefixes by projecting deltas onto a rank-1 subspace and fitting a linear trend, matching full training performance at 15% of the steps.

Alignment Dynamics in LLM Fine-Tuning

cs.LG · 2026-05-18 · unverdicted · novelty 6.0

The paper introduces a dynamical model that decomposes alignment updates in LLM fine-tuning into rebound and driving forces and predicts a rehearsal priming effect.

citing papers explorer

Showing 3 of 3 citing papers.

Extrapolative Weight Averaging Reveals Correctness-Efficiency Frontiers in Code RL cs.LG · 2026-05-27 · conditional · none · ref 41 · internal anchor
Extrapolative weight averaging of RL checkpoints trained under nested unit-test coverage extends a correctness-efficiency frontier and boosts ensemble pass rates in code generation across model scales and inference modes.
You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories cs.LG · 2026-05-20 · unverdicted · none · ref 8 · internal anchor
RELEX extrapolates LLM checkpoints from short RLVR prefixes by projecting deltas onto a rank-1 subspace and fitting a linear trend, matching full training performance at 15% of the steps.
Alignment Dynamics in LLM Fine-Tuning cs.LG · 2026-05-18 · unverdicted · none · ref 38 · internal anchor
The paper introduces a dynamical model that decomposes alignment updates in LLM fine-tuning into rebound and driving forces and predicts a rehearsal priming effect.

Not all steps are informative: On the linearity of LLMs’ RLVR training.arXiv preprint arXiv:2601.04537

fields

years

verdicts

representative citing papers

citing papers explorer