Linear Dynamics in the RLVR Training of Large Language Models

Hao Xu; Jiayu Liu; Ning Miao; Shenghao Jin; Tianle Wang; Wei Chen; Zhongyuan Wu

arxiv: 2601.04537 · v3 · pith:UH5K63AZnew · submitted 2026-01-08 · 💻 cs.LG · cs.CL

Tianle Wang, Jiayu Liu, Zhongyuan Wu, Shenghao Jin, Wei Chen, Hao Xu, Ning Miao

Pith reviewed 2026-05-22 12:34 UTC · model grok-4.3

keywords RLVR · linear dynamics · LLM training · reinforcement learning · weight extrapolation · model collapse · training speedup · verifiable rewards

The pith

RLVR training enters a robust linear regime where LLM weights and log-probabilities evolve linearly due to noisy signals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that reinforcement learning with verifiable rewards drives large language models into a consistent linear regime during training. Both the model parameters and the output log-probabilities change in a straight-line fashion with high reliability across different models and setups. This regularity comes from the noisy training signals that filter out fluctuations and lock the updates onto one steady direction. If true, the finding turns training dynamics from opaque to something that can be measured, predicted, and sped up with simple extrapolations.

Core claim

Across various model families, RL algorithms, and training configurations, RLVR consistently enters a robust linear regime, where both parameter weights and output log-probabilities, measured rigorously via teacher-forced evaluation, evolve in a highly linear manner (R² > 0.7). This linearity stems from the high-variance, noisy nature of RLVR training signals, which act as a low-pass filter to concentrate optimization along a stable, low-dimensional drift. The linear structure proves predictive: weight-space extrapolation matches standard RL performance at 6.1x speedup through periodic re-grounding, while output-space extrapolation bypasses late-stage model collapse and delivers a 4.2% average performance improvement across mathematical and coding benchmarks.

What carries the argument

The high-variance noisy RLVR training signals that function as a low-pass filter concentrating updates along a stable low-dimensional linear drift.
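
One simple way to read the low-pass intuition is sketched below as a toy simulation. It is not the paper's theoretical analysis: the update model (a fixed drift plus zero-mean Gaussian noise), the dimension, the noise scale, and the function names are all assumptions chosen for illustration. The point it shows is that individual steps can be dominated by noise while the accumulated trajectory along the drift direction stays close to a straight line.

```python
import numpy as np

def simulate_noisy_updates(steps=2000, dim=8, noise_std=3.0, seed=0):
    """Toy updates: a fixed unit drift plus much larger zero-mean noise."""
    rng = np.random.default_rng(seed)
    drift = np.zeros(dim)
    drift[0] = 1.0  # hypothetical drift direction, chosen for illustration
    noise = noise_std * rng.standard_normal((steps, dim))
    return drift + noise, drift

def r_squared(t, y):
    """R^2 of an ordinary least-squares line fitted to y as a function of t."""
    slope, intercept = np.polyfit(t, y, 1)
    residual = y - (slope * t + intercept)
    return 1.0 - residual.var() / y.var()

updates, drift = simulate_noisy_updates()
trajectory = np.cumsum(updates, axis=0)  # accumulated toy "weights"
t = np.arange(1, len(trajectory) + 1, dtype=float)

# Individual steps are unreliable: a sizeable share point against the drift.
frac_against_drift = float(np.mean(updates @ drift < 0))
# The accumulated trajectory along the drift is nevertheless close to a line.
r2_along_drift = r_squared(t, trajectory @ drift)

print(f"steps opposing the drift: {frac_against_drift:.2f}")
print(f"R^2 along the drift:      {r2_along_drift:.3f}")
```

In this toy, summing the updates attenuates the zero-mean noise relative to the constant drift, which is the sense in which accumulation behaves like a low-pass filter. Whether real RLVR updates follow this model is exactly what the paper's controlled experiments are meant to establish.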

If this is right

Weight-space extrapolation achieves standard RL performance while delivering a 6.1x training speedup through periodic re-grounding.
Output-space extrapolation prevents late-stage model collapse and raises average scores by 4.2% on mathematical and coding benchmarks.
The observed linearity supplies a concrete way to monitor and intervene in RLVR training instead of treating it as a black box.
The same linear structure holds across multiple model families, RL algorithms, and hyper-parameter choices.
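
The weight-space idea in the list above can be sketched as a per-parameter least-squares line fitted to a few checkpoints and then evaluated at a later step. This is a minimal sketch under stated assumptions, not the authors' procedure: the function names, the flattened-checkpoint layout, and the synthetic data are hypothetical, and the periodic re-grounding described in the paper is not modeled here.

```python
import numpy as np

def fit_linear_trajectory(steps, checkpoints):
    """Fit w(t) ~ w0 + v * t independently for every parameter.

    steps:       sequence of K training-step indices
    checkpoints: K flattened weight vectors, each of length P (assumed layout)
    """
    steps = np.asarray(steps, dtype=float)
    weights = np.asarray(checkpoints, dtype=float)      # shape (K, P)
    design = np.stack([np.ones_like(steps), steps], axis=1)  # shape (K, 2)
    coef, *_ = np.linalg.lstsq(design, weights, rcond=None)  # shape (2, P)
    return coef[0], coef[1]  # intercept w0 and per-step velocity v

def extrapolate_weights(steps, checkpoints, target_step):
    """Predict the weights at target_step from the fitted line."""
    w0, v = fit_linear_trajectory(steps, checkpoints)
    return w0 + v * target_step

# Synthetic check: weights that move exactly linearly are recovered.
rng = np.random.default_rng(0)
w_init = rng.standard_normal(5)
direction = rng.standard_normal(5)
steps = [0, 10, 20, 30]
checkpoints = [w_init + s * direction for s in steps]
predicted = extrapolate_weights(steps, checkpoints, target_step=100)
```

On real checkpoints the fit would only be approximate, so the extrapolated weights would need to be evaluated on held-out benchmarks before being trusted.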

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the linear regime proves general, simpler linear approximations could replace full gradient steps in parts of the training loop.
Similar linearity might appear in other reinforcement-learning setups for language models once noise levels are high enough.
One could test whether deliberately increasing signal noise early in training accelerates entry into the linear regime.
The low-dimensional drift view suggests that RLVR mainly moves the model along a single effective direction rather than exploring a high-dimensional space.

Load-bearing premise

The high-variance, noisy nature of RLVR training signals acts as a low-pass filter that concentrates optimization along a stable low-dimensional drift.

What would settle it

An RLVR training run in which the linear regression fit to the trajectory of weights or teacher-forced log-probabilities yields R² below 0.7 on average across checkpoints, or in which the proposed extrapolations fail to match or exceed baseline performance.
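
The check described above can be sketched as follows. The 0.7 threshold comes from the paper; everything else is an assumption for illustration, including the function names, the choice to average R² over tracked quantities, and the synthetic trajectories. The source does not specify over which steps, parameters, or tokens the fit is computed.

```python
import numpy as np

R2_THRESHOLD = 0.7  # threshold quoted in the paper's abstract

def linear_fit_r2(steps, values):
    """R^2 of a least-squares line fitted to one tracked quantity."""
    steps = np.asarray(steps, dtype=float)
    values = np.asarray(values, dtype=float)
    slope, intercept = np.polyfit(steps, values, 1)
    ss_res = np.sum((values - (slope * steps + intercept)) ** 2)
    ss_tot = np.sum((values - values.mean()) ** 2)
    if ss_tot == 0.0:
        return float("nan")  # constant trajectory: R^2 is undefined
    return float(1.0 - ss_res / ss_tot)

def mean_r2(steps, trajectories):
    """Average R^2 over several tracked quantities (assumed aggregation)."""
    return float(np.mean([linear_fit_r2(steps, tr) for tr in trajectories]))

steps = np.arange(10)
linear_traj = 2.0 * steps - 5.0            # perfectly linear in the step index
oscillating_traj = (-1.0) ** steps         # alternates between +1 and -1

r2_linear = linear_fit_r2(steps, linear_traj)
r2_oscillating = linear_fit_r2(steps, oscillating_traj)
passes = mean_r2(steps, [linear_traj, linear_traj]) > R2_THRESHOLD
```

A trajectory such as `oscillating_traj`, whose fitted line explains almost none of the variance, is the kind of outcome that would count against the linear-regime claim.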

Original abstract

Reinforcement learning with verifiable rewards (RLVR) has driven significant performance gains in reasoning-oriented large language models (LLMs), yet its internal training dynamics remain largely a black box. In this work, we perform a comprehensive trajectory-level analysis of RLVR and uncover a striking regularity: across various model families, RL algorithms, and training configurations, RLVR consistently enters a robust linear regime, where both parameter weights and output log-probabilities, measured rigorously via teacher-forced evaluation, evolve in a highly linear manner ($R^2 > 0.7$). Through controlled experiments and theoretical analysis, we demonstrate that this linearity is not a coincidence, but stems from the high-variance, noisy nature of RLVR training signals, which act as a low-pass filter to concentrate optimization along a stable, low-dimensional drift. Moreover, we show that this linear structure is not merely descriptive but powerfully predictive and actionable. Specifically, weight-space extrapolation matches the performance of standard RL optimization while achieving a 6.1x training speedup through periodic re-grounding. Meanwhile, output-space extrapolation serves as a lightweight intervention that effectively bypasses late-stage model collapse, consistently outperforming standard RL across mathematical and coding benchmarks, with an average performance improvement of 4.2%. Our code is available at https://github.com/Miaow-Lab/RLVR-Linearity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims that across model families, RL algorithms, and training configurations, RLVR training of LLMs enters a robust linear regime in which both parameter weights and teacher-forced output log-probabilities evolve linearly with R² > 0.7. The authors attribute this regularity to the high-variance noisy RLVR signals acting as a low-pass filter that concentrates optimization along a stable low-dimensional drift. They further show that this linear structure is predictive: weight-space extrapolation matches standard RL performance at 6.1× speedup via periodic re-grounding, while output-space extrapolation bypasses late-stage collapse and yields a 4.2% average gain on math and coding benchmarks. Code is released at https://github.com/Miaow-Lab/RLVR-Linearity.

Significance. If the reported linearity and its causal attribution hold, the work supplies both a mechanistic explanation for RLVR dynamics and immediately actionable extrapolation techniques that could materially reduce training cost while improving final performance. The public release of code is a clear strength that supports reproducibility and follow-up verification.

major comments (1)

[Controlled experiments and theoretical analysis sections] The central explanatory claim—that linearity arises specifically because high-variance noisy RLVR signals act as a low-pass filter—lacks a direct ablation that holds model, data, optimizer, and loss formulation fixed while varying only signal variance (e.g., SFT versus RLVR on the same verifiable-reward data, or PPO with reduced advantage variance). Without this contrast, the observed R² > 0.7 could be an artifact of gradient descent on LLMs rather than RLVR-specific noise, weakening both the theoretical analysis and the justification for the extrapolation interventions.

minor comments (2)

[Abstract and results] The abstract and results sections report consistent R² > 0.7 but omit details on statistical tests, error bars, data exclusion criteria, or the precise definition of the linear fit (e.g., over which training steps or tokens). Adding these would strengthen the claim of robustness across setups.
[Methods] Notation for teacher-forced log-probability evaluation should be clarified with an explicit equation or pseudocode, as it is central to the output-space linearity measurements.
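
For orientation, the standard definition of a teacher-forced sequence log-probability is written out below. The notation is an assumption, not the paper's: the symbols for the prompt, the reference response, and the step-indexed parameters are chosen here for illustration, and the paper may define the quantity per token or per sequence.

```latex
% Assumed notation: x is the prompt, y = (y_1, ..., y_T) is a fixed reference
% response fed back as input (not sampled), and \theta_s denotes the model
% parameters at training step s.
\[
  \log \pi_{\theta_s}(y \mid x)
  \;=\; \sum_{t=1}^{T} \log \pi_{\theta_s}\!\left(y_t \mid x,\, y_{<t}\right)
\]
% Output-space linearity would then mean that, for some intercept a and slope b,
\[
  \log \pi_{\theta_s}(y \mid x) \;\approx\; a + b\, s .
\]
```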

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for identifying a point that can strengthen the causal interpretation of our results. We address the major comment below.

Point-by-point responses

Referee: [Controlled experiments and theoretical analysis sections] The central explanatory claim—that linearity arises specifically because high-variance noisy RLVR signals act as a low-pass filter—lacks a direct ablation that holds model, data, optimizer, and loss formulation fixed while varying only signal variance (e.g., SFT versus RLVR on the same verifiable-reward data, or PPO with reduced advantage variance). Without this contrast, the observed R² > 0.7 could be an artifact of gradient descent on LLMs rather than RLVR-specific noise, weakening both the theoretical analysis and the justification for the extrapolation interventions.

Authors: We agree that the current set of controlled experiments, while spanning multiple RL algorithms and noise regimes, does not include the most direct isolation of signal variance requested. Existing comparisons (across PPO, GRPO, and other variants) vary advantage estimation and reward structure but do not hold the underlying loss and data identical to an SFT baseline on the same verifiable-reward corpus. We will add the suggested ablations (SFT versus RLVR on identical data, and PPO runs with explicitly reduced advantage variance) in the revised controlled-experiments section. The new results will be used to update the theoretical analysis and to clarify the extent to which the observed linearity is attributable to RLVR-specific noise rather than generic gradient descent on LLMs. These additions will also reinforce the motivation for the extrapolation techniques.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; linearity reported as empirical observation with independent extrapolation tests

full rationale

The paper presents the linear regime (R² > 0.7 in weights and teacher-forced log-probabilities) as a discovered empirical regularity across model families, RL algorithms, and configurations. The attribution to high-variance RLVR signals as a low-pass filter is supported by controlled experiments and theoretical analysis rather than by redefining the observed linearity in terms of itself. Weight-space and output-space extrapolation are then validated by direct performance comparison against standard RL optimization on mathematical and coding benchmarks, yielding measurable speedups and gains (6.1x and +4.2%). No equations or steps reduce a claimed prediction to a fitted parameter by construction, nor does any load-bearing premise collapse to a self-citation chain. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the empirical observation of linearity and the domain assumption that RLVR reward signals are high-variance and noisy; no new entities are postulated and no explicit free parameters are introduced in the abstract.

axioms (1)

domain assumption: RLVR training signals are high-variance and noisy and therefore act as a low-pass filter.
Invoked to explain why the observed trajectories remain linear rather than chaotic.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Extrapolative Weight Averaging Reveals Correctness-Efficiency Frontiers in Code RL
cs.LG · 2026-05 · conditional · novelty 7.0

Extrapolative weight averaging of RL checkpoints trained under nested unit-test coverage extends a correctness-efficiency frontier and boosts ensemble pass rates in code generation across model scales and inference modes.

You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories
cs.LG · 2026-05 · unverdicted · novelty 6.0

RELEX extrapolates LLM checkpoints from short RLVR prefixes by projecting deltas onto a rank-1 subspace and fitting a linear trend, matching full training performance at 15% of the steps.

Alignment Dynamics in LLM Fine-Tuning
cs.LG · 2026-05 · unverdicted · novelty 6.0

The paper introduces a dynamical model that decomposes alignment updates in LLM fine-tuning into rebound and driving forces and predicts a rehearsal priming effect.