Linear Dynamics in the RLVR Training of Large Language Models
Pith reviewed 2026-05-22 12:34 UTC · model grok-4.3
The pith
RLVR training enters a robust linear regime where LLM weights and log-probabilities evolve linearly due to noisy signals.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across various model families, RL algorithms, and training configurations, RLVR consistently enters a robust linear regime, where both parameter weights and output log-probabilities, measured rigorously via teacher-forced evaluation, evolve in a highly linear manner (R² > 0.7). This linearity stems from the high-variance, noisy nature of RLVR training signals, which act as a low-pass filter to concentrate optimization along a stable, low-dimensional drift. The linear structure proves predictive: weight-space extrapolation matches standard RL performance at 6.1x speedup through periodic re-grounding, while output-space extrapolation bypasses late-stage model collapse and delivers a 4.2% average improvement on mathematical and coding benchmarks.
What carries the argument
The high-variance noisy RLVR training signals that function as a low-pass filter concentrating updates along a stable low-dimensional linear drift.
If this is right
- Weight-space extrapolation achieves standard RL performance while delivering a 6.1x training speedup through periodic re-grounding.
- Output-space extrapolation prevents late-stage model collapse and raises average scores by 4.2% on mathematical and coding benchmarks.
- The observed linearity supplies a concrete way to monitor and intervene in RLVR training instead of treating it as a black box.
- The same linear structure holds across multiple model families, RL algorithms, and hyper-parameter choices.
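As an illustration of the first point, weight-space extrapolation with periodic re-grounding might look roughly like the sketch below. It is an assumption-laden toy, not the paper's procedure: `extrapolate_weights`, `train_with_regrounding`, the callable `rl_step`, and the schedule parameters `real_steps` and `jump` are all hypothetical names, and the model is reduced to a flat NumPy vector.

```python
import numpy as np

def extrapolate_weights(steps, checkpoints, target_step):
    """Fit one least-squares line per parameter and evaluate it at target_step."""
    steps = np.asarray(steps, dtype=float)
    checkpoints = np.asarray(checkpoints, dtype=float)  # shape (T, D), flattened weights
    # Design matrix [step, 1]: every parameter gets its own slope and intercept.
    design = np.stack([steps, np.ones_like(steps)], axis=1)
    coef, *_ = np.linalg.lstsq(design, checkpoints, rcond=None)
    slope, intercept = coef
    return slope * target_step + intercept

def train_with_regrounding(weights, rl_step, rounds, real_steps, jump):
    """Alternate a few real RL updates with a linear jump ahead.

    rl_step is a hypothetical callable: weights -> weights after one RL update.
    """
    step = 0
    for _ in range(rounds):
        steps, history = [step], [weights.copy()]
        for _ in range(real_steps):  # re-grounding: real updates refresh the trend
            weights = rl_step(weights)
            step += 1
            steps.append(step)
            history.append(weights.copy())
        # Skip `jump` updates by following the fitted line instead of training.
        weights = extrapolate_weights(steps, history, step + jump)
        step += jump
    return weights, step
```

In this toy schedule only `real_steps` out of every `real_steps + jump` steps are actual RL updates; how closely that matches the reported 6.1x speedup depends on the paper's real schedule, which is not described here.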
Where Pith is reading between the lines
- If the linear regime proves general, simpler linear approximations could replace full gradient steps in parts of the training loop.
- Similar linearity might appear in other reinforcement-learning setups for language models once noise levels are high enough.
- One could test whether deliberately increasing signal noise early in training accelerates entry into the linear regime.
- The low-dimensional drift view suggests that RLVR mainly moves the model along a single effective direction rather than exploring a high-dimensional space.
Load-bearing premise
The high-variance, noisy nature of RLVR training signals acts as a low-pass filter that concentrates optimization along a stable low-dimensional drift.
What would settle it
An RLVR training run in which the linear regression fit to the trajectory of weights or teacher-forced log-probabilities yields R² below 0.7 on average across checkpoints, or in which the proposed extrapolations fail to match or exceed baseline performance.
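A minimal sketch of how such a check could be run, assuming checkpointed quantities (flattened weights or teacher-forced log-probabilities) are stacked into a NumPy array with one row per checkpoint. The function name `trajectory_r2` and the per-coordinate fit against training step are illustrative choices; the summary does not specify how the paper defines its fit.

```python
import numpy as np

def trajectory_r2(steps, values):
    """Per-coordinate R^2 of a least-squares line fitted against training step.

    steps:  shape (T,)   checkpoint steps
    values: shape (T, D) tracked quantities, one row per checkpoint
    """
    steps = np.asarray(steps, dtype=float)
    values = np.asarray(values, dtype=float)
    # Design matrix [step, 1]: each column of `values` gets its own line.
    design = np.stack([steps, np.ones_like(steps)], axis=1)
    coef, *_ = np.linalg.lstsq(design, values, rcond=None)
    residual = values - design @ coef
    ss_res = (residual ** 2).sum(axis=0)
    ss_tot = ((values - values.mean(axis=0)) ** 2).sum(axis=0)
    # Coordinates that never move have ss_tot == 0; report NaN instead of dividing by zero.
    safe_tot = np.where(ss_tot > 0, ss_tot, 1.0)
    return np.where(ss_tot > 0, 1.0 - ss_res / safe_tot, np.nan)

# Synthetic trajectory: constant drift plus noise, standing in for real checkpoints.
rng = np.random.default_rng(0)
steps = np.arange(20)
values = np.outer(steps, [0.5, -0.2]) + rng.normal(scale=0.1, size=(20, 2))
print(np.nanmean(trajectory_r2(steps, values)))
```

Averaging the returned values with `np.nanmean` and comparing against 0.7 gives one concrete version of the falsification test described above.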
Original abstract
Reinforcement learning with verifiable rewards (RLVR) has driven significant performance gains in reasoning-oriented large language models (LLMs), yet its internal training dynamics remain largely a black box. In this work, we perform a comprehensive trajectory-level analysis of RLVR and uncover a striking regularity: across various model families, RL algorithms, and training configurations, RLVR consistently enters a robust linear regime, where both parameter weights and output log-probabilities, measured rigorously via teacher-forced evaluation, evolve in a highly linear manner ($R^2 > 0.7$). Through controlled experiments and theoretical analysis, we demonstrate that this linearity is not a coincidence, but stems from the high-variance, noisy nature of RLVR training signals, which act as a low-pass filter to concentrate optimization along a stable, low-dimensional drift. Moreover, we show that this linear structure is not merely descriptive but powerfully predictive and actionable. Specifically, weight-space extrapolation matches the performance of standard RL optimization while achieving a 6.1x training speedup through periodic re-grounding. Meanwhile, output-space extrapolation serves as a lightweight intervention that effectively bypasses late-stage model collapse, consistently outperforming standard RL across mathematical and coding benchmarks, with an average performance improvement of 4.2%. Our code is available at https://github.com/Miaow-Lab/RLVR-Linearity.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that across model families, RL algorithms, and training configurations, RLVR training of LLMs enters a robust linear regime in which both parameter weights and teacher-forced output log-probabilities evolve linearly with R² > 0.7. The authors attribute this regularity to the high-variance noisy RLVR signals acting as a low-pass filter that concentrates optimization along a stable low-dimensional drift. They further show that this linear structure is predictive: weight-space extrapolation matches standard RL performance at 6.1× speedup via periodic re-grounding, while output-space extrapolation bypasses late-stage collapse and yields a 4.2% average gain on math and coding benchmarks. Code is released at https://github.com/Miaow-Lab/RLVR-Linearity.
Significance. If the reported linearity and its causal attribution hold, the work supplies both a mechanistic explanation for RLVR dynamics and immediately actionable extrapolation techniques that could materially reduce training cost while improving final performance. The public release of code is a clear strength that supports reproducibility and follow-up verification.
major comments (1)
- [Controlled experiments and theoretical analysis sections] The central explanatory claim—that linearity arises specifically because high-variance noisy RLVR signals act as a low-pass filter—lacks a direct ablation that holds model, data, optimizer, and loss formulation fixed while varying only signal variance (e.g., SFT versus RLVR on the same verifiable-reward data, or PPO with reduced advantage variance). Without this contrast, the observed R² > 0.7 could be an artifact of gradient descent on LLMs rather than RLVR-specific noise, weakening both the theoretical analysis and the justification for the extrapolation interventions.
minor comments (2)
- [Abstract and results] The abstract and results sections report consistent R² > 0.7 but omit details on statistical tests, error bars, data exclusion criteria, or the precise definition of the linear fit (e.g., over which training steps or tokens). Adding these would strengthen the claim of robustness across setups.
- [Methods] Notation for teacher-forced log-probability evaluation should be clarified with an explicit equation or pseudocode, as it is central to the output-space linearity measurements.
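One plausible reading of the teacher-forced measurement, offered as an assumption and not as the paper's definition: the model is fed a fixed reference sequence, and the log-probability it assigns to each reference token is read off the logits. The sketch below uses a hypothetical helper `teacher_forced_logprobs` on plain NumPy arrays; producing the logits from an actual model is left out.

```python
import numpy as np

def teacher_forced_logprobs(logits, target_ids):
    """Log-probability of each reference token under teacher forcing.

    logits:     shape (T, V), row t scores the token at position t given the
                reference prefix as input (not the model's own samples)
    target_ids: shape (T,), reference token ids
    """
    logits = np.asarray(logits, dtype=float)
    target_ids = np.asarray(target_ids)
    # Numerically stable log-softmax: subtract the row max before exponentiating.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_z = np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    log_probs = shifted - log_z
    # Pick out the entry for the reference token at every position.
    return log_probs[np.arange(len(target_ids)), target_ids]
```

Because the reference tokens are held fixed across checkpoints, the resulting per-token values can be compared step to step, which is what a linearity measurement in output space would need.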
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for identifying a point that can strengthen the causal interpretation of our results. We address the major comment below.
Point-by-point responses
Referee: [Controlled experiments and theoretical analysis sections] The central explanatory claim—that linearity arises specifically because high-variance noisy RLVR signals act as a low-pass filter—lacks a direct ablation that holds model, data, optimizer, and loss formulation fixed while varying only signal variance (e.g., SFT versus RLVR on the same verifiable-reward data, or PPO with reduced advantage variance). Without this contrast, the observed R² > 0.7 could be an artifact of gradient descent on LLMs rather than RLVR-specific noise, weakening both the theoretical analysis and the justification for the extrapolation interventions.
Authors: We agree that the current set of controlled experiments, while spanning multiple RL algorithms and noise regimes, does not include the most direct isolation of signal variance requested. Existing comparisons (across PPO, GRPO, and other variants) vary advantage estimation and reward structure but do not hold the underlying loss and data identical to an SFT baseline on the same verifiable-reward corpus. We will add the suggested ablations (SFT versus RLVR on identical data, and PPO runs with explicitly reduced advantage variance) in the revised controlled-experiments section. The new results will be used to update the theoretical analysis and to clarify the extent to which the observed linearity is attributable to RLVR-specific noise rather than generic gradient descent on LLMs. These additions will also reinforce the motivation for the extrapolation techniques.
Revision: yes
Circularity Check
No significant circularity; linearity reported as empirical observation with independent extrapolation tests
Full rationale
The paper presents the linear regime (R² > 0.7 in weights and teacher-forced log-probabilities) as a discovered empirical regularity across model families, RL algorithms, and configurations. The attribution to high-variance RLVR signals as a low-pass filter is supported by controlled experiments and theoretical analysis rather than by redefining the observed linearity in terms of itself. Weight-space and output-space extrapolation are then validated by direct performance comparison against standard RL optimization on mathematical and coding benchmarks, yielding measurable speedups and gains (6.1x and +4.2%). No equations or steps reduce a claimed prediction to a fitted parameter by construction, nor does any load-bearing premise collapse to a self-citation chain. The derivation chain remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: RLVR training signals are high-variance and noisy and therefore act as a low-pass filter.
Forward citations
Cited by 2 Pith papers
- You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories
  RELEX extrapolates LLM checkpoints from short RLVR prefixes by projecting deltas onto a rank-1 subspace and fitting a linear trend, matching full training performance at 15% of the steps.
- Alignment Dynamics in LLM Fine-Tuning
  The paper introduces a dynamical model that decomposes alignment updates in LLM fine-tuning into rebound and driving forces and predicts a rehearsal priming effect.