You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories
Pith reviewed 2026-05-21 05:15 UTC · model grok-4.3
The pith
A rank-1 linear extrapolation of early RLVR parameter deltas matches full training performance with only 15 percent of the steps.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RLVR weight trajectories are extremely low-rank and highly predictable. The majority of downstream performance gains are captured by a rank-1 approximation of the parameter deltas, where the magnitude of this projection evolves near-linearly with training steps. Estimating the subspace from a short early window and extrapolating via linear regression produces checkpoints that match or exceed full RLVR results while using far fewer steps and no extra compute.
What carries the argument
Rank-1 subspace of parameter deltas whose projection magnitude is regressed linearly against training step count.
If this is right
- RELEX matches or exceeds full RLVR performance on in-domain and out-of-domain benchmarks.
- Only 15 percent of full RLVR steps are needed to reach the same or better results.
- Extrapolation to 10-20 times the observed prefix continues to improve scores at zero training cost.
- Raising the subspace rank or switching to non-linear models yields no further gains.
- Projecting onto the rank-1 subspace removes stochastic optimization noise that harms long-horizon predictions.
Where Pith is reading between the lines
- The low-rank structure may appear in other optimization settings such as supervised fine-tuning or continued pre-training.
- If the linear trend generalizes, short observation windows could let researchers cheaply simulate many long training runs before committing compute.
- The denoising effect suggests that much of the apparent randomness in late-stage RLVR updates is avoidable by staying in the dominant direction.
- Similar rank-1 predictability could be tested on models trained with different verifiable reward functions or on non-reasoning tasks.
Load-bearing premise
The rank-1 subspace estimated from the first 50 steps and the near-linear growth of its magnitude stay stable and continue to predict gains when extended to hundreds or thousands of later steps.
What would settle it
Run full RLVR for 1000 steps on one of the tested models, use only the first 50 steps to fit the rank-1 line, generate the extrapolated checkpoint at step 1000, and compare its benchmark scores to those of the actual step-1000 checkpoint.
Figures
read the original abstract
Reinforcement learning with verifiable rewards (RLVR) has become a dominant paradigm for improving reasoning in large language models (LLMs), yet the underlying geometry of the resulting parameter trajectories remains underexplored. In this work, we demonstrate that RLVR weight trajectories are extremely low-rank and highly predictable. Specifically, we find that the majority of downstream performance gains are captured by a rank-1 approximation of the parameter deltas, where the magnitude of this projection evolves near-linearly with training steps. Motivated by this, we propose a simple and compute-efficient method RELEX (REinforcement Learning EXtrapolation), which estimates the rank-1 subspace from a short observation window and extrapolates future checkpoints via linear regression, with no learned model required. Across three models (i.e., Qwen2.5-Math-1.5B, Qwen3-4B-Base, and Qwen3-8B-Base), RELEX produces checkpoints that match or exceed RLVR performance on both in-domain and out-of-domain benchmarks, requiring as few as 15% steps of full RLVR training. Remarkably, RELEX is able to extrapolate far beyond the observation window at no training cost, predicting checkpoints up to 10-20$\times$ beyond the observed prefix with continued improvement (e.g., observe only the first 50 steps and extrapolate to 1000 steps). Our ablation analysis confirms the minimalist sufficiency of RELEX: neither increasing the subspace rank nor employing non-linear modeling yields further gains in extrapolation. Finally, we show that RELEX's success stems from a "denoising" effect: by projecting updates onto the rank-1 subspace, the model discards stochastic optimization noise that would otherwise degrade performance during extrapolation. Our code is available at https://github.com/weizhepei/RELEX.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that RLVR training trajectories for LLMs are extremely low-rank, with most performance gains captured by a rank-1 approximation where the projection magnitude evolves near-linearly with training steps. They introduce RELEX to estimate this rank-1 subspace from a short early window and extrapolate future checkpoints via linear regression, achieving comparable or better benchmark performance than full RLVR with as little as 15% of the steps, and successfully extrapolating 10-20x beyond the observed prefix on three Qwen models.
Significance. If validated, this has substantial significance for reducing the computational cost of RLVR fine-tuning in LLMs. The finding that a simple linear rank-1 extrapolation suffices, with no gains from higher ranks or non-linear models, suggests a highly structured optimization landscape. The denoising effect explanation adds insight into why such extrapolation works. Strengths include the reproducible code and the falsifiable prediction of continued improvement via extrapolation.
major comments (2)
- [Abstract and extrapolation experiments] The central claim that the rank-1 direction estimated from the first 50 steps remains predictive up to step 1000 requires explicit verification that the subspace does not drift. The manuscript should report the alignment (e.g., cosine similarity) between the early rank-1 vector and the actual parameter updates at later extrapolated checkpoints, as drift would invalidate the linear extrapolation.
- [Ablation studies] The statement that higher-rank or non-linear models add no value needs to specify whether these comparisons were evaluated on the extrapolated performance or only on fitting the observation window. If the latter, it does not fully support the claim for the extrapolation task.
minor comments (2)
- [Experimental details] The abstract mentions matching or superior results but lacks mention of error bars, number of runs, or exact data selection criteria for the benchmarks; including these would strengthen the quantitative claims.
- [Notation] Clarify the precise definition of the rank-1 projection and the linear regression setup in the methods section for reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and insightful comments. We address each major comment below and will revise the manuscript to incorporate the requested verifications and clarifications.
read point-by-point responses
-
Referee: [Abstract and extrapolation experiments] The central claim that the rank-1 direction estimated from the first 50 steps remains predictive up to step 1000 requires explicit verification that the subspace does not drift. The manuscript should report the alignment (e.g., cosine similarity) between the early rank-1 vector and the actual parameter updates at later extrapolated checkpoints, as drift would invalidate the linear extrapolation.
Authors: We agree that explicit verification of subspace stability strengthens the central claim. In the revised manuscript we will add a new analysis (in Section 4.2) reporting cosine similarity between the rank-1 vector estimated from the first 50 steps and the actual parameter updates at later checkpoints (e.g., steps 100, 200, ..., 1000). Preliminary internal checks show sustained high alignment (cosine similarity > 0.85 throughout), which supports the validity of linear extrapolation. This addition will be included with the corresponding figure. revision: yes
-
Referee: [Ablation studies] The statement that higher-rank or non-linear models add no value needs to specify whether these comparisons were evaluated on the extrapolated performance or only on fitting the observation window. If the latter, it does not fully support the claim for the extrapolation task.
Authors: We thank the referee for this clarification request. The ablation comparisons were performed on extrapolated performance: we measured benchmark scores of the extrapolated checkpoints (beyond the observation window) for rank-2/3 subspaces and non-linear (quadratic) fits, finding no improvement over rank-1 linear extrapolation. To remove any ambiguity we will revise the text in Section 4.3 to explicitly state that all comparisons use extrapolated checkpoints and add a brief description of the evaluation protocol. revision: yes
Circularity Check
No significant circularity in RELEX rank-1 extrapolation
full rationale
The paper observes low-rank structure empirically from RLVR trajectories, estimates the rank-1 subspace and linear magnitude fit exclusively from a short early prefix (e.g. first 50 steps), then extrapolates to far-future checkpoints (e.g. 1000 steps) without incorporating any target values or deltas from the extrapolation horizon into the fit. This constitutes standard out-of-sample prediction tested against held-out later checkpoints and external benchmarks, not a reduction of the claimed result to its own inputs by construction. No self-definitional loops, load-bearing self-citations, or renaming of known results appear in the derivation.
Axiom & Free-Parameter Ledger
free parameters (2)
- rank-1 direction vector
- linear regression slope and intercept
axioms (1)
- domain assumption RLVR parameter trajectories are dominated by a single direction whose magnitude evolves near-linearly
Reference graph
Works this paper leans on
-
[1]
Low-rank Optimization Trajectories Modeling for LLM RLVR Acceleration
Zhipeng Chen, Tao Qian, Wayne Xin Zhao, and Ji-Rong Wen. Low-rank optimization trajectories modeling for LLM RLVR acceleration.arXiv preprint arXiv:2604.11446,
work page internal anchor Pith review Pith/arXiv arXiv
-
[2]
Beyond Benchmarks: MathArena as an Evaluation Platform for Mathematics with LLMs
Jasper Dekoninck, Nikola Jovanovi´c, Tim Gehrunger, Kári Rögnvaldsson, Ivo Petrov, Chenhao Sun, and Martin Vechev. Beyond benchmarks: MathArena as an evaluation platform for mathematics with llms.arXiv preprint arXiv:2605.00674,
work page internal anchor Pith review Pith/arXiv arXiv
-
[3]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning.arXiv preprint arXiv:2501.12948,
work page internal anchor Pith review Pith/arXiv arXiv
-
[4]
The Implicit Curriculum: Learning Dynamics in RL with Verifiable Rewards
Kexin Huang, Haoming Meng, Junkang Wu, Jinda Lu, Chiyu Ma, Ziqian Chen, Xue Wang, Bolin Ding, Jiancan Wu, Xiang Wang, Xiangnan He, Guoyin Wang, and Jingren Zhou. Beyond magni- tude: Leveraging direction of RLVR updates for LLM reasoning. InThe Fourteenth International Conference on Learning Representations, 2026a. Yu Huang, Zixin Wen, Yuejie Chi, Yuting W...
work page internal anchor Pith review Pith/arXiv arXiv
-
[5]
Scaling Laws for Neural Language Models
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361,
work page internal anchor Pith review Pith/arXiv arXiv 2001
-
[6]
Team Olmo, Allyson Ettinger, Amanda Bertsch, Bailey Kuehl, David Graham, David Heineman, Dirk Groeneveld, Faeze Brahman, Finbarr Timbers, Hamish Ivison, et al. Olmo 3.arXiv preprint arXiv:2512.13961,
work page internal anchor Pith review Pith/arXiv arXiv
-
[7]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300,
work page internal anchor Pith review Pith/arXiv arXiv
-
[8]
Tianle Wang, Zhongyuan Wu, Shenghao Jin, Hao Xu, Wei Chen, and Ning Miao. Not all steps are informative: On the linearity of LLMs’ RLVR training.arXiv preprint arXiv:2601.04537,
-
[9]
Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement
An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jian- hong Tu, Jingren Zhou, Junyang Lin, et al. Qwen2.5-math technical report: Toward mathematical expert model via self-improvement.arXiv preprint arXiv:2409.12122,
work page internal anchor Pith review Pith/arXiv arXiv
-
[10]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388,
work page internal anchor Pith review Pith/arXiv arXiv
-
[11]
On the Implicit Reward Overfitting and the Low-rank Dynamics in RLVR
Hao Ye, Jisheng Dang, Junfeng Fang, Bimei Wang, Yizhou Zhang, Ning Lv, Wencan Zhang, Hong Peng, Bin Hu, and Tat-Seng Chua. On the implicit reward overfitting and the low-rank dynamics in RLVR.arXiv preprint arXiv:2605.06523,
work page internal anchor Pith review Pith/arXiv arXiv
-
[12]
Generative modeling of weights: Generalization or memorization?arXiv preprint arXiv:2506.07998,
Boya Zeng, Yida Yin, Zhiqiu Xu, and Zhuang Liu. Generative modeling of weights: Generalization or memorization?arXiv preprint arXiv:2506.07998,
-
[13]
Pan, Zhangyang Wang, Yuandong Tian, and Kai Sheng Tai
Hanqing Zhu, Zhenyu Zhang, Hanxian Huang, DiJia Su, Zechun Liu, Jiawei Zhao, Igor Fedorov, Hamed Pirsiavash, Zhizhou Sha, Jinwon Lee, et al. The path not taken: RLVR provably learns off the principals.arXiv preprint arXiv:2511.08567, 2025a. Xinyu Zhu, Mengzhou Xia, Zhepei Wei, Wei-Lin Chen, Danqi Chen, and Yu Meng. The surprising effectiveness of negative...
-
[14]
All runs across three models (i.e., Qwen2.5-Math-1.5B, Qwen3-4B-Base, Qwen3-8B-Base) are trained for 500 optimization steps on 8xH200 GPUs. Inference details.For in-domain MATH evaluation, Qwen2.5-Math-1.5B uses greedy decoding with a 4K-token budget, while Qwen3-family models use sampling decoding with a 16K-token budget. For OOD benchmarks, we use avg@8...
work page 2026
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.