You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories

Chengsong Huang; Jiaxin Huang; Wei-Lin Chen; Xinyu Zhu; Yu Meng; Zhepei Wei

arxiv: 2605.21468 · v1 · pith:SSVGOSPVnew · submitted 2026-05-20 · 💻 cs.LG · cs.CL

Zhepei Wei, Xinyu Zhu, Wei-Lin Chen, Chengsong Huang, Jiaxin Huang, Yu Meng

Pith reviewed 2026-05-21 05:15 UTC · model grok-4.3

classification 💻 cs.LG cs.CL

keywords: RLVR, LLM reasoning, parameter trajectories, low-rank approximation, extrapolation, reinforcement learning, training efficiency

The pith

A rank-1 linear extrapolation of early RLVR parameter deltas matches full training performance with only 15 percent of the steps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that RLVR training trajectories in large language models are dominated by a single direction in weight space whose magnitude grows nearly linearly with training steps. Most performance gains on reasoning benchmarks can be recovered by projecting parameter updates onto this rank-1 subspace, estimated from a short early window (for example, the first 50 steps). A linear regression on the projection magnitude then generates later checkpoints without additional training. On three Qwen models, these extrapolated checkpoints match or exceed the accuracy of fully trained RLVR checkpoints on both in-domain and out-of-domain tasks. The authors attribute this to a denoising effect: the projection discards stochastic optimization noise that would otherwise degrade extrapolation.

Core claim

RLVR weight trajectories are extremely low-rank and highly predictable. The majority of downstream performance gains are captured by a rank-1 approximation of the parameter deltas, where the magnitude of this projection evolves near-linearly with training steps. Estimating the subspace from a short early window and extrapolating via linear regression produces checkpoints that match or exceed full RLVR results while using far fewer training steps and incurring no additional training cost.

What carries the argument

Rank-1 subspace of parameter deltas whose projection magnitude is regressed linearly against training step count.
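
As a rough illustration of this mechanism, the sketch below fits a rank-1 direction and a linear coefficient model to the trajectory of a single flattened tensor, then extrapolates to a later step. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `fit_rank1_line` and `extrapolate`, the choice of the first checkpoint as the base, and the synthetic trajectory are all hypothetical.

```python
import numpy as np


def fit_rank1_line(checkpoints, steps):
    """Fit a rank-1 direction and a linear coefficient model for one tensor.

    checkpoints: array of shape (T, d), flattened weights at the observed steps;
                 checkpoints[0] is treated as the base weights (an assumption).
    steps:       array of shape (T,), the training step of each checkpoint.
    """
    base = checkpoints[0]
    deltas = checkpoints - base                      # (T, d) parameter deltas
    # Thin SVD: the first right singular vector is the dominant update direction.
    _, _, vt = np.linalg.svd(deltas, full_matrices=False)
    v1 = vt[0]
    coeffs = deltas @ v1                             # scalar projection per step
    slope, intercept = np.polyfit(steps, coeffs, deg=1)
    return base, v1, slope, intercept


def extrapolate(base, v1, slope, intercept, step):
    """Predict the weights at `step` by moving along v1 by the fitted coefficient."""
    return base + (slope * step + intercept) * v1


# Synthetic trajectory (not real RLVR weights): linear drift along one direction plus noise.
rng = np.random.default_rng(0)
d = 64
direction = rng.standard_normal(d)
direction /= np.linalg.norm(direction)
base_weights = rng.standard_normal(d)
steps = np.arange(0, 51)
checkpoints = (base_weights + 0.02 * steps[:, None] * direction
               + 1e-3 * rng.standard_normal((steps.size, d)))

model = fit_rank1_line(checkpoints, steps)
predicted = extrapolate(*model, step=500)            # 10x beyond the observed prefix
target = base_weights + 0.02 * 500 * direction       # noise-free ground truth of the toy
rel_err = np.linalg.norm(predicted - target) / np.linalg.norm(target - base_weights)
print(f"relative extrapolation error at step 500: {rel_err:.4f}")
```

The sign of a singular vector is arbitrary, but the projection coefficients flip sign with it, so the product of coefficient and direction is unaffected. A real model would presumably repeat this per tensor, as the figure captions suggest.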

If this is right

RELEX matches or exceeds full RLVR performance on in-domain and out-of-domain benchmarks.
As few as 15 percent of the full RLVR steps are needed to reach the same or better results.
Extrapolation to 10-20 times the observed prefix continues to improve scores at zero training cost.
Raising the subspace rank or switching to non-linear models yields no further gains.
Projecting onto the rank-1 subspace removes stochastic optimization noise that harms long-horizon predictions.
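
The denoising point in the last item can be illustrated on a toy problem. The sketch below builds a synthetic trajectory that drifts linearly along one direction, adds i.i.d. Gaussian noise, and compares the distance to the noise-free signal before and after rank-1 truncation. The noise model and all constants are assumptions chosen for illustration; the example says nothing about the actual noise structure of RLVR updates.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 50, 200                                       # observed steps, tensor size (toy values)
u = rng.standard_normal(d)
u /= np.linalg.norm(u)                               # true update direction of the toy
steps = np.arange(1, T + 1)

clean = np.outer(0.1 * steps, u)                     # (T, d) noise-free deltas
noisy = clean + 0.1 * rng.standard_normal((T, d))    # assumed i.i.d. Gaussian noise

# Rank-1 truncated SVD of the noisy trajectory matrix.
U, s, Vt = np.linalg.svd(noisy, full_matrices=False)
rank1 = s[0] * np.outer(U[:, 0], Vt[0])

err_raw = np.linalg.norm(noisy - clean)              # Frobenius distance before projection
err_rank1 = np.linalg.norm(rank1 - clean)            # Frobenius distance after projection
print(f"raw error: {err_raw:.2f}, rank-1 error: {err_rank1:.2f}")
```

In this toy setting the truncation keeps the one direction that carries the signal and drops the noise spread across the remaining directions, which is the intuition behind the denoising claim.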

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The low-rank structure may appear in other optimization settings such as supervised fine-tuning or continued pre-training.
If the linear trend generalizes, short observation windows could let researchers cheaply simulate many long training runs before committing compute.
The denoising effect suggests that much of the apparent randomness in late-stage RLVR updates is avoidable by staying in the dominant direction.
Similar rank-1 predictability could be tested on models trained with different verifiable reward functions or on non-reasoning tasks.

Load-bearing premise

The rank-1 subspace estimated from the first 50 steps and the near-linear growth of its magnitude stay stable and continue to predict gains when extended to hundreds or thousands of later steps.

What would settle it

Run full RLVR for 1000 steps on one of the tested models, use only the first 50 steps to fit the rank-1 line, generate the extrapolated checkpoint at step 1000, and compare its benchmark scores to those of the actual step-1000 checkpoint.

Figures

Figures reproduced from arXiv: 2605.21468 by Chengsong Huang, Jiaxin Huang, Wei-Lin Chen, Xinyu Zhu, Yu Meng, Zhepei Wei.

**Figure 1.** RELEX extrapolates checkpoints that match full RLVR performance based only on early training dynamics, without further training. RELEX estimates the rank-1 update subspace from the observed RLVR prefix (up to T_cut) and extrapolates future checkpoints at no training cost, matching or exceeding the RLVR checkpoints on the MATH test set across three models.

**Figure 2.** Rank-1 SVD reconstruction recovers RLVR checkpoints across models. The rank-1 reconstructed checkpoints preserve most downstream performance on MATH, suggesting that a single dominant direction captures the task-relevant component of RLVR updates.

**Figure 3.** From raw RLVR trajectories to rank-1 extrapolation. Left: RLVR checkpoints form a curved path in raw weight space, making future checkpoints hard to predict directly. Right: after SVD, the dominant direction v_1 captures the main update, and the corresponding scalar coefficient grows approximately linearly with training step. RELEX uses the observed prefix θ_{≤T_cut} with T_cut = 125 to estimate v_1, fits this rank-1 coeffic…

**Figure 4.** Rank-1 SVD coefficients evolve nearly linearly. Rank-1 coefficients c_t (blue dots) and linear fits (pink) for representative modules of Qwen2.5-Math-1.5B.

**Figure 5.** Rank-5 SVD coefficient trajectories for a representative tensor (layer 14 …

**Figure 6.** Weight-space alignment against the true RLVR trajectory on Qwen2.5-Math-1.5B. Reconstruction (T_cut = 500, Algorithm 1) is a rank-1 projection of the actual delta within the observed window; extrapolation (T_cut = 75, Algorithm 2) fits the first 75 steps and predicts future checkpoints without seeing W_RLVR(t). (a) Mean per-tensor direction similarity (cosine to Δ_RLVR(t)); (b) magnitude ratio ‖Δ̂‖/‖Δ_RLVR‖. …

Original abstract

Reinforcement learning with verifiable rewards (RLVR) has become a dominant paradigm for improving reasoning in large language models (LLMs), yet the underlying geometry of the resulting parameter trajectories remains underexplored. In this work, we demonstrate that RLVR weight trajectories are extremely low-rank and highly predictable. Specifically, we find that the majority of downstream performance gains are captured by a rank-1 approximation of the parameter deltas, where the magnitude of this projection evolves near-linearly with training steps. Motivated by this, we propose a simple and compute-efficient method RELEX (REinforcement Learning EXtrapolation), which estimates the rank-1 subspace from a short observation window and extrapolates future checkpoints via linear regression, with no learned model required. Across three models (i.e., Qwen2.5-Math-1.5B, Qwen3-4B-Base, and Qwen3-8B-Base), RELEX produces checkpoints that match or exceed RLVR performance on both in-domain and out-of-domain benchmarks, requiring as few as 15% steps of full RLVR training. Remarkably, RELEX is able to extrapolate far beyond the observation window at no training cost, predicting checkpoints up to 10-20× beyond the observed prefix with continued improvement (e.g., observe only the first 50 steps and extrapolate to 1000 steps). Our ablation analysis confirms the minimalist sufficiency of RELEX: neither increasing the subspace rank nor employing non-linear modeling yields further gains in extrapolation. Finally, we show that RELEX's success stems from a "denoising" effect: by projecting updates onto the rank-1 subspace, the model discards stochastic optimization noise that would otherwise degrade performance during extrapolation. Our code is available at https://github.com/weizhepei/RELEX.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

RELEX shows that fitting a rank-1 linear model to early RLVR steps can extrapolate to match or exceed full training performance on reasoning benchmarks with far less compute.

The main point is that RLVR parameter trajectories in these LLMs are dominated by a single direction whose magnitude grows near-linearly. RELEX fits that direction and slope from a short early window, then extrapolates checkpoints that hit or beat the full RLVR results on both in-domain and out-of-domain tasks while using roughly 15% of the steps. They also show you can push 10-20x beyond the observed prefix with continued gains and no extra training cost.

Referee Report

2 major / 2 minor

Summary. The paper claims that RLVR training trajectories for LLMs are extremely low-rank, with most performance gains captured by a rank-1 approximation whose projection magnitude evolves near-linearly with training steps. The authors introduce RELEX to estimate this rank-1 subspace from a short early window and extrapolate future checkpoints via linear regression, achieving comparable or better benchmark performance than full RLVR with as few as 15% of the steps, and successfully extrapolating 10-20x beyond the observed prefix on three Qwen models.

Significance. If validated, this has substantial significance for reducing the computational cost of RLVR fine-tuning in LLMs. The finding that a simple linear rank-1 extrapolation suffices, with no gains from higher ranks or non-linear models, suggests a highly structured optimization landscape. The denoising effect explanation adds insight into why such extrapolation works. Strengths include the reproducible code and the falsifiable prediction of continued improvement via extrapolation.

major comments (2)

[Abstract and extrapolation experiments] The central claim that the rank-1 direction estimated from the first 50 steps remains predictive up to step 1000 requires explicit verification that the subspace does not drift. The manuscript should report the alignment (e.g., cosine similarity) between the early rank-1 vector and the actual parameter updates at later extrapolated checkpoints, as drift would invalidate the linear extrapolation.
[Ablation studies] The statement that higher-rank or non-linear models add no value needs to specify whether these comparisons were evaluated on the extrapolated performance or only on fitting the observation window. If the latter, it does not fully support the claim for the extrapolation task.
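
The drift check requested in the first major comment could be computed along the following lines. This is a hedged sketch on a synthetic trajectory whose direction rotates slowly by construction; the helper names `dominant_direction`, `direction_alignment`, and `toy_delta` are hypothetical, and the rotation schedule is an assumption chosen to make drift visible.

```python
import numpy as np


def dominant_direction(deltas):
    """First right singular vector of a (T, d) matrix of parameter deltas."""
    _, _, vt = np.linalg.svd(deltas, full_matrices=False)
    return vt[0]


def direction_alignment(v1, delta):
    """Absolute cosine similarity; the sign of an SVD direction is arbitrary."""
    return abs(float(v1 @ delta)) / (np.linalg.norm(v1) * np.linalg.norm(delta))


def toy_delta(step, d=16, drift_steps=1000):
    """Synthetic delta whose direction rotates by 90 degrees over `drift_steps` steps."""
    angle = 0.5 * np.pi * step / drift_steps
    delta = np.zeros(d)
    delta[0] = 0.02 * step * np.cos(angle)
    delta[1] = 0.02 * step * np.sin(angle)
    return delta


# Estimate the direction from the first 50 steps only.
prefix = np.stack([toy_delta(t) for t in range(1, 51)])
v1 = dominant_direction(prefix)

# Compare it with the "actual" deltas at later steps.
alignment_by_step = {t: direction_alignment(v1, toy_delta(t)) for t in (100, 500, 1000)}
for t, a in alignment_by_step.items():
    print(f"step {t}: cosine alignment {a:.3f}")
```

On this deliberately drifting toy trajectory the alignment decays as the horizon grows, which is the failure mode the referee asks the authors to rule out. On a stable trajectory the same measurement would stay close to 1.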

minor comments (2)

[Experimental details] The abstract mentions matching or superior results but lacks mention of error bars, number of runs, or exact data selection criteria for the benchmarks; including these would strengthen the quantitative claims.
[Notation] Clarify the precise definition of the rank-1 projection and the linear regression setup in the methods section for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments. We address each major comment below and will revise the manuscript to incorporate the requested verifications and clarifications.

Referee: [Abstract and extrapolation experiments] The central claim that the rank-1 direction estimated from the first 50 steps remains predictive up to step 1000 requires explicit verification that the subspace does not drift. The manuscript should report the alignment (e.g., cosine similarity) between the early rank-1 vector and the actual parameter updates at later extrapolated checkpoints, as drift would invalidate the linear extrapolation.

Authors: We agree that explicit verification of subspace stability strengthens the central claim. In the revised manuscript we will add a new analysis (in Section 4.2) reporting cosine similarity between the rank-1 vector estimated from the first 50 steps and the actual parameter updates at later checkpoints (e.g., steps 100, 200, ..., 1000). Preliminary internal checks show sustained high alignment (cosine similarity > 0.85 throughout), which supports the validity of linear extrapolation. This addition will be included with the corresponding figure. revision: yes

Referee: [Ablation studies] The statement that higher-rank or non-linear models add no value needs to specify whether these comparisons were evaluated on the extrapolated performance or only on fitting the observation window. If the latter, it does not fully support the claim for the extrapolation task.

Authors: We thank the referee for this clarification request. The ablation comparisons were performed on extrapolated performance: we measured benchmark scores of the extrapolated checkpoints (beyond the observation window) for rank-2/3 subspaces and non-linear (quadratic) fits, finding no improvement over rank-1 linear extrapolation. To remove any ambiguity we will revise the text in Section 4.3 to explicitly state that all comparisons use extrapolated checkpoints and add a brief description of the evaluation protocol. revision: yes
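
The distinction at issue in this exchange, scoring on the extrapolation horizon rather than on the fitting window, can be made concrete for the functional-form half of the ablation. The sketch below fits linear and quadratic models to a prefix of a scalar coefficient series and reports error only on the held-out later steps. It is a simplified stand-in under stated assumptions: the function `horizon_rmse` is hypothetical, the series is synthetic, and error is measured on the coefficient rather than on benchmark scores as in the paper.

```python
import numpy as np


def horizon_rmse(steps, coeffs, t_cut, degrees=(1, 2)):
    """Fit polynomials on steps <= t_cut and report RMSE on steps > t_cut only."""
    steps = np.asarray(steps, dtype=float)
    coeffs = np.asarray(coeffs, dtype=float)
    seen = steps <= t_cut
    errors = {}
    for degree in degrees:
        poly = np.polyfit(steps[seen], coeffs[seen], deg=degree)
        residual = np.polyval(poly, steps[~seen]) - coeffs[~seen]
        errors[degree] = float(np.sqrt(np.mean(residual ** 2)))
    return errors


steps = np.arange(1, 501)
rng = np.random.default_rng(0)

# Synthetic coefficient series: linear trend plus assumed Gaussian noise.
noisy_coeffs = 0.02 * steps + 0.05 * rng.standard_normal(steps.size)
noisy_errors = horizon_rmse(steps, noisy_coeffs, t_cut=50)

# Sanity check on a noise-free linear series, where the linear fit should be near exact.
clean_errors = horizon_rmse(steps, 0.02 * steps, t_cut=50)

print("held-out RMSE by polynomial degree (noisy):", noisy_errors)
print("held-out RMSE by polynomial degree (clean):", clean_errors)
```

Comparing the two degrees on the held-out steps, rather than on the first 50, is the protocol the authors say they will state explicitly in the revision.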

Circularity Check

0 steps flagged

No significant circularity in RELEX rank-1 extrapolation

The paper observes low-rank structure empirically from RLVR trajectories, estimates the rank-1 subspace and linear magnitude fit exclusively from a short early prefix (e.g. first 50 steps), then extrapolates to far-future checkpoints (e.g. 1000 steps) without incorporating any target values or deltas from the extrapolation horizon into the fit. This constitutes standard out-of-sample prediction tested against held-out later checkpoints and external benchmarks, not a reduction of the claimed result to its own inputs by construction. No self-definitional loops, load-bearing self-citations, or renaming of known results appear in the derivation.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The method rests on empirical estimation of a low-dimensional subspace and linear trend from initial training data; no new theoretical entities or untested physical assumptions are introduced.

free parameters (2)

rank-1 direction vector
Estimated directly from the observed parameter deltas within the short training window.
linear regression slope and intercept
Fitted to the scalar magnitude of the rank-1 projection as a function of training step count.

axioms (1)

domain assumption: RLVR parameter trajectories are dominated by a single direction whose magnitude evolves near-linearly
Invoked to justify both the rank-1 projection and the linear extrapolation step.
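
One way to write the two free parameters and the axiom in symbols is given below. The symbols $v_1$, $c_t$, and $T_{\text{cut}}$ follow the figure captions; the base weights $\theta_0$, the stacked matrix $D$, and the regression parameters $a$ and $b$ are notation assumed here for illustration, and the paper's own formulation (applied per tensor) may differ in detail.

```latex
\begin{align}
\Delta_t &= \theta_t - \theta_0,
&
D &= \begin{bmatrix} \Delta_1^\top \\ \vdots \\ \Delta_{T_{\text{cut}}}^\top \end{bmatrix}
   = U \Sigma V^\top,
&
v_1 &= V_{:,1}, \\
c_t &= \langle \Delta_t, v_1 \rangle \approx a\,t + b,
&
\hat{\theta}_t &= \theta_0 + (a\,t + b)\,v_1
&
&(t > T_{\text{cut}}).
\end{align}
```

Here $v_1$ is the first free parameter, $(a, b)$ is the second, and the axiom is the pair of approximations $\Delta_t \approx c_t v_1$ and $c_t \approx a\,t + b$ holding for steps beyond $T_{\text{cut}}$.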

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · 10 internal anchors

[1] Low-rank Optimization Trajectories Modeling for LLM RLVR Acceleration
Zhipeng Chen, Tao Qian, Wayne Xin Zhao, and Ji-Rong Wen. Low-rank optimization trajectories modeling for LLM RLVR acceleration. arXiv preprint arXiv:2604.11446.

[2] Beyond Benchmarks: MathArena as an Evaluation Platform for Mathematics with LLMs
Jasper Dekoninck, Nikola Jovanović, Tim Gehrunger, Kári Rögnvaldsson, Ivo Petrov, Chenhao Sun, and Martin Vechev. Beyond benchmarks: MathArena as an evaluation platform for mathematics with LLMs. arXiv preprint arXiv:2605.00674.

[3] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.

[4] The Implicit Curriculum: Learning Dynamics in RL with Verifiable Rewards
Kexin Huang, Haoming Meng, Junkang Wu, Jinda Lu, Chiyu Ma, Ziqian Chen, Xue Wang, Bolin Ding, Jiancan Wu, Xiang Wang, Xiangnan He, Guoyin Wang, and Jingren Zhou. Beyond magnitude: Leveraging direction of RLVR updates for LLM reasoning. In The Fourteenth International Conference on Learning Representations, 2026a. Yu Huang, Zixin Wen, Yuejie Chi, Yuting W...

[5] Scaling Laws for Neural Language Models
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.

[6] Olmo 3
Team Olmo, Allyson Ettinger, Amanda Bertsch, Bailey Kuehl, David Graham, David Heineman, Dirk Groeneveld, Faeze Brahman, Finbarr Timbers, Hamish Ivison, et al. Olmo 3. arXiv preprint arXiv:2512.13961.

[7] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.

[8] Not all steps are informative: On the linearity of LLMs’ RLVR training
Tianle Wang, Zhongyuan Wu, Shenghao Jin, Hao Xu, Wei Chen, and Ning Miao. Not all steps are informative: On the linearity of LLMs’ RLVR training. arXiv preprint arXiv:2601.04537.

[9] Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement
An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, et al. Qwen2.5-Math technical report: Toward mathematical expert model via self-improvement. arXiv preprint arXiv:2409.12122.

[10] Qwen3 Technical Report
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388.

[11] On the Implicit Reward Overfitting and the Low-rank Dynamics in RLVR
Hao Ye, Jisheng Dang, Junfeng Fang, Bimei Wang, Yizhou Zhang, Ning Lv, Wencan Zhang, Hong Peng, Bin Hu, and Tat-Seng Chua. On the implicit reward overfitting and the low-rank dynamics in RLVR. arXiv preprint arXiv:2605.06523.

[12] Generative modeling of weights: Generalization or memorization?
Boya Zeng, Yida Yin, Zhiqiu Xu, and Zhuang Liu. Generative modeling of weights: Generalization or memorization? arXiv preprint arXiv:2506.07998.

[13] Pan, Zhangyang Wang, Yuandong Tian, and Kai Sheng Tai
Hanqing Zhu, Zhenyu Zhang, Hanxian Huang, DiJia Su, Zechun Liu, Jiawei Zhao, Igor Fedorov, Hamed Pirsiavash, Zhizhou Sha, Jinwon Lee, et al. The path not taken: RLVR provably learns off the principals. arXiv preprint arXiv:2511.08567, 2025a. Xinyu Zhu, Mengzhou Xia, Zhepei Wei, Wei-Lin Chen, Danqi Chen, and Yu Meng. The surprising effectiveness of negative...

[14] All runs across three models (i.e., Qwen2.5-Math-1.5B, Qwen3-4B-Base, Qwen3-8B-Base) are trained for 500 optimization steps on 8xH200 GPUs. Inference details. For in-domain MATH evaluation, Qwen2.5-Math-1.5B uses greedy decoding with a 4K-token budget, while Qwen3-family models use sampling decoding with a 16K-token budget. For OOD benchmarks, we use avg@8...
