Where Rollouts Begin: Low-Load, High-Leverage First-Token Diversification for RLVR

Albert No; Soeun Kim

arxiv: 2605.28295 · v1 · pith:6R7T665Fnew · submitted 2026-05-27 · 💻 cs.AI · cs.CL · cs.LG

Pith reviewed 2026-06-29 11:52 UTC · model grok-4.3

keywords: RLVR, rollout diversity, first token, REFT, reinforcement learning, verifiable rewards, reasoning models, pass rates

The pith

Uniform sampling of the first token after the reasoning marker broadens rollout diversity in RLVR.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that in reinforcement learning with verifiable rewards (RLVR), the first token of a rollout is a high-leverage point for increasing diversity. The policy concentrates probability on a few first tokens even though that choice is only weakly tied to final correctness. REFT samples first tokens uniformly from the policy's own top-N candidates and allocates the rollouts in each group evenly across them. The rest of the training pipeline is left untouched, yet the paper reports higher aggregate Pass@1, Pass@8, and Pass@64 than DAPO and GRPO baselines across four base models (0.5B-7B) and three difficulty regimes.

Core claim

REFT improves aggregate Pass@1, Pass@8, and Pass@64 over DAPO and GRPO baselines across four base models (0.5B-7B) and three difficulty regimes by sampling first tokens uniformly from the policy's own top-N candidates and allocating rollouts evenly.

What carries the argument

REFT, a method that samples the first token uniformly from the top-N candidates in the policy's distribution at the position right after the reasoning marker.
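
The allocation step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes the first-token distribution is available as a token-to-probability mapping. The function name `reft_first_tokens`, the treatment of group sizes that are not a multiple of N, and the example probabilities are all hypothetical.

```python
import random


def reft_first_tokens(first_token_probs, group_size, top_n, rng=None):
    """Assign a forced first token to each rollout in a group.

    first_token_probs: dict mapping candidate first tokens to policy probabilities
    group_size: number of rollouts generated for the prompt
    top_n: number of top-ranked first tokens to spread the rollouts over
    """
    if group_size <= 0 or top_n <= 0 or not first_token_probs:
        raise ValueError("need a positive group size, top_n, and a non-empty distribution")
    rng = rng or random.Random()

    # Rank candidates by the policy's own probability and keep the top N.
    ranked = sorted(first_token_probs, key=first_token_probs.get, reverse=True)
    candidates = ranked[:top_n]

    # Even allocation: every candidate receives `base` rollouts.
    base, extra = divmod(group_size, len(candidates))
    # Assumption: leftover rollouts go to distinct candidates chosen uniformly at random.
    bonus = set(rng.sample(candidates, extra))

    assignment = []
    for token in candidates:
        assignment.extend([token] * (base + (1 if token in bonus else 0)))
    return assignment


# Hypothetical, sharply peaked first-token distribution.
probs = {"To": 0.70, "Let": 0.15, "First": 0.10, "The": 0.04, "We": 0.01}
print(reft_first_tokens(probs, group_size=8, top_n=4, rng=random.Random(0)))
```

Each returned token would then be forced as the first token of one rollout, with the rest of the continuation sampled as usual. Note that the probabilities are used only for ranking: within the top N, the heavily favored token gets no more rollouts than the others.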

If this is right

Training on the diversified rollouts yields higher pass rates at multiple evaluation budgets.
The improvement holds across base models from 0.5B to 7B parameters.
The gains are reported across three difficulty regimes.
REFT adds negligible compute because only the first token choice changes.
Every other component of the RLVR pipeline remains unchanged.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the first-token effect generalizes, similar low-cost diversification could be applied at other early positions in the sequence.
Models that already exhibit less peaked first-token distributions may see smaller gains from this approach.
The method could be combined with existing temperature or prefix adjustments for further coverage.
Verification cost stays the same because the number of rollouts per prompt does not increase.

Load-bearing premise

The first token after the reasoning marker is sharply peaked in probability yet largely decoupled from whether the final answer is correct.

What would settle it

Running the same RLVR training with and without REFT on the same base models and seeing no difference in final Pass@1, Pass@8, or Pass@64 scores would falsify the claim.

Figures

Figures reproduced from arXiv: 2605.28295 by Albert No, Soeun Kim.

**Figure 1.** Sharp probability, flat correctness. The model assigns high probability to the top first token, but rollout correctness remains nearly flat across the top-20 ranks. The policy is confident, but the verifier is not. For each prompt $x$, let $f_r(x)$ be the rank-$r$ token under the policy distribution at the first-token position, $\pi_\theta(\cdot \mid x, \texttt{<think>})$. The model's prior over this position is extremely sharp …

**Figure 2.** First tokens route continuations. Semantic diversity is measured after stripping the first token itself, so the effect comes from the rest of the rollout. A rare first token changes the continuation model. To test whether this routing effect is real, we compare four rollout strategies with the same group size: standard sampling, forcing $f_1(x)$, forcing $f_{20}(x)$, and sampling uniformly from the top-20 first …

**Figure 3.** RLVR sharpens the first-token prior. The top-1 first-token probability increases during training, although the first token is only weakly tied to correctness. GRPO sharpens the wrong preference. The bias is not only present before RLVR training; GRPO-style updates can amplify it. A trajectory-level advantage is applied to every token in the rollout. For a rollout whose first token is $F_i$, the update conta…
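
The overcrediting argument in this caption can be illustrated with a small sketch. It assumes the group-normalized advantage commonly used in GRPO-style training (reward minus group mean, divided by group standard deviation) and, as the caption states, that this single scalar is applied to every token of the rollout, including the first. The function names, the choice of population standard deviation, and the toy rewards are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalized, trajectory-level advantages for one rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]


def first_token_credit(first_tokens, rewards):
    """Sum the trajectory-level advantage that lands on each first token."""
    credit = defaultdict(float)
    for token, adv in zip(first_tokens, group_advantages(rewards)):
        # The same scalar advantage multiplies every token's log-prob gradient,
        # so the first token inherits the outcome of the whole rollout.
        credit[token] += adv
    return dict(credit)


# Toy group: three rollouts start with "To", one with "Let".
tokens = ["To", "To", "To", "Let"]
rewards = [1, 1, 0, 0]
print(first_token_credit(tokens, rewards))
```

In this toy group "To" receives positive net credit and "Let" negative credit, even though nothing in the computation checks whether the first token caused the outcome.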

**Figure 4.** Sharper outputs without sacrificing coverage. (a) During training, REFT produces the most semantically diverse rollouts among the four sampling strategies. (b) After training, the inference-time ordering inverts: REFT-trained checkpoints produce the most concentrated outputs, while Tables 1 and 3 show that this concentration coexists with stronger Pass@1 and Pass@64. … exposes each rollout group to multiple early co…

**Figure 5.** Within-group answer diversity. Number of unique final answers per rollout group during training on Qwen2.5-0.5B-Instruct, GSM8K, DAPO. Outcome diversity beyond trajectory diversity. The near-flat first-token-conditioned accuracy in Section 3 shows viability, not equivalence. Plausible alternative first tokens are not destructive to correctness on average, but they can still route generation into differe…

**Figure 6.** REFT reduces all-wrong groups, the informative form of zero-variance reduction. Each panel shows the fraction of zero-variance groups (black) and the all-correct subset (green) under standard sampling (left) and REFT (right). The shaded gap is the all-wrong subset. REFT and standard sampling have relatively similar total zero-variance fractions, but REFT has a smaller all-wrong component. Thus, even when t…
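
The three quantities plotted in this figure can be computed from per-group binary rewards. The sketch below is a hedged illustration: it assumes a 0/1 verifier reward, and the function name `zero_variance_breakdown` and the toy groups are hypothetical. A zero-variance group is one in which every rollout receives the same reward, so a group-normalized advantage is zero for all of its rollouts.

```python
def zero_variance_breakdown(groups):
    """Fractions of zero-variance, all-correct, and all-wrong rollout groups.

    groups: list of non-empty lists of binary rewards (1 = verified correct).
    """
    if not groups or any(len(g) == 0 for g in groups):
        raise ValueError("need at least one group and no empty groups")
    all_correct = sum(1 for g in groups if all(r == 1 for r in g))
    all_wrong = sum(1 for g in groups if all(r == 0 for r in g))
    n = len(groups)
    return {
        "zero_variance": (all_correct + all_wrong) / n,
        "all_correct": all_correct / n,
        "all_wrong": all_wrong / n,  # the shaded gap in the figure
    }


# Toy batch of four rollout groups with four rollouts each.
batch = [[1, 1, 1, 1], [0, 0, 0, 0], [1, 0, 1, 0], [0, 0, 0, 0]]
print(zero_variance_breakdown(batch))
```

Two sampling strategies can share the same `zero_variance` fraction while differing in how it splits between `all_correct` and `all_wrong`, which is the comparison the caption draws.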

**Figure 7.** REFT does not sharpen what the verifier ignores. Top-r first-token probability over training on Qwen2.5-3B + GSM8K. Standard GRPO escalates the prior on the most frequent first token; REFT keeps the prior comparatively flat. REFT mitigates first-token overcrediting. Section 3 showed that standard GRPO assigns a biased preference and it is monotonically escalated during training. Under standard sampling, …

**Figure 8.** First-token rank probabilities for additional model–dataset pairs. Across Qwen2.5-3B on BigMath-Easy, Llama3.2-3B on GSM8K, and Qwen2.5-0.5B on GSM8K, the probability mass remains concentrated in the top few first-token ranks. This supports the diagnosis in Section 3 that first-token concentration is not specific to the single setting used in …

**Figure 9.** Top-20 first-token probabilities for GSM8K prompt 0, under Llama3.2-3B-Instruct.

**Figure 10.** Top-20 first-token probabilities for GSM8K prompt 1, under Llama3.2-3B-Instruct. Panel title: GSM8K prompt 1 (gold answer: 3), "A robe takes 2 bolts of blue fiber and half that much white fiber. How many bolts in total does it take?" Axes: first token (x) against probability (y, 0.0 to 1.0). Tokens shown: To, Let, First, The, We, Since, In, 1, [sp]To, This, Given, [sp]Let, A, There, [sp]First, For, toMath, Imagine, It.

**Figure 11.** Top-20 first-token probabilities for GSM8K prompt 2, under Llama3.2-3B-Instruct. Panel title: GSM8K prompt 2 (gold answer: 70,000), "Josh decides to try flipping a house. He buys a house for $80,000 and then puts in $50,000 in repairs. This increased the value of the house by 150%. How much profit did he make?" Tokens shown: To, Let, First, Josh, 1, The, We, In, [sp]Josh, Since, This, Given, Initially, For, If, [sp]Let, [sp]First, Step, [sp]To, There, Fi…

Original abstract

Reinforcement Learning with Verifiable Rewards (RLVR) trains reasoning models without labeled trajectories, relying on grouped rollouts to expose the policy to alternative reasoning paths and a verifier to score them. Rollout diversity has accordingly emerged as a central bottleneck in RLVR, with most existing methods broadening exploration through temperature, prefix, or rollout-selection adjustments. We identify a structurally distinguished but overlooked position for broadening this diversity: the first token after the reasoning marker. The policy's first-token distribution exhibits a sharply peaked yet correctness-decoupled phenomenon, and this first token position can broaden the regions a rollout group covers without altering the correctness signal. We introduce REFT (Rollout Exploration with First-Token Diversification), a light addition to the RLVR pipeline that samples first tokens uniformly from the policy's own top-$N$ candidates and allocates rollouts evenly, leaving every other component unchanged. Trained on the resulting diversified rollouts, REFT improves aggregate Pass@1, Pass@8, and Pass@64 over DAPO and GRPO baselines across four base models (0.5B-7B) and three difficulty regimes.
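
For readers unfamiliar with the Pass@k metrics named in the abstract, the sketch below shows the widely used unbiased estimator computed from n sampled completions of which c are correct. Whether the paper uses exactly this estimator is not stated in the text available here, so treat it as an assumption. The function name `pass_at_k` and the example counts are hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate the probability that at least one of k samples is correct.

    n: number of completions sampled for the problem
    c: number of those completions judged correct by the verifier
    k: evaluation budget (1 <= k <= n)
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # 1 - P(all k drawn samples are incorrect), drawing without replacement.
    # comb returns 0 when fewer than k incorrect samples exist, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 64 samples, 5 verified correct.
for k in (1, 8, 64):
    print(k, pass_at_k(64, 5, k))
```

An aggregate score would average this quantity over problems. Pass@1 reduces to the plain fraction of correct samples, while larger k rewards coverage, which is why the abstract reports several budgets.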

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

REFT's first-token uniform sampling is a low-overhead diversification tweak that reports Pass@k gains, but the decoupling claim rests on an unmeasured assumption.

The paper's main contribution is a targeted change to rollout generation in RLVR: after the reasoning marker, sample the first token uniformly from the policy's own top-N candidates and spread the rollouts evenly. Everything else in the training loop stays the same. They show this lifts aggregate Pass@1, Pass@8, and Pass@64 over DAPO and GRPO baselines on four models (0.5B–7B) and three difficulty levels.

The position they pick is new relative to temperature or prefix tricks in the cited work, and the uniform allocation is a clean, cheap intervention. If the gains hold under proper controls, it gives practitioners a simple lever for rollout coverage.

The weakest part is the supporting claim that the first-token distribution is sharply peaked yet correctness-decoupled. The abstract asserts this without reporting any correlation, conditional pass rates, or mutual information between first token and verifier outcome. If that link exists even modestly, uniform sampling changes the expected reward distribution instead of preserving it, which would reframe the results. The stress-test note is right to flag this; without ablations or measurements in the full text, the justification stays thin.

This is for people already running RLVR pipelines on reasoning models who want a low-cost diversity knob. A reader in that niche can extract the method and test it directly.

Send it to review so the experimental details and the decoupling evidence get checked properly.

Referee Report

2 major / 0 minor

Summary. The paper claims that in RLVR the first token after the reasoning marker exhibits a sharply peaked yet correctness-decoupled distribution. It introduces REFT, which samples first tokens uniformly from the policy's top-N candidates while allocating rollouts evenly, leaving other pipeline components unchanged. This is asserted to broaden rollout coverage without altering the correctness signal and yields improved aggregate Pass@1, Pass@8, and Pass@64 over DAPO and GRPO baselines across four base models (0.5B-7B) and three difficulty regimes.

Significance. If the decoupling holds and the gains are robust, REFT supplies a computationally light, pipeline-compatible way to increase rollout diversity at a structurally distinguished position. This directly targets a recognized bottleneck in RLVR and could be practically useful for training reasoning models, provided the empirical improvements survive controls and the assumption is measured.

major comments (2)

[Abstract] The central justification rests on the claim that the first-token distribution is 'sharply peaked yet correctness-decoupled,' yet the manuscript supplies no supporting measurement (correlation, mutual information, or conditional pass rate between first-token identity and final verifier outcome). Without this measurement, the assertion that uniform top-N sampling preserves the reward distribution while only broadening coverage remains unsubstantiated.
[Abstract / experimental description] The reported consistent gains across models and regimes are presented without any information on the number of runs, statistical significance testing, or controls for confounds such as effective temperature, total compute budget, or changes in the verifier-score distribution induced by the re-sampling.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on substantiating the core claim and improving experimental reporting. We address each major comment below and will revise the manuscript accordingly.

Referee: [Abstract] The central justification rests on the claim that the first-token distribution is 'sharply peaked yet correctness-decoupled,' yet the manuscript supplies no supporting measurement (correlation, mutual information, or conditional pass rate between first-token identity and final verifier outcome). Without this measurement, the assertion that uniform top-N sampling preserves the reward distribution while only broadening coverage remains unsubstantiated.

Authors: We agree that the current manuscript does not provide explicit quantitative measurements to support the correctness-decoupling claim. In the revision we will add Pearson correlation, mutual information, and conditional pass rates between first-token identity and verifier outcome, stratified across models and difficulty levels, to demonstrate that uniform top-N sampling broadens coverage without biasing the reward distribution.
Revision: yes

Referee: [Abstract / experimental description] The reported consistent gains across models and regimes are presented without any information on the number of runs, statistical significance testing, or controls for confounds such as effective temperature, total compute budget, or changes in the verifier-score distribution induced by the re-sampling.

Authors: We acknowledge that these experimental details are absent. The revised manuscript will report the number of independent runs with seeds, include statistical significance tests (e.g., paired t-tests), and add controls confirming matched effective temperature, total compute budget, and verifier-score distributions between REFT and baselines.
Revision: yes
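
Two of the decoupling measurements named in the first exchange, conditional pass rates and mutual information, can be sketched from logged rollouts. This is an illustrative sketch under stated assumptions: each record is a hypothetical `(first_token, correct)` pair, the function names are invented here, and the mutual information is the simple plug-in estimate, which is biased upward on small samples.

```python
import math
from collections import Counter


def conditional_pass_rates(records):
    """Fraction of correct rollouts for each first token."""
    totals, passes = Counter(), Counter()
    for token, correct in records:
        totals[token] += 1
        passes[token] += int(bool(correct))
    return {token: passes[token] / totals[token] for token in totals}


def mutual_information_bits(records):
    """Plug-in estimate of I(first token; correctness) in bits."""
    n = len(records)
    if n == 0:
        raise ValueError("need at least one record")
    joint = Counter((token, bool(correct)) for token, correct in records)
    token_counts = Counter(token for token, _ in records)
    outcome_counts = Counter(bool(correct) for _, correct in records)
    mi = 0.0
    for (token, outcome), count in joint.items():
        p_joint = count / n
        p_indep = (token_counts[token] / n) * (outcome_counts[outcome] / n)
        mi += p_joint * math.log2(p_joint / p_indep)
    return mi


# Toy logs: correctness is independent of the first token in `decoupled`
# and fully determined by it in `coupled`.
decoupled = [("To", 1), ("To", 0), ("Let", 1), ("Let", 0)]
coupled = [("To", 1), ("To", 1), ("Let", 0), ("Let", 0)]
print(conditional_pass_rates(decoupled), mutual_information_bits(decoupled))
print(conditional_pass_rates(coupled), mutual_information_bits(coupled))
```

Flat conditional pass rates and mutual information near zero would be consistent with the decoupling premise, while values approaching the entropy of the outcome would contradict it.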

Circularity Check

0 steps flagged

No significant circularity; empirical method stands on experimental results

The paper presents REFT as a lightweight empirical modification to RLVR rollouts, asserting an observed first-token distribution property and reporting measured Pass@K gains on multiple models. No derivation chain, equations, fitted parameters renamed as predictions, or self-citation load-bearing steps exist that would reduce the reported improvements to re-expressions of the inputs by construction. The central justification rests on experimental outcomes rather than any mathematical reduction or ansatz smuggled via citation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review performed on abstract only; ledger entries are therefore limited to statements explicitly present in the provided text.

axioms (1)

domain assumption: The first-token distribution is sharply peaked yet correctness-decoupled.
Invoked to justify that uniform sampling broadens coverage without harming the verifier signal.

pith-pipeline@v0.9.1-grok · 5734 in / 1211 out tokens · 30374 ms · 2026-06-29T11:52:24.520059+00:00 · methodology

Reference graph

Works this paper leans on

49 extracted references · 25 canonical work pages · 11 internal anchors

[1] A. Albalak, D. Phung, N. Lile, R. Rafailov, K. Gandhi, L. Castricato, A. Singh, C. Blagden, V. Xiang, D. Mahan, et al. Big-math: A large-scale, high-quality math dataset for reinforcement learning in language models. arXiv preprint arXiv:2502.17387, 2025.
[2] B. Bai, X. Wang, P. Ye, and T. Chen. Learning to explore with parameter-space noise: A deep dive into parameter-space noise for reinforcement learning with verifiable rewards. arXiv preprint arXiv:2602.02555, 2026.
[3] C.-C. Chang, S. Zhu, Z. Zeng, H. Lin, J. You, M. S. Abdelfattah, Z. Jiang, and X. Qian. Srt: Accelerating reinforcement learning via speculative rollout with tree-structured cache. arXiv preprint arXiv:2601.09083, 2026.
[4] Z. Chen, H. Liu, Y. Zhou, H. Zheng, and B. Chen. Jackpot: Optimal budgeted rejection sampling for extreme actor-policy mismatch reinforcement learning. arXiv preprint arXiv:2602.06107, 2026.
[5] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
[6] X. Dang, C. Baek, J. Z. Kolter, and A. Raghunathan. Assessing diversity collapse in reasoning. In ICLR 2025 Workshop on SSI-FM, 2025.
[7] Y. Fang, J. Lin, X. Fu, C. Qin, H. Shi, C. Hu, L. Pan, K. Zeng, and X. Cai. How to allocate, how to learn? dynamic rollout allocation and advantage modulation for policy optimization. arXiv preprint arXiv:2602.19208, 2026.
[8] A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
[9] H. Gu, H. Wang, J. Liu, L. Li, Q. Zhu, B. Liu, B. Xu, L. Wang, X. Yang, S. Lin, et al. Qarl: Rollout-aligned quantization-aware rl for fast and stable training under training–inference mismatch. arXiv preprint arXiv:2604.07853, 2026.
[10] D. Guo, D. Yang, H. Zhang, J. Song, P. Wang, Q. Zhu, R. Xu, R. Zhang, S. Ma, X. Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
[11] Z. Hou, Z. Hu, Y. Li, R. Lu, J. Tang, and Y. Dong. TreeRL: LLM reinforcement learning with on-policy tree search. In ACL, 2025.
[12] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen. LoRA: Low-rank adaptation of large language models. In ICLR, 2022.
[13] J. Hu, Y. Zhang, Q. Han, D. Jiang, X. Zhang, and H.-Y. Shum. Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model. In NeurIPS, 2025.
[14] Z. Hu, S. Zhang, Y. Li, J. Yan, X. Hu, L. Cui, X. Qu, C. Chen, Y. Cheng, and Z. Wang. Diversity-incentivized exploration for versatile reasoning. In ICLR, 2026.
[15] B. Huang and X. Wan. Pros: Towards compute-efficient rlvr via rollout prefix reuse. In ICLR, 2026.
[16] W. Huang, Y. Ge, S. Yang, Y. Xiao, H. Mao, Y. Lin, H. Ye, S. Liu, K. C. Cheung, H. Yin, et al. Qerl: Beyond efficiency–quantization-enhanced reinforcement learning for llms. In ICLR, 2026.
[17] C. Karouzos, X. Tan, and N. Aletras. Where does output diversity collapse in post-training? arXiv preprint arXiv:2604.16027, 2026.
[18] R. Kirk, I. Mediratta, C. Nalmpantis, J. Luketina, E. Hambro, E. Grefenstette, and R. Raileanu. Understanding the effects of rlhf on llm generalisation and diversity. In ICLR, 2024.
[19] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica. Efficient memory management for large language model serving with pagedattention. In SOSP, 2023.
[20] T.-L. V. Le, M. Jeon, K. Vu, V. Lai, and E. Yang. No prompt left behind: Exploiting zero-variance prompts in llm reinforcement learning via entropy-guided advantage shaping. In ICLR, 2026.
[21] J. Li, E. Beeching, L. Tunstall, B. Lipkin, R. Soletskyi, S. Huang, K. Rasul, L. Yu, A. Q. Jiang, Z. Shen, et al. Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions. https://huggingface.co/datasets/AI-MO/NuminaMath-TIR, 2024. Hugging Face dataset repository.
[22] H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe. Let's verify step by step. In ICLR, 2024.
[23] B. Liu, A. Wang, Z. Min, L. Yao, H. Zhang, Y. Liu, A. Zeng, and J. Su. Spec-rl: Accelerating on-policy reinforcement learning via speculative rollouts. arXiv preprint arXiv:2509.23232, 2025.
[24] C. Liu, J. Liang, Y. Jia, B. Cao, Y. Bai, H. Huang, and X. Chen. Explore data left behind in reinforcement learning for reasoning language models. arXiv preprint arXiv:2511.04800, 2025.
[25] S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He. Zero: memory optimizations toward training trillion parameter models. In SC20, 2020.
[26] N. Reimers and I. Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. In EMNLP, 2019.
[27] A. Setlur, Z. Wang, A. Cohen, P. Rashidinejad, and S. M. Xie. Reuse your flops: Scaling rl on hard problems by conditioning on very off-policy prefixes. arXiv preprint arXiv:2601.18795, 2026.
[28] Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
[29] K. Song, X. Tan, T. Qin, J. Lu, and T.-Y. Liu. Mpnet: Masked and permuted pre-training for language understanding. In NeurIPS, 2020.
[30] Y. Song, J. Kempe, and R. Munos. Outcome-based exploration for llm reasoning. arXiv preprint arXiv:2509.06941, 2025.
[31] Z. Wan, Y. Shen, Z. Dou, D. Zhou, Y. Zhang, X. Wang, H. Shen, J. Xiong, C. Tao, Z. Zhong, et al. Dsdr: Dual-scale diversity regularization for exploration in llm reasoning. arXiv preprint arXiv:2602.19895, 2026.
[32] S. Wang, L. Yu, C. Gao, C. Zheng, S. Liu, R. Lu, K. Dang, X.-H. Chen, J. Yang, Z. Zhang, et al. Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for llm reasoning. In NeurIPS, 2025.
[33] L. Wei, Y. Zhang, Z. Zhang, Z. Wang, S. Zhao, T. Huang, H. Zhao, C. Liu, S. Zhang, and J. Yan. Entropy-tree: Tree-based decoding with entropy-guided exploration. arXiv preprint arXiv:2601.15296, 2026.
[34] F. Wu, W. Xuan, X. Lu, M. Liu, Y. Dong, Z. Harchaoui, and Y. Choi. The invisible leash: Why rlvr may or may not escape its origin. arXiv preprint arXiv:2507.14843, 2025.
[35] S. Xing, S. Wang, C. Yang, X. Dai, and X. Ren. Lookahead tree-based rollouts for enhanced trajectory-level exploration in reinforcement learning with verifiable rewards. In ICLR, 2026.
[36] Y. E. Xu, Y. Savani, F. Fang, and J. Z. Kolter. Not all rollouts are useful: Down-sampling rollouts in llm reinforcement learning. arXiv preprint arXiv:2504.13818, 2025.
[37] A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024.
[38] C. Yang, L. Gui, C. Yang, V. Veitch, L. Zhang, and Z. Zhao. Let it calm: Exploratory annealed decoding for verifiable reinforcement learning. arXiv preprint arXiv:2510.05251, 2025.
[39] Q. Yu, Z. Zhang, R. Zhu, Y. Yuan, X. Zuo, Y. Yue, W. Dai, T. Fan, G. Liu, J. Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. In NeurIPS, 2025.
[40] S. Yu, L. Li, W. Zhao, and Z. Yang. Erpo: Token-level entropy-regulated policy optimization for large reasoning models. arXiv preprint arXiv:2603.28204, 2026.
[41] Y. Yue, Z. Chen, R. Lu, A. Zhao, Z. Wang, S. Song, and G. Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model? In NeurIPS, 2025.
[42] Y. Zhang and Math-AI. American invitational mathematics examination (aime) 2024. https://huggingface.co/datasets/math-ai/aime24, 2024. Hugging Face dataset repository.
[43] Y. Zhang and Math-AI. American invitational mathematics examination (aime) 2025. https://huggingface.co/datasets/math-ai/aime25, 2025. Hugging Face dataset repository.
[44] Y. Zhang, W. Yao, C. Yu, Y. Liu, Q. Yin, B. Yin, H. Yun, and L. Li. Improving sampling efficiency in rlvr through adaptive rollout and response reuse. arXiv preprint arXiv:2509.25808, 2025.
[45] Z. Zhao, Z. Ren, J. Zou, L. Yang, Z. Xu, X. Ge, Z. Chen, X. Ma, D. Shi, S. Wang, et al. Reinforced efficient reasoning via semantically diverse exploration. arXiv preprint arXiv:2601.05053, 2026.
[46] H. Zheng, Y. Zhou, B. R. Bartoldson, B. Kailkhura, F. Lai, J. Zhao, and B. Chen. Act only when it pays: Efficient reinforcement learning for llm reasoning via selective rollouts. In NeurIPS, 2025.
[47] X. Zhu, M. Xia, Z. Wei, W.-L. Chen, D. Chen, and Y. Meng. The surprising effectiveness of negative reinforcement in llm reasoning. In NeurIPS, 2025.
[48] H. Zhuang, Y. Zhou, T. Guo, Y. Huang, F. Liu, K. Song, and X. Zhang. Exploring multi-temperature strategies for token- and rollout-level control in rlvr. arXiv preprint arXiv:2510.08892, 2025.

A Experimental Details
A.1 RL Training and Evaluation Details
We evaluate REFT as a drop-in rollout-sampling modification on top of DAPO [39] and GRPO [28]. The ...

[49] John drinks water at the following times: breakfast, lunch, dinner, and before bed. 2. Therefore, John drinks water 4 times a day. 3. There are 5 weekdays and 2 weekend days. 4. On weekdays: 5×4 = 20 glasses. 5. On weekends, he switches from water to soda for dinner, so he drinks 3 glasses each weekend day, 2×3 = 6 glasses. 20 + 6 = 26 glasses. </think> <answer...

[1] [1]

Albalak, D

A. Albalak, D. Phung, N. Lile, R. Rafailov, K. Gandhi, L. Castricato, A. Singh, C. Blagden, V . Xiang, D. Mahan, et al. Big-math: A large-scale, high-quality math dataset for reinforcement learning in language models.arXiv preprint arXiv:2502.17387, 2025

work page arXiv 2025

[2] [2]

B. Bai, X. Wang, P. Ye, and T. Chen. Learning to explore with parameter-space noise: A deep dive into parameter-space noise for reinforcement learning with verifiable rewards.arXiv preprint arXiv:2602.02555, 2026

work page arXiv 2026

[3] [3]

Chang, S

C.-C. Chang, S. Zhu, Z. Zeng, H. Lin, J. You, M. S. Abdelfattah, Z. Jiang, and X. Qian. Srt: Accelerating reinforcement learning via speculative rollout with tree-structured cache.arXiv preprint arXiv:2601.09083, 2026

work page arXiv 2026

[4] [4]

Z. Chen, H. Liu, Y . Zhou, H. Zheng, and B. Chen. Jackpot: Optimal budgeted rejection sampling for extreme actor-policy mismatch reinforcement learning.arXiv preprint arXiv:2602.06107, 2026

work page arXiv 2026

[5] [5]

Training Verifiers to Solve Math Word Problems

K. Cobbe, V . Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[6] [6]

X. Dang, C. Baek, J. Z. Kolter, and A. Raghunathan. Assessing diversity collapse in reasoning. InICLR 2025 Workshop on SSI-FM, 2025

2025

[7] [7]

Y . Fang, J. Lin, X. Fu, C. Qin, H. Shi, C. Hu, L. Pan, K. Zeng, and X. Cai. How to allocate, how to learn? dynamic rollout allocation and advantage modulation for policy optimization.arXiv preprint arXiv:2602.19208, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[8] [8]

The Llama 3 Herd of Models

A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[9] [9]

H. Gu, H. Wang, J. Liu, L. Li, Q. Zhu, B. Liu, B. Xu, L. Wang, X. Yang, S. Lin, et al. Qarl: Rollout-aligned quantization-aware rl for fast and stable training under training–inference mismatch.arXiv preprint arXiv:2604.07853, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[10] [10]

D. Guo, D. Yang, H. Zhang, J. Song, P. Wang, Q. Zhu, R. Xu, R. Zhang, S. Ma, X. Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[11] [11]

Z. Hou, Z. Hu, Y . Li, R. Lu, J. Tang, and Y . Dong. TreeRL: LLM reinforcement learning with on-policy tree search. InACL, 2025

2025

[12] [12]

E. J. Hu, yelong shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, and W. Chen. LoRA: Low-rank adaptation of large language models. InICLR, 2022

2022

[13] [13]

J. Hu, Y . Zhang, Q. Han, D. Jiang, X. Zhang, and H.-Y . Shum. Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model. InNeurIPS, 2025

2025

[14] [14]

Z. Hu, S. Zhang, Y . Li, J. Yan, X. Hu, L. Cui, X. Qu, C. Chen, Y . Cheng, and Z. Wang. Diversity-incentivized exploration for versatile reasoning. InICLR, 2026

2026

[15] [15]

Huang and X

B. Huang and X. Wan. Pros: Towards compute-efficient rlvr via rollout prefix reuse. InICLR, 2026

2026

[16] [16]

Huang, Y

W. Huang, Y . Ge, S. Yang, Y . Xiao, H. Mao, Y . Lin, H. Ye, S. Liu, K. C. Cheung, H. Yin, et al. Qerl: Beyond efficiency–quantization-enhanced reinforcement learning for llms. InICLR, 2026

2026

[17] [17]

Where does output diversity collapse in post-training?

C. Karouzos, X. Tan, and N. Aletras. Where does output diversity collapse in post-training? arXiv preprint arXiv:2604.16027, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[18] [18]

R. Kirk, I. Mediratta, C. Nalmpantis, J. Luketina, E. Hambro, E. Grefenstette, and R. Raileanu. Understanding the effects of rlhf on llm generalisation and diversity. InICLR, 2024. 10

2024

[19] [19]

W. Kwon, Z. Li, S. Zhuang, Y . Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica. Efficient memory management for large language model serving with pagedattention. InSOSP, 2023

2023

[20] [20]

T.-L. V . Le, M. Jeon, K. Vu, V . Lai, and E. Yang. No prompt left behind: Exploiting zero- variance prompts in llm reinforcement learning via entropy-guided advantage shaping. InICLR, 2026

2026

[21] [21]

J. Li, E. Beeching, L. Tunstall, B. Lipkin, R. Soletskyi, S. Huang, K. Rasul, L. Yu, A. Q. Jiang, Z. Shen, et al. Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions. https://huggingface.co/datasets/AI-MO/ NuminaMath-TIR, 2024. Hugging Face dataset repository

[22]

H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe. Let’s verify step by step. In ICLR, 2024.

[23]

B. Liu, A. Wang, Z. Min, L. Yao, H. Zhang, Y. Liu, A. Zeng, and J. Su. Spec-rl: Accelerating on-policy reinforcement learning via speculative rollouts. arXiv preprint arXiv:2509.23232, 2025.

[24]

C. Liu, J. Liang, Y. Jia, B. Cao, Y. Bai, H. Huang, and X. Chen. Explore data left behind in reinforcement learning for reasoning language models. arXiv preprint arXiv:2511.04800, 2025.

[25]

S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He. Zero: memory optimizations toward training trillion parameter models. In SC20, 2020.

[26]

N. Reimers and I. Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. In EMNLP, 2019.

[27]

A. Setlur, Z. Wang, A. Cohen, P. Rashidinejad, and S. M. Xie. Reuse your flops: Scaling rl on hard problems by conditioning on very off-policy prefixes. arXiv preprint arXiv:2601.18795, 2026.

[28]

Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

[29]

K. Song, X. Tan, T. Qin, J. Lu, and T.-Y. Liu. Mpnet: Masked and permuted pre-training for language understanding. In NeurIPS, 2020.

[30]

Y. Song, J. Kempe, and R. Munos. Outcome-based exploration for llm reasoning. arXiv preprint arXiv:2509.06941, 2025.

[31]

Z. Wan, Y. Shen, Z. Dou, D. Zhou, Y. Zhang, X. Wang, H. Shen, J. Xiong, C. Tao, Z. Zhong, et al. Dsdr: Dual-scale diversity regularization for exploration in llm reasoning. arXiv preprint arXiv:2602.19895, 2026.

[32]

S. Wang, L. Yu, C. Gao, C. Zheng, S. Liu, R. Lu, K. Dang, X.-H. Chen, J. Yang, Z. Zhang, et al. Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for llm reasoning. In NeurIPS, 2025.

[33]

L. Wei, Y. Zhang, Z. Zhang, Z. Wang, S. Zhao, T. Huang, H. Zhao, C. Liu, S. Zhang, and J. Yan. Entropy-tree: Tree-based decoding with entropy-guided exploration. arXiv preprint arXiv:2601.15296, 2026.

[34]

F. Wu, W. Xuan, X. Lu, M. Liu, Y. Dong, Z. Harchaoui, and Y. Choi. The invisible leash: Why rlvr may or may not escape its origin. arXiv preprint arXiv:2507.14843, 2025.

[35]

S. Xing, S. Wang, C. Yang, X. Dai, and X. Ren. Lookahead tree-based rollouts for enhanced trajectory-level exploration in reinforcement learning with verifiable rewards. In ICLR, 2026.

[36]

Y. E. Xu, Y. Savani, F. Fang, and J. Z. Kolter. Not all rollouts are useful: Down-sampling rollouts in llm reinforcement learning. arXiv preprint arXiv:2504.13818, 2025.

[37]

A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024.

[38]

C. Yang, L. Gui, C. Yang, V. Veitch, L. Zhang, and Z. Zhao. Let it calm: Exploratory annealed decoding for verifiable reinforcement learning. arXiv preprint arXiv:2510.05251, 2025.

[39]

Q. Yu, Z. Zhang, R. Zhu, Y. Yuan, X. Zuo, Y. Yue, W. Dai, T. Fan, G. Liu, J. Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. In NeurIPS, 2025.

[40]

S. Yu, L. Li, W. Zhao, and Z. Yang. Erpo: Token-level entropy-regulated policy optimization for large reasoning models. arXiv preprint arXiv:2603.28204, 2026.

[41]

Y. Yue, Z. Chen, R. Lu, A. Zhao, Z. Wang, S. Song, and G. Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model? In NeurIPS, 2025.

[42]

Y. Zhang and Math-AI. American invitational mathematics examination (aime) 2024. https://huggingface.co/datasets/math-ai/aime24, 2024. Hugging Face dataset repository.

[43]

Y. Zhang and Math-AI. American invitational mathematics examination (aime) 2025. https://huggingface.co/datasets/math-ai/aime25, 2025. Hugging Face dataset repository.

[44]

Y. Zhang, W. Yao, C. Yu, Y. Liu, Q. Yin, B. Yin, H. Yun, and L. Li. Improving sampling efficiency in rlvr through adaptive rollout and response reuse. arXiv preprint arXiv:2509.25808, 2025.

[45]

Z. Zhao, Z. Ren, J. Zou, L. Yang, Z. Xu, X. Ge, Z. Chen, X. Ma, D. Shi, S. Wang, et al. Reinforced efficient reasoning via semantically diverse exploration. arXiv preprint arXiv:2601.05053, 2026.

[46]

H. Zheng, Y. Zhou, B. R. Bartoldson, B. Kailkhura, F. Lai, J. Zhao, and B. Chen. Act only when it pays: Efficient reinforcement learning for llm reasoning via selective rollouts. In NeurIPS, 2025.

[47]

X. Zhu, M. Xia, Z. Wei, W.-L. Chen, D. Chen, and Y. Meng. The surprising effectiveness of negative reinforcement in llm reasoning. In NeurIPS, 2025.

[48]

H. Zhuang, Y. Zhou, T. Guo, Y. Huang, F. Liu, K. Song, and X. Zhang. Exploring multi-temperature strategies for token-and rollout-level control in rlvr. arXiv preprint arXiv:2510.08892, 2025.

A Experimental Details

A.1 RL Training and Evaluation Details

We evaluate REFT as a drop-in rollout-sampling modification on top of DAPO [39] and GRPO [28]. The ...


[49]

John drinks water at the following times: breakfast, lunch, dinner, and before bed. 2. Therefore, John drinks water 4 times a day. 3. There are 5 weekdays and 2 weekend days. 4. On weekdays: 5×4 = 20 glasses. 5. On weekends, he switches from water to soda for dinner, so he drinks 3 glasses each weekend day, 2×3 = 6 glasses. 20 + 6 = 26 glasses. </think> <answer...