
arXiv: 2606.06080 · v1 · pith:DQ2PYM75new · submitted 2026-06-04 · 💻 cs.LG · cs.AI · cs.CL

On Advantage Estimates for Max@K Policy Gradients

Pith reviewed 2026-06-28 02:18 UTC · model grok-4.3

classification: cs.LG, cs.AI, cs.CL
keywords: policy gradient, advantage estimation, max@K, Leave-Two-Out, reinforcement learning, LLM post-training, gradient variance reduction, centered advantages

The pith

The Leave-Two-Out baseline makes realized batch advantages exactly centered for max@K while preserving policy-gradient unbiasedness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard advantage estimators for max@K objectives are unbiased but produce non-centered advantages in finite batches. It introduces a Leave-Two-Out baseline that achieves exact centering without losing unbiasedness, yielding the MaxPO method with quadratic-time computation. This approach integrates with group-based RL for LLM post-training and empirically lowers gradient variance. A sympathetic reader would care because centered advantages can stabilize optimization in sparse-reward settings like reasoning model training.

Core claim

Starting from the advantage estimator of a leading method in the field, we show that it is policy-gradient unbiased but yields a non-centered advantage. We then introduce a Leave-Two-Out baseline that preserves policy-gradient unbiasedness while making realized batch advantages exactly centered. The resulting method, MaxPO, has an efficient quadratic-time implementation and integrates naturally into group-based RL for LLM post-training. We further derive the canonical finite-batch advantage for max@K, providing a unified view of existing advantage estimators. Empirically, we verify that the L2O baseline reduces gradient variance and outperforms non-centered alternatives.

What carries the argument

The Leave-Two-Out (L2O) baseline: a baseline built from leave-two-out estimates that makes realized batch advantages exactly centered while keeping the policy gradient for the max@K objective unbiased.
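To make "centered" concrete: a batch of advantages is centered when it sums to zero. The sketch below contrasts two textbook leave-one-out constructions on a toy reward batch. Neither is the paper's Leave-Two-Out baseline, whose formula is not reproduced on this page; the function names, the toy rewards, and the choice of group size K equal to the batch size are illustrative assumptions.

```python
def loo_mean_advantages(rewards):
    """r_i minus the mean of the other rewards (leave-one-out mean baseline)."""
    n = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


def loo_max_advantages(rewards):
    """Batch max minus the max of the other rewards.

    Illustrative stand-in for a max@K-style signal with K = len(rewards);
    an assumption for this sketch, not the estimator analysed in the paper.
    """
    batch_max = max(rewards)
    advantages = []
    for i in range(len(rewards)):
        others = rewards[:i] + rewards[i + 1:]
        advantages.append(batch_max - max(others))
    return advantages


rewards = [0.0, 1.0, 0.25, 0.5]  # hypothetical batch of four rewards
mean_adv = loo_mean_advantages(rewards)
max_adv = loo_max_advantages(rewards)

print(sum(mean_adv))  # zero up to floating-point rounding: centered
print(sum(max_adv))   # 0.5: every entry is non-negative, so not centered
```

In both functions the baseline for sample i is computed without sample i, which is the usual condition for a baseline to leave the policy gradient unbiased. Only the first construction happens to sum to zero, which is the gap the paper's baseline is said to close for max@K.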

If this is right

  • The MaxPO method reduces gradient variance compared to non-centered estimators.
  • The canonical finite-batch advantage unifies existing estimators for max@K.
  • MaxPO integrates naturally into group-based RL training for LLMs.
  • The L2O baseline has an efficient quadratic-time implementation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the variance reduction holds across tasks, MaxPO could enable more stable training with smaller batch sizes.
  • The centering property might generalize to other group-based sampling methods in RL beyond max@K.
  • Future work could test whether the unbiased centered advantages improve convergence speed on downstream reasoning benchmarks.

Load-bearing premise

The observed reduction in gradient variance from using the centered L2O baseline will translate into improved final performance on downstream tasks without introducing instabilities.

What would settle it

Running the same LLM post-training experiment with MaxPO versus a non-centered baseline and measuring the difference in final model accuracy or pass@K scores on held-out reasoning tasks would test if the centering improves outcomes.
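For the pass@K side of such a comparison, one widely used choice is the unbiased estimator popularized by Chen et al. (2021) in "Evaluating Large Language Models Trained on Code": with n samples per problem, of which c are correct, pass@k is estimated as 1 − C(n−c, k)/C(n, k). The sketch below is a minimal version; the function name and the example counts are hypothetical and not taken from the paper.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples with c correct (requires k <= n)."""
    if k > n:
        raise ValueError("k must not exceed n")
    # comb returns 0 when n - c < k, i.e. every k-subset contains a correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts for one problem: 256 samples, 3 of them correct.
for k in (1, 16, 256):
    print(k, pass_at_k(256, 3, k))
```

Averaging this quantity over held-out problems for each training method, at a fixed sampling budget, gives the per-method curves that such a comparison would need.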

Figures

Figures reproduced from arXiv: 2606.06080 by Gouki Minegishi, Kohsei Matsutani, Paavo Parmas, Shota Takashiro, Soichiro Nishimori, Takeshi Kojima, Yongmin Kim, Yusuke Iwasawa, Yutaka Matsuo.

Figure 1: Estimation error (left), variance vs. action space size (center), and variance vs. …
Figure 2: Task-average pass@k (k ≤ 256). Unweighted average over AIME24, AIME25, AMC23, MATH500, and Minerva (temperature 0.6, top-p 0.95). Our method demonstrates consistent improvement over strong baselines.
§5.1 Bandits. Here, we validate the two theoretical properties of the L2O baseline: (1) unbiasedness as a PG estimator and (2) variance reduction over the raw EI estimator due to centering the advantage. Settin…
Figure 3: Adam-moment variance proxy during training (3 seeds). The proxy is estimated from Adam states via $\mathrm{Var}(g) \approx \hat{v}_t - \hat{m}_t^2$ and aggregated across parameters. Our method (EI+L2O) reduces this variance proxy compared to PKPO across training, supporting the proposed variance-reduction mechanism.
To test our theoretical motivation, we measure a smoothed proxy for the variance of model gradients during RL traini…
Figure 4: Variance vs. action space size. Here, we report additional results on variance versus action space size (…
Figure 5: Variance vs. K. Here, we report additional results on variance versus K (…
Figure 6: Structure of the Biased Maze Environment. The red line indicates the unique optimal path.
Environment. The maze is deterministic, with binary actions (0 or 1) available at each state. The agent receives a reward of +1 for each forward step along a valid path. There exists a single correct path to the goal (indicated by the red line in …
Figure 7: Moving average (window size 10) of the average return, probability of action 1 (via …
Figure 8: Per-benchmark pass@k curves. For each benchmark, we plot pass@k as a function of inference compute k for all methods. This complements the task-average curves in …
Figure 9: Empirical support dynamics relative to the base model, following the taxonomy of Wu …
Figure 10: Sensitivity to the training objective size …
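The Figure 3 caption describes a variance proxy read off Adam's running moments, Var(g) ≈ v̂_t − m̂_t². The sketch below is one minimal way to compute such a proxy from a history of gradient vectors in plain Python. The default decay rates are Adam's usual defaults, and summing over parameters is an assumption, since the caption only says "aggregated"; the function name and toy gradients are hypothetical.

```python
def adam_variance_proxy(grad_history, beta1=0.9, beta2=0.999):
    """Sum over parameters of v_hat - m_hat**2 after the last step.

    grad_history: list of gradient vectors (lists of floats), one per step.
    Summing over parameters is an assumption of this sketch.
    """
    steps = len(grad_history)
    dim = len(grad_history[0])
    m = [0.0] * dim  # exponential moving average of gradients
    v = [0.0] * dim  # exponential moving average of squared gradients
    for grad in grad_history:
        for j, g in enumerate(grad):
            m[j] = beta1 * m[j] + (1.0 - beta1) * g
            v[j] = beta2 * v[j] + (1.0 - beta2) * g * g
    # Bias correction as in Adam.
    m_hat = [mj / (1.0 - beta1 ** steps) for mj in m]
    v_hat = [vj / (1.0 - beta2 ** steps) for vj in v]
    return sum(vh - mh * mh for vh, mh in zip(v_hat, m_hat))


steady = [[0.3, -0.1]] * 100  # identical gradients at every step
noisy = [[1.0, -1.0] if s % 2 == 0 else [-1.0, 1.0] for s in range(100)]

print(adam_variance_proxy(steady))  # close to 0
print(adam_variance_proxy(noisy))   # positive, just under 2
```

Because the two moving averages use different decay rates, this is a smoothed heuristic rather than an exact variance, which matches the caption's description of it as a proxy.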
Original abstract

Reinforcement learning with verifiable rewards is widely used for post-training reasoning models, but sparse outcome rewards make exploration difficult. A complementary approach is to optimize inference-time objectives such as pass@K and max@K directly, yet existing policy-gradient estimators for these objectives use different signals, baselines, and normalizations, making their relationships unclear. We study this issue through baseline design and advantage centering. Starting from the advantage estimator of a leading method in the field, we show that it is policy-gradient unbiased but yields a non-centered advantage. We then introduce a Leave-Two-Out baseline that preserves policy-gradient unbiasedness while making realized batch advantages exactly centered. The resulting method, MaxPO, has an efficient quadratic-time implementation and integrates naturally into group-based RL for LLM post-training. We further derive the canonical finite-batch advantage for max@K, providing a unified view of existing advantage estimators. Empirically, we verify that the L2O baseline reduces gradient variance and outperforms non-centered alternatives.
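As a concrete reading of the max@K objective named in the abstract (the expected best reward among K independent samples), the sketch below estimates its value from n ≥ K sampled rewards by averaging the maximum over all K-subsets, using the standard order-statistic shortcut, and checks it against brute-force enumeration. This is a generic U-statistic estimate of the objective value only; it is not the paper's advantage estimator, and all names and rewards are illustrative.

```python
from itertools import combinations
from math import comb


def max_at_k(rewards, k):
    """Average of the max over all k-subsets, via sorted ranks (O(n log n))."""
    n = len(rewards)
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    ordered = sorted(rewards)
    # The reward at sorted position i (0-based) is the max of comb(i, k - 1) subsets.
    weighted = sum(comb(i, k - 1) * r for i, r in enumerate(ordered))
    return weighted / comb(n, k)


def max_at_k_bruteforce(rewards, k):
    """Reference implementation: enumerate every k-subset explicitly."""
    subsets = list(combinations(rewards, k))
    return sum(max(s) for s in subsets) / len(subsets)


rewards = [0.2, 0.9, 0.1, 0.5, 0.5]  # hypothetical rewards for one prompt
print(max_at_k(rewards, 2), max_at_k_bruteforce(rewards, 2))
```

With 0/1 rewards the same average is the fraction of K-subsets containing at least one success, so max@K reduces to pass@K in that case.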

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper starts from an existing advantage estimator for max@K policy gradients that is shown to be policy-gradient unbiased but non-centered. It introduces a Leave-Two-Out (L2O) baseline that preserves unbiasedness while forcing realized batch advantages to be exactly centered, yielding the MaxPO method with a quadratic-time implementation. The work also derives the canonical finite-batch advantage estimator for max@K and provides a unified view of prior estimators. Empirically, L2O is shown to reduce gradient variance and to outperform non-centered alternatives when integrated into group-based RL for LLM post-training with verifiable rewards.

Significance. If the unbiasedness and exact-centering claims are correct, the paper supplies a principled baseline design that addresses a practical difficulty in optimizing inference-time objectives under sparse outcome rewards. The derivation of the canonical estimator and the explicit centering construction are potentially useful for clarifying relationships among existing methods. The reported variance reduction is a concrete, measurable improvement, though its translation to final model quality remains the least-secured step.

major comments (2)
  1. [Empirical Evaluation / §5] Empirical Evaluation (abstract and §5): the manuscript reports that L2O reduces gradient variance and that MaxPO outperforms non-centered alternatives, yet supplies no analytic argument or controlled ablation showing that exact batch centering cannot suppress useful exploration signals under group-based sampling and sparse verifiable rewards; the link from lower-variance centered advantages to improved downstream task performance is therefore asserted rather than derived or isolated.
  2. [§3.2–3.3] §3.2–3.3: while the L2O construction is stated to preserve policy-gradient unbiasedness, the dependence introduced by the max@K operator over the group means that leaving out exactly two samples must be shown to cancel the bias term exactly; the provided derivation sketch does not explicitly verify this cancellation for the finite-batch case.
minor comments (2)
  1. [§2] Notation for the max@K objective and the group size K should be introduced once in §2 and used consistently thereafter to avoid redefinition.
  2. [§4] The quadratic-time implementation complexity is stated but not accompanied by a small worked example or pseudocode that would make the Leave-Two-Out procedure immediately reproducible.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. Below we respond point-by-point to the major concerns. We agree that both points identify areas where the manuscript can be strengthened and will incorporate revisions accordingly.

Point-by-point responses
  1. Referee: [Empirical Evaluation / §5] Empirical Evaluation (abstract and §5): the manuscript reports that L2O reduces gradient variance and that MaxPO outperforms non-centered alternatives, yet supplies no analytic argument or controlled ablation showing that exact batch centering cannot suppress useful exploration signals under group-based sampling and sparse verifiable rewards; the link from lower-variance centered advantages to improved downstream task performance is therefore asserted rather than derived or isolated.

    Authors: We agree that the current manuscript does not isolate the effect of centering on exploration via a dedicated analytic argument or controlled ablation. While the reported variance reduction and downstream gains are empirical, a direct link to preserved exploration is not formally derived. In revision we will add a controlled ablation that measures response diversity (e.g., distinct n-gram coverage and entropy within groups) under centered versus non-centered estimators on the same sampling budget, thereby providing the requested isolation between variance reduction and potential suppression of useful signals.
    Revision: yes

  2. Referee: [§3.2–3.3] §3.2–3.3: while the L2O construction is stated to preserve policy-gradient unbiasedness, the dependence introduced by the max@K operator over the group means that leaving out exactly two samples must be shown to cancel the bias term exactly; the provided derivation sketch does not explicitly verify this cancellation for the finite-batch case.

    Authors: We acknowledge that the derivation sketch in §3.2–3.3 is not fully expanded for the finite-batch case. The manuscript claims unbiasedness is preserved, but the explicit cancellation of the bias term induced by the max@K operator when exactly two samples are left out is only sketched. In the revision we will replace the sketch with a complete, step-by-step algebraic verification that shows the expectation of the L2O advantage estimator equals the true policy gradient for any finite group size K ≥ 2.
    Revision: yes
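A finite-batch unbiasedness claim of the kind discussed in this exchange can also be sanity-checked numerically on a toy problem. The sketch below enumerates every K-tuple of a small softmax bandit and compares the exact expectation of a score-function estimator, using a leave-one-out max baseline, against a finite-difference gradient of E[max]. The baseline is a simple stand-in chosen for illustration and is not the paper's L2O construction; the bandit, rewards, and function names are assumptions.

```python
from itertools import product
from math import exp


def softmax(theta):
    top = max(theta)
    z = [exp(t - top) for t in theta]
    total = sum(z)
    return [x / total for x in z]


def objective(theta, rewards, k):
    """Exact E[max of k i.i.d. arm rewards] under softmax(theta)."""
    p = softmax(theta)
    value = 0.0
    for arms in product(range(len(theta)), repeat=k):
        prob = 1.0
        for a in arms:
            prob *= p[a]
        value += prob * max(rewards[a] for a in arms)
    return value


def expected_estimator(theta, rewards, k):
    """Exact expectation of sum_i grad log pi(a_i) * (max - max of the others)."""
    p = softmax(theta)
    grad = [0.0] * len(theta)
    for arms in product(range(len(theta)), repeat=k):
        prob = 1.0
        for a in arms:
            prob *= p[a]
        best = max(rewards[a] for a in arms)
        for i, a in enumerate(arms):
            others = [rewards[b] for j, b in enumerate(arms) if j != i]
            adv = best - max(others)  # baseline never looks at sample i
            for d in range(len(theta)):
                score = (1.0 if d == a else 0.0) - p[d]  # d log pi(a) / d theta_d
                grad[d] += prob * adv * score
    return grad


def finite_difference(theta, rewards, k, eps=1e-6):
    """Central-difference gradient of the exact objective."""
    grad = []
    for d in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[d] += eps
        down[d] -= eps
        grad.append((objective(up, rewards, k) - objective(down, rewards, k)) / (2 * eps))
    return grad


theta, rewards, k = [0.3, -0.2, 0.1], [0.0, 1.0, 0.4], 3  # toy setup, k >= 2
print(expected_estimator(theta, rewards, k))
print(finite_difference(theta, rewards, k))
```

The two printed vectors should agree up to finite-difference error. A numerical check like this complements an algebraic proof but does not replace it, since it covers only one small instance.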

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained from existing estimators and new constructions.

full rationale

The paper begins with the advantage estimator from a leading method in the field, establishes that it is policy-gradient unbiased but non-centered, then constructs a Leave-Two-Out baseline that preserves unbiasedness while enforcing exact centering on realized batch advantages. It further derives the canonical finite-batch advantage for max@K and provides an efficient implementation for MaxPO. These steps rely on standard policy-gradient theory and explicit algebraic constructions rather than fitting parameters to the target result or reducing to self-citations. No load-bearing step equates a prediction or uniqueness claim to its own inputs by definition, and the empirical verification of variance reduction is presented as an independent check. The derivation chain therefore remains independent of the final performance claims.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented entities are stated. Standard RL assumptions (unbiased policy gradient, finite-batch sampling) are implicitly used but not enumerated.



Reference graph

Works this paper leans on

75 extracted references · 44 canonical work pages · 21 internal anchors

  1. Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235–256, 2002.

  2. Farid Bagirov, Mikhail Arkhipov, Ksenia Sycheva, Evgeniy Glukhov, and Egor Bogomolov. The best of N worlds: Aligning reinforcement learning with best-of-N sampling via max@k optimisation. arXiv preprint arXiv:2510.23393, 2025.

  3. Chenjia Bai, Yang Zhang, Shuang Qiu, Qiaosheng Zhang, Kang Xu, and Xuelong Li. Online preference alignment for language models via count-based exploration. arXiv preprint arXiv:2501.12735, 2025.

  4. Dake Bu, Wei Huang, Andi Han, Atsushi Nitanda, Bo Xue, Qingfu Zhang, Hau-San Wong, and Taiji Suzuki. Consistency is not always correct: Towards understanding the role of exploration in post-training reasoning. arXiv preprint arXiv:2511.07368, 2025.

  5. Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network distillation. arXiv preprint arXiv:1810.12894, 2018.

  6. Fan Chen, Audrey Huang, Noah Golowich, Sadhika Malladi, Adam Block, Jordan T. Ash, Akshay Krishnamurthy, and Dylan J. Foster. The coverage principle: How pre-training enables post-training. arXiv preprint arXiv:2510.15020, 2025.

  7. Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.

  8. Zhipeng Chen, Xiaobo Qin, Youbin Wu, Yue Ling, Qinghao Ye, Wayne Xin Zhao, and Guang Shi. Pass@k training for adaptively balancing exploration and exploitation of large reasoning models. arXiv preprint arXiv:2508.10751, 2025.

  9. Daixuan Cheng, Shaohan Huang, Xuekai Zhu, Bo Dai, Xin Zhao, Zhenliang Zhang, and Furu Wei. Reasoning with exploration: An entropy perspective. Proceedings of the AAAI Conference on Artificial Intelligence, 40(36):30377–30385, 2026. doi: 10.1609/aaai.v40i36.40290.

  10. Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, volume 30, 2017.

  11. Wesley Chung, Valentin Thomas, Marlos C Machado, and Nicolas Le Roux. Beyond variance reduction: Understanding the true impact of baselines on policy optimization. In International Conference on Machine Learning, pages 1999–2009. PMLR, 2021.

  12. Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

  13. Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, et al. The entropy mechanism of reinforcement learning for reasoning language models. arXiv preprint arXiv:2505.22617, 2025.

  14. Xingyu Dang, Christina Baek, Kaiyue Wen, Zico Kolter, and Aditi Raghunathan. Weight ensembling improves reasoning in language models. In Second Conference on Language Modeling, 2025.

  15. Jingtong Gao, Ling Pan, Yejing Wang, Rui Zhong, Chi Lu, Qingpeng Cai, Peng Jiang, and Xiangyu Zhao. Navigate the unknown: Enhancing LLM reasoning with intrinsic motivation guided exploration. arXiv preprint arXiv:2505.17621, 2025.

  16. Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

  17. Evan Greensmith, Peter L Bartlett, and Jonathan Baxter. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):1471–1530, 2004.

  18. Shixiang Gu, Sergey Levine, Ilya Sutskever, and Andriy Mnih. MuProp: Unbiased backpropagation for stochastic neural networks. arXiv preprint arXiv:1511.05176, 2015.

  19. Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.

  20. Andre He, Daniel Fried, and Sean Welleck. Rewarding the unlikely: Lifting GRPO beyond distribution sharpening. arXiv preprint arXiv:2506.02355, 2025.

  21. Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874, 2021.

  22. Wassily Hoeffding. A class of statistics with asymptotically normal distribution. In Breakthroughs in Statistics: Foundations and Basic Theory, pages 308–334. Springer, 1992.

  23. Sihan Hu, Xiansheng Cai, Yuan Huang, Zhiyuan Yao, Linfeng Zhang, Pan Zhang, Youjin Deng, and Kun Chen. Emergent slow thinking in LLMs as inverse tree freezing. arXiv preprint arXiv:2509.23629, 2025.

  24. Yuxian Jiang, Yafu Li, Guanxu Chen, Dongrui Liu, Yu Cheng, and Jing Shao. Rethinking entropy regularization in large reasoning models. arXiv preprint arXiv:2509.25133, 2025.

  25. Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2017.

  26. Sotetsu Koyamada, Paavo Parmas, Tadashi Kozuno, and Shin Ishii. Emergence of exploration in policy gradient reinforcement learning via resetting, 2023. URL https://openreview.net/forum?id=GKsNIC_mQRG

  27. Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James Validad Miranda, Alisa Liu, Nouha Dziri, Xinxi Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Christopher Wilhelm, Luca Soldaini, Noah A. Smith, Yizhong Wang, Pradeep Dasigi, and Hannaneh...

  28. Aitor Lewkowycz, Anders Johan Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Venkatesh Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. Solving quantitative reasoning problems with language models. In Advances in Neural Information Processing Systems, 2022.

  29. Tianjian Li, Yiming Zhang, Ping Yu, Swarnadeep Saha, Daniel Khashabi, Jason Weston, Jack Lanchantin, and Tianlu Wang. Jointly reinforcing diversity and quality in language model generations. arXiv preprint arXiv:2509.02534, 2025.

  30. Zhenwen Liang, Sidi Lu, Wenhao Yu, Kishan Panaganti, Yujun Zhou, Haitao Mi, and Dong Yu. Can LLMs guide their own exploration? Gradient-guided reinforcement learning for LLM reasoning. arXiv preprint arXiv:2512.15687, 2025.

  31. Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective. In Conference on Language Modeling (COLM), 2025.

  32. Kohsei Matsutani, Shota Takashiro, Gouki Minegishi, Takeshi Kojima, Yusuke Iwasawa, and Yutaka Matsuo. RL squeezes, SFT expands: A comparative study of reasoning LLMs. In The Fourteenth International Conference on Learning Representations, 2026.

  33. Jincheng Mei, Wesley Chung, Valentin Thomas, Bo Dai, Csaba Szepesvari, and Dale Schuurmans. The role of baselines in policy gradient optimization. Advances in Neural Information Processing Systems, 35:17818–17830, 2022.

  34. Andriy Mnih and Danilo Rezende. Variational inference for Monte Carlo objectives. In International Conference on Machine Learning, pages 2188–2196. PMLR, 2016.

  35. Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937. PMLR, 2016.

  36. Soichiro Nishimori, Paavo Parmas, Sotetsu Koyamada, Tadashi Kozuno, Toshinori Kitamura, Shin Ishii, and Yutaka Matsuo. Emergence of exploration in policy gradient reinforcement learning via retrying. In Forty-third International Conference on Machine Learning, 2026. URL https://openreview.net/forum?id=NpvBAOc87E

  37. OpenAI, Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. OpenAI o1 system card. arXiv preprint arXiv:2412.16720, 2024.

  38. Paavo Parmas. Total stochastic gradient algorithms and applications in reinforcement learning. Advances in Neural Information Processing Systems, 31, 2018.

  39. Paavo Parmas and Masashi Sugiyama. A unified view of likelihood ratio and reparameterization gradients. In International Conference on Artificial Intelligence and Statistics, pages 4078–4086. PMLR, 2021.

  40. Paavo Parmas, Carl Edward Rasmussen, Jan Peters, and Kenji Doya. PIPPS: Flexible model-based policy search robust to the curse of chaos. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4065–4074. PMLR, 10–15 Jul 2018. URL https:...

  41. Ruotian Peng, Yi Ren, Zhouliang Yu, Weiyang Liu, and Yandong Wen. SimKO: Simple Pass@K policy optimization. arXiv preprint arXiv:2510.14807, 2025. (Listed on this page under the title "Beyond the Sampled Token: Preserving Candidate Support in RLVR".)

  42. Jan Peters and Stefan Schaal. Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4):682–697, 2008.

  43. John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

  44. Amrith Setlur, Matthew YR Yang, Charlie Snell, Jeremy Greer, Ian Wu, Virginia Smith, Max Simchowitz, and Aviral Kumar. e3: Learning to explore enables extrapolation of test-time compute for LLMs. arXiv preprint arXiv:2506.09026, 2025.

  45. Darsh J Shah, Peter Rushton, Somanshu Singla, Mohit Parmar, Kurt Smith, Yash Vanjani, Ashish Vaswani, Adarsh Chaluvaraju, Andrew Hojel, Andrew Ma, Anil Thomas, Anthony Polloreno, Ashish Tanwer, Burhan Drak Sibai, Divya S Mansingka, Divya Shivaprasad, Ishaan Shah, Karl Stratos, Khoi Nguyen, Michael Callahan, Michael Pust, Mrinal Iyer, Philip Monk, Platon M...

  46. Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

  47. Han Shen. On entropy control in LLM-RL algorithms. In The Fourteenth International Conference on Learning Representations, 2026.

  48. Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. HybridFlow: A flexible and efficient RLHF framework. arXiv preprint arXiv:2409.19256, 2024.

  49. Yuda Song, Julia Kempe, and Remi Munos. Outcome-based exploration for LLM reasoning. arXiv preprint arXiv:2509.06941, 2025.

  50. Yuda Song, Hanlin Zhang, Carson Eisenach, Sham M. Kakade, Dean Foster, and Udaya Ghai. Mind the gap: Examining the self-improvement capabilities of large language models. In The Thirteenth International Conference on Learning Representations, 2025.

  51. [51]

    Optimizing language models for inference time objectives using reinforcement learning

    Yunhao Tang, Kunhao Zheng, Gabriel Synnaeve, and Remi Munos. Optimizing language models for inference time objectives using reinforcement learning. InProceedings of the 42nd International Conference on Machine Learning, volume 267 ofProceedings of Machine Learning Research, pages 59066–59085. PMLR, 2025

  52. [52]

    Rebar: Low-variance, unbiased gradient estimates for discrete latent variable models.Advances in Neural Information Processing Systems, 30, 2017

    George Tucker, Andriy Mnih, Chris J Maddison, John Lawson, and Jascha Sohl-Dickstein. Rebar: Low-variance, unbiased gradient estimates for discrete latent variable models.Advances in Neural Information Processing Systems, 30, 2017

  53. [53]

    J., Krishnamurthy, A., and Ash, J

    Jens Tuyls, Dylan J Foster, Akshay Krishnamurthy, and Jordan T Ash. Representation-based ex- ploration for language models: From test-time to post-training.arXiv preprint arXiv:2510.11686, 2025

  54. [54]

    Pass@K policy optimization: Solving harder reinforcement learning problems.Advances in Neural Information Processing Systems, 38: 152416–152445, 2025

    Christian Walder and Deep Tejas Karkhanis. Pass@K policy optimization: Solving harder reinforcement learning problems.Advances in Neural Information Processing Systems, 38: 152416–152445, 2025

  55. [55]

    OctoThinker: Mid-training incentivizes reinforcement learning scaling.arXiv preprint arXiv:2506.20512, 2025

    Zengzhi Wang, Fan Zhou, Xuefeng Li, and Pengfei Liu. OctoThinker: Mid-training incentivizes reinforcement learning scaling.arXiv preprint arXiv:2506.20512, 2025

  56. [56]

    The Optimal Reward Baseline for Gradient-Based Reinforcement Learning

    Lex Weaver and Nigel Tao. The optimal reward baseline for gradient-based reinforcement learning.arXiv preprint arXiv:1301.2315, 2013

  57. [57]

    Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs

    Xumeng Wen, Zihan Liu, Shun Zheng, Zhijian Xu, Shengyu Ye, Zhirong Wu, Xiao Liang, Yang Wang, Junjie Li, Ziming Miao, Jiang Bian, and Mao Yang. Reinforcement learning with verifiable rewards implicitly incentivizes correct reasoning in base LLMs. arXiv preprint arXiv:2506.14245, 2025

  58. [58]

    Simple statistical gradient-following algorithms for connectionist reinforcement learning

    Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256, May 1992

  59. [59]

    Variance Reduction for Policy Gradient with Action-Dependent Factorized Baselines

    Cathy Wu, Aravind Rajeswaran, Yan Duan, Vikash Kumar, Alexandre M Bayen, Sham Kakade, Igor Mordatch, and Pieter Abbeel. Variance reduction for policy gradient with action-dependent factorized baselines. arXiv preprint arXiv:1803.07246, 2018

  60. [60]

    The invisible leash: Why RLVR may or may not escape its origin

    Fang Wu, Weihao Xuan, Ximing Lu, Mingjie Liu, Yi Dong, Zaid Harchaoui, and Yejin Choi. The invisible leash: Why RLVR may or may not escape its origin. arXiv preprint arXiv:2507.14843, 2025

  61. [61]

    Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement

    An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, Keming Lu, Mingfeng Xue, Runji Lin, Tianyu Liu, Xingzhang Ren, and Zhenru Zhang. Qwen2.5-math technical report: Toward mathematical expert model via self-improvement. arXiv preprint arXiv:2409.12122, 2024

  62. [62]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

  63. [63]

    Qwen2.5 Technical Report

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tianyi T...

  64. [64]

    RESTRAIN: From spurious votes to signals – self-driven rl with self-penalization

    Zhaoning Yu, Will Su, Leitian Tao, Haozhu Wang, Aashu Singh, Hanchao Yu, Jianyu Wang, Hongyang Gao, Weizhe Yuan, Jason Weston, Ping Yu, and Jing Xu. RESTRAIN: From spurious votes to signals – self-driven rl with self-penalization. arXiv preprint arXiv:2510.02172, 2025

  65. [65]

    Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model?

    Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model? In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  66. [66]

    On the interplay of pre-training, mid-training, and rl on reasoning language models

    Charlie Zhang, Graham Neubig, and Xiang Yue. On the interplay of pre-training, mid-training, and rl on reasoning language models. arXiv preprint arXiv:2512.07783, 2025

  67. [67]

    Learning to reason as action abstractions with scalable mid-training rl

    Shenao Zhang, Donghan Yu, Yihao Feng, Bowen Jin, Zhaoran Wang, John Peebles, and Zirui Wang. Learning to reason as action abstractions with scalable mid-training rl. arXiv preprint arXiv:2509.25810, 2025

  68. [68]

    Echo chamber: Rl post-training amplifies behaviors learned in pretraining

    Rosie Zhao, Alexandru Meterez, Sham Kakade, Cengiz Pehlevan, Samy Jelassi, and Eran Malach. Echo chamber: Rl post-training amplifies behaviors learned in pretraining. In Second Conference on Language Modeling, 2025

  69. [69]

    Act only when it pays: Efficient reinforcement learning for LLM reasoning via selective rollouts

    Haizhong Zheng, Yang Zhou, Brian R Bartoldson, Bhavya Kailkhura, Fan Lai, Jiawei Zhao, and Beidi Chen. Act only when it pays: Efficient reinforcement learning for LLM reasoning via selective rollouts. arXiv preprint arXiv:2506.02177, 2025

  70. [70]

    First return, entropy-eliciting explore

    Tianyu Zheng, Tianshun Xing, Qingshui Gu, Taoran Liang, Xingwei Qu, Xin Zhou, Yizhi Li, Zhoufutu Wen, Chenghua Lin, Wenhao Huang, et al. First return, entropy-eliciting explore. arXiv preprint arXiv:2507.07017, 2025

  71. [71]

    Evolving language models without labels: Majority drives selection, novelty promotes variation

    Yujun Zhou, Zhenwen Liang, Haolin Liu, Wenhao Yu, Kishan Panaganti, Linfeng Song, Dian Yu, Xiangliang Zhang, Haitao Mi, and Dong Yu. Evolving language models without labels: Majority drives selection, novelty promotes variation. arXiv preprint arXiv:2509.15194, 2025.

    A Additional Related Work

    A.1 RLVR in LLMs

    Reinforcement learning with verifiable rewar...

  72. [72]

    augmented training with a semantic novelty score computed from embeddings, Li et al. [29] employed a semantic diversity score with an external semantic comparator, and Tuyls et al. [53] computed a representation-based novelty score from hidden states to boost exploration. Liang et al

  73. [73]

    leveraged reward-model gradients to improve temperature sampling. Setlur et al. [44] promoted in-context exploration via skill asymmetries and negative gradients, enabling reliable extrapolation with increased test-time compute. Other studies directly optimize the pass@K metric; these are discussed in App. A.4.

    A.3 Policy Gradient Estimator and Baseline P...

  74. [74]

    """Computes binomial coefficient C(n, k) in log-space

    and A3C [35]. In domains involving discrete latent variables or sequence generation, where learning a separate value function is often unstable or costly, sample-based baselines have become the dominant approach. This concept was further refined in the context of variational inference by Mnih and Rezende [34] (VIMCO) and Gu et al. [18], which utilize the ave...
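    The docstring fragment at the head of this entry mentions computing the binomial coefficient C(n, k) in log-space. The paper's own routine is not recoverable from this extract, so the following is only a minimal sketch of one common way to do it, assuming the log-gamma identity log C(n, k) = lgamma(n+1) - lgamma(k+1) - lgamma(n-k+1). The names `log_binom` and `binom_ratio` are hypothetical and do not come from the source.

```python
import math


def log_binom(n: int, k: int) -> float:
    """Return log C(n, k); -inf when the coefficient is zero (k < 0 or k > n)."""
    if k < 0 or k > n:
        return float("-inf")
    # lgamma(m + 1) == log(m!), so this avoids forming huge factorials.
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def binom_ratio(n1: int, k1: int, n2: int, k2: int) -> float:
    """Return C(n1, k1) / C(n2, k2), assuming the denominator is nonzero."""
    return math.exp(log_binom(n1, k1) - log_binom(n2, k2))
```

    Working with differences of logs keeps ratios such as C(B-1, K-1) / C(B, K) finite even when the individual coefficients would overflow a float.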

  75. [75]

    BoN mean

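    1. Connection to $u_i$: Every $K$-subset $I \subseteq B$ with $i \in I$ can be written as $\{i\} \cup S$ where $|S| = K-1$. Thus:

    $$S(i, K, B) = \frac{\binom{B-1}{K-1}}{\binom{B}{K}}\, u_i = \frac{K}{B}\, u_i. \tag{83}$$

    2. Connection to $v_i$: Consider the sum over the reduced pool $U = B \setminus \{i\}$:

    $$\sum_{j \in U} S(j, K, U) = \frac{1}{\binom{B-1}{K}} \sum_{I \subseteq U,\, |I| = K} \Big( \sum_{j \in I} 1 \Big) M(I) = K v_i, \tag{84}$$

    since each $K$-subset $I$ contains exactly $K$ indices $j$. Dividing by $B-1$ yields: $\frac{1}{B-1} \sum_{j \neq i} S(j,$...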
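    The two counting identities (83) and (84) can be checked by brute force on a small pool. The sketch below is illustrative only and rests on definitions that are not visible in this extract, so they are assumptions: M(I) is taken to be the maximum reward in subset I, S(i, K, pool) is the sum of M(I) over K-subsets containing i divided by C(|pool|, K), u_i is the mean of M(I) over K-subsets containing i, and v_i is the mean of M(I) over K-subsets of the pool with i removed.

```python
from itertools import combinations
from math import comb


def M(I, rewards):
    # Assumed subset score: the best reward in the subset (max@K).
    return max(rewards[j] for j in I)


def S(i, K, pool, rewards):
    # Sum of M(I) over K-subsets of `pool` containing i, divided by C(|pool|, K).
    total = sum(M(I, rewards) for I in combinations(pool, K) if i in I)
    return total / comb(len(pool), K)


def u(i, K, pool, rewards):
    # Mean of M(I) over K-subsets of `pool` that contain i.
    vals = [M(I, rewards) for I in combinations(pool, K) if i in I]
    return sum(vals) / len(vals)


def v(i, K, pool, rewards):
    # Mean of M(I) over K-subsets of the reduced pool (i removed).
    rest = [j for j in pool if j != i]
    vals = [M(I, rewards) for I in combinations(rest, K)]
    return sum(vals) / len(vals)


rewards = [0.1, 0.7, 0.3, 0.9, 0.5, 0.2]  # toy rewards, not from the paper
pool = list(range(len(rewards)))
K = 3
for i in pool:
    rest = [j for j in pool if j != i]
    # Identity (83): S(i, K, B) = (K / B) * u_i
    assert abs(S(i, K, pool, rewards) - K / len(pool) * u(i, K, pool, rewards)) < 1e-12
    # Identity (84): sum over the reduced pool of S(j, K, U) = K * v_i
    assert abs(sum(S(j, K, rest, rewards) for j in rest) - K * v(i, K, pool, rewards)) < 1e-12
```

    The enumeration is exponential in the pool size and is meant only as a sanity check on toy inputs; it requires the reduced pool to have at least K elements.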