arxiv: 2606.05800 · v1 · pith:RYIPWZWZnew · submitted 2026-06-04 · 💻 cs.LG

SALT: When More Rollouts Don't Help in Group-Based Policy Optimization and How to Make Them Matter

Pith reviewed 2026-06-28 02:49 UTC · model grok-4.3

classification 💻 cs.LG
keywords reinforcement learning · policy optimization · group relative updates · subspace adaptation · gradient cancellation · RLVR · GRPO · rollout scaling

The pith

SALT reweights group-relative policy updates by amplifying residual gradient channels identified from mini-batch geometry to prevent cancellation when adding more rollouts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that increasing rollout count in GRPO-style group normalization for RLVR often fails to strengthen learning signals because per-rollout gradients collapse into a low-rank signed structure that cancels during aggregation. SALT counters this by estimating the dominant shared subspace via the mini-batch Gram geometry, splitting coefficients into shared and residual channels, and boosting the residual channel specifically when cancellation is pronounced. This yields stronger effective policy updates and better benchmark results while leaving the reward model and rollout procedure untouched. A reader cares because the approach explains why rollout scaling stalls in practice and supplies a geometry-aware correction that preserves the original sampling budget.
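The cancellation effect described above can be illustrated with a toy example. The sketch below is not the paper's experiment: it assumes synthetic per-rollout gradients made of one shared unit direction plus small Gaussian residuals, binary rewards, and standard zero-mean group-normalized advantages. The names `grpo_advantages`, `group_update`, and `simulate` are hypothetical. Because the advantages sum to zero, the shared direction drops out of the aggregated update and only the residual part survives.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-normalized advantages: zero mean, near unit variance within one prompt group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def group_update(grads, advantages):
    """Advantage-weighted mean of per-rollout gradients, shape (d,)."""
    return advantages @ grads / len(advantages)

def simulate(group_size, dim=64, noise=0.1, seed=0):
    rng = np.random.default_rng(seed)
    shared = rng.normal(size=dim)
    shared /= np.linalg.norm(shared)           # unit direction common to every rollout
    residual = noise * rng.normal(size=(group_size, dim))
    grads = shared + residual                  # toy assumption: shared direction plus small residual
    rewards = rng.integers(0, 2, size=group_size)
    rewards[:2] = (0, 1)                       # guarantee a mixed group so the std is nonzero
    adv = grpo_advantages(rewards)
    return {
        "full": np.linalg.norm(group_update(grads, adv)),
        "residual_only": np.linalg.norm(group_update(residual, adv)),
        "per_rollout": np.linalg.norm(grads, axis=1).mean(),
    }

for g in (8, 64, 256):
    out = simulate(g)
    print(g, round(out["full"], 4), round(out["residual_only"], 4), round(out["per_rollout"], 4))
```

In this toy setting the full update norm matches the residual-only norm and shrinks as the group grows, even though each individual gradient stays roughly the same size. Real policy gradients need not follow this idealized structure.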

Core claim

SALT estimates a dominant shared subspace from the mini-batch Gram geometry, decomposes group-relative coefficients into shared and residual channels, and adaptively amplifies the residual channel when signed cancellation is severe, improving effective update geometry and performance across RLVR benchmarks without modifying the reward model or rollout sampling.

What carries the argument

Mini-batch Gram geometry that identifies the dominant shared subspace for decomposing and selectively amplifying residual channels in group-relative coefficients.
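As a rough illustration of this kind of split, the sketch below projects a coefficient vector onto the top eigenvectors of a mini-batch Gram matrix and rescales the remainder. It is one possible reading, not the paper's algorithm: the function name `salt_like_reweight`, the `rank` and `alpha_max` parameters, and the cancellation score (one minus the ratio of the realized update norm to the norm it would have with no cancellation) are all assumptions introduced here for illustration.

```python
import numpy as np

def salt_like_reweight(features, coeffs, rank=1, alpha_max=1.0, eps=1e-12):
    """Hypothetical shared/residual split of group-relative coefficients.

    features: (G, d) per-rollout gradient features, assumed to be given.
    coeffs:   (G,) group-relative coefficients, e.g. normalized advantages.
    Returns (new_coeffs, alpha, cancellation).
    """
    F = np.asarray(features, dtype=float)
    a = np.asarray(coeffs, dtype=float)

    # Mini-batch Gram matrix in rollout space, shape (G, G).
    gram = F @ F.T

    # eigh returns eigenvalues in ascending order, so the last columns
    # are the eigenvectors with the largest eigenvalues.
    _, vecs = np.linalg.eigh(gram)
    top = vecs[:, -rank:]

    # Part of the coefficients inside the dominant subspace, and the rest.
    shared = top @ (top.T @ a)
    residual = a - shared

    # Assumed cancellation score in [0, 1]: 0 when all weighted features
    # point the same way, close to 1 when they largely cancel.
    realized = np.linalg.norm(a @ F)
    no_cancel = np.sum(np.abs(a) * np.linalg.norm(F, axis=1))
    cancellation = 1.0 - realized / (no_cancel + eps)

    # Amplify only the residual channel, more strongly under heavier cancellation.
    alpha = alpha_max * cancellation
    return shared + (1.0 + alpha) * residual, alpha, cancellation
```

With `alpha_max=0.0` the function returns the original coefficients unchanged, which makes it easy to compare against an unadjusted baseline. The choice of score and of a linear amplification rule is arbitrary here; the paper's adaptive rule may differ.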

If this is right

  • Increasing rollout count becomes useful once residual-channel amplification is applied rather than relying on raw quantity.
  • Effective update strength improves on reasoning-oriented RLVR tasks across multiple model scales.
  • No changes to the reward model or sampling distribution are required for the gains.
  • Signed cancellation in low-rank gradient geometry is the primary bottleneck addressed by the subspace split.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same Gram-based decomposition might apply to other group-normalized objectives outside RLVR.
  • If residual amplification consistently reduces required group size, total compute for equivalent performance could drop.
  • Low-rank structure in policy gradients may recur in other mini-batch optimization settings and respond to similar channel separation.

Load-bearing premise

The mini-batch Gram geometry reliably identifies a dominant shared subspace whose decomposition into shared and residual channels allows selective amplification to produce net-positive policy updates without introducing instability or bias.

What would settle it

A controlled run on the same RLVR benchmarks where the Gram-derived subspace decomposition either fails to separate a clear residual channel or where amplifying that channel produces equal or worse final performance and higher variance than the unadjusted GRPO baseline.

Figures

Figures reproduced from arXiv: 2606.05800 by Chaoqun Sun, Chenyu Wang, Dongying Kong, JianXiang Xiang, Jinpeng Zhang, Lianrui Li, MiniWell Tsao, Powei Chang, Yukang Gao.

Figure 1
Figure 1: Illustration of rollout inefficiency in GRPO-style RLVR. Per-rollout gradients exhibit signed, low-rank redundancy, so group-normalized aggregation cancels much of the learning signal. SALT adaptively suppresses common signals and amplifies non-redundant diverse directions to strengthen effective updates.
Figure 2
Figure 2: Signed low-rank gradient redundancy in GRPO-style RLVR. (a) …
Figure 3
Figure 3: Training dynamics on MATH-TRAIN with GRPO (top) and DAPO (bottom), showing …
Figure 4
Figure 4: Ablation and rollout-scaling analysis of SALT. (a) SALT achieves the best PR–n_eff trade-off; ablations degrade update geometry. (b) Fixed-α exposes a brittle exploration–cancellation trade-off avoided by adaptive mixing. (c–d) With a fixed rollout budget, SALT better uses larger groups by raising PR while preserving n_eff, yielding consistent gains over GRPO/DAPO.
Figure 5
Figure 5: Residual-gradient bottleneck under rollout scaling. (a) As the rollout group size G increases, the shared-energy ratio decreases while the residual-energy ratio increases and saturates. (b) The residual-bound proxy remains bounded, whereas the realized update norm decreases with the group-averaging scale. (c) Across prompt groups, the realized update norm follows the residual-bound proxy, supporting that e…
Figure 6
Figure 6: Disjoint last-block proxy validation. We validate the LM-head proxy against per-sample gradients from the last Transformer block on identical rollouts and advantages. Since the two parameter subsets are disjoint, their agreement cannot arise from shared parameters. We compare the geometry used by SALT: (a) cosine-Gram correlation is high while permutation/noise nulls are near zero; (b) high-magnitude signe…
Figure 7
Figure 7: Signed low-rank geometry persists across model families, and SALT mitigates gradient cancellation. Training dynamics on AIME25 across multiple model families under matched sampling and optimization budgets. Top: Pass@1; Middle: effective sample size n_eff; Bottom: participation ratio (PR). Despite GRPO exhibiting low PR and low n_eff (strong cancellation), adding SALT consistently increases both PR and n_eff…
Figure 8
Figure 8: Code RLVR results on MBPP with DeepSeek-R1-Distill-Qwen-7B. GRPO exhibits strong cancellation (lower effective sample size n_eff) and an effectively low-rank update geometry (lower PR), while SALT consistently increases both n_eff and PR with higher Pass@1 performance.
Figure 9
Figure 9: Adaptive mixing coefficient and policy-gradient clipping fraction under different rollout …
Figure 10
Figure 10: Rollout scaling diagnostics: (a) Decomposition of SALT's gradient/update contributions into the main and exploration channels as the rollout group size G increases; the exploration share (and thus mixing) grows with G, indicating on-demand exploration under stronger cancellation. (b) Comparison with an entropy-only baseline at G = 8, G = 64, and G = 256: entropy increases randomness (higher PR) but does n…
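Several captions above report a participation ratio and an effective sample size. This page does not give the paper's exact definitions, so the sketch below uses the textbook forms as an assumption: the participation ratio of the uncentered Gram eigenvalues, (Σλ)² / Σλ², and Kish's effective sample size for a weight vector, (Σw)² / Σw². The function names are hypothetical.

```python
import numpy as np

def participation_ratio(features):
    """Participation ratio of the Gram spectrum for (G, d) features.

    Equals (sum of eigenvalues)^2 / (sum of squared eigenvalues), computed
    via traces so no eigendecomposition is needed. Ranges from 1 (all rows
    collinear) up to G (rows orthogonal with equal norms).
    """
    F = np.asarray(features, dtype=float)
    gram = F @ F.T
    return np.trace(gram) ** 2 / np.sum(gram * gram)

def kish_effective_sample_size(weights):
    """Kish effective sample size for a vector of nonnegative weights."""
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / np.sum(w * w)

# Collinear rows give a participation ratio of 1; orthonormal rows give G.
print(participation_ratio(np.ones((4, 3))))
print(participation_ratio(np.eye(4)))
```

A low participation ratio indicates that the per-rollout features occupy few effective directions, which is the low-rank situation the captions associate with GRPO.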
Original abstract

Reinforcement learning with verifiable rewards (RLVR) often adopts GRPO-style group-relative updates, sampling multiple rollouts per prompt to construct normalized learning signals. However, merely increasing the number of rollouts does not reliably strengthen learning: under GRPO-style group normalization, per-rollout policy-gradient features can concentrate into a low-rank, signed geometry, causing substantial cancellation during aggregation and weakening the effective update. We address this failure mode with SALT, a Subspace-Adaptive geometry pLug-in componenT that uses sample-wise gradient geometry to reweight the coefficients of group-relative updates. SALT estimates a dominant shared subspace from the mini-batch Gram geometry, decomposes group-relative coefficients into shared and residual channels, and adaptively amplifies the residual channel when signed cancellation is severe. Across diverse reasoning-oriented RLVR benchmarks and model scales, SALT improves effective update geometry and performance without modifying the reward model or the rollout sampling procedure.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper identifies a failure mode in GRPO-style group-relative policy optimization for RLVR where increasing rollouts per prompt fails to strengthen learning because per-rollout policy-gradient features concentrate into a low-rank signed geometry, causing cancellation during aggregation. It proposes SALT, a Subspace-Adaptive geometry pLug-in componenT, that estimates a dominant shared subspace from the mini-batch Gram geometry of policy-gradient features, decomposes group-relative coefficients into shared and residual channels, and adaptively amplifies the residual channel when signed cancellation is severe. The method is claimed to improve effective update geometry and performance across diverse reasoning-oriented RLVR benchmarks and model scales without modifying the reward model or rollout sampling.

Significance. If the mechanism holds, SALT would address a practically relevant inefficiency in group-based RL methods by making additional rollouts contribute positively through geometry-aware reweighting rather than cancellation. The plug-in nature, requiring no changes to reward models or sampling, would make it broadly applicable if the subspace estimation reliably isolates cancellation directions.

major comments (1)
  1. [Abstract] Abstract: The central claim that SALT produces net-positive policy updates without bias or instability rests on the assumption that the top eigenvectors of the mini-batch Gram matrix align predominantly with low-rank signed cancellation directions (rather than prompt-specific variance or rollout noise). No bound on explained variance, proof that amplification preserves unbiasedness of the group-relative estimator, or derivation showing the decomposition isolates cancellation-specific components is supplied; this premise is load-bearing for the performance claims.
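One simple diagnostic related to the missing explained-variance bound is the share of Gram energy captured by the top-k eigenvectors. The sketch below is an illustrative check, not something the paper or the referee specifies; `top_k_energy_ratio` is a hypothetical name, and the features are assumed to be available as a (G, d) array. A high ratio shows that a dominant subspace exists, but by itself it does not show that the subspace corresponds to cancellation directions rather than prompt-specific variance or noise.

```python
import numpy as np

def top_k_energy_ratio(features, k=1):
    """Fraction of total Gram eigenvalue mass held by the k largest eigenvalues."""
    F = np.asarray(features, dtype=float)
    eigvals = np.linalg.eigvalsh(F @ F.T)    # ascending order
    eigvals = np.clip(eigvals, 0.0, None)    # drop tiny negative values from round-off
    return eigvals[-k:].sum() / eigvals.sum()

# Collinear rows: one direction holds essentially all of the energy.
print(top_k_energy_ratio(np.ones((4, 3)), k=1))
# Orthonormal rows: energy is spread evenly, so the top 2 of 4 hold half.
print(top_k_energy_ratio(np.eye(4), k=2))
```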

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their detailed reading and for identifying the load-bearing assumption in our work. Below we respond directly to the major comment, clarifying the empirical scope of the manuscript while acknowledging where formal analysis is absent.

  1. Referee: [Abstract] Abstract: The central claim that SALT produces net-positive policy updates without bias or instability rests on the assumption that the top eigenvectors of the mini-batch Gram matrix align predominantly with low-rank signed cancellation directions (rather than prompt-specific variance or rollout noise). No bound on explained variance, proof that amplification preserves unbiasedness of the group-relative estimator, or derivation showing the decomposition isolates cancellation-specific components is supplied; this premise is load-bearing for the performance claims.

    Authors: We agree that the manuscript supplies no theoretical bounds on explained variance, no proof that residual amplification preserves unbiasedness of the group-relative estimator, and no formal derivation isolating cancellation-specific components. The paper is an empirical study: it documents the low-rank signed geometry of per-rollout policy-gradient features under GRPO-style normalization, shows that this geometry produces cancellation when rollouts increase, and demonstrates that a Gram-matrix-based decomposition plus residual amplification yields measurable gains on reasoning benchmarks. The alignment assumption is supported by the mini-batch visualizations and ablation results reported in the main text and appendix, where the top eigenvectors consistently correlate with directions of opposing signs across rollouts of the same prompt. Because the method is presented as a practical plug-in rather than a theoretically guaranteed estimator, we do not claim unbiasedness preservation beyond the original group-relative baseline; any bias introduced by adaptive amplification is treated as an empirical trade-off whose net effect is positive in the reported experiments. We are prepared to add an explicit limitations paragraph stating the absence of these guarantees if the editor requests it.

    revision: partial

Circularity Check

0 steps flagged

No circularity; derivation is self-contained plug-in


The paper introduces SALT as an external plug-in that estimates a dominant shared subspace from mini-batch Gram geometry of policy-gradient features, decomposes coefficients into shared/residual channels, and amplifies the residual when signed cancellation is detected. No equations, fitted parameters, or predictions are shown that reduce by construction to the method's own inputs (no self-definitional loops, no fitted-input-called-prediction, no load-bearing self-citations). The central mechanism is presented as an independent geometric reweighting step whose validity is claimed to be verified empirically across RLVR benchmarks rather than derived from the target result itself. The derivation chain therefore remains non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies insufficient technical detail to enumerate free parameters, axioms, or invented entities; SALT is described only at the level of a geometry-based plug-in.

pith-pipeline@v0.9.1-grok · 5726 in / 1114 out tokens · 29775 ms · 2026-06-28T02:49:47.504008+00:00 · methodology

