arxiv: 2605.19425 · v1 · pith:XB4WHAGPnew · submitted 2026-05-19 · 💻 cs.LG · cs.AI

When to Stop Reusing: Dynamic Gradient Gating for Sample-Efficient RLVR

Pith reviewed 2026-05-20 07:27 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords RLVR · sample efficiency · gradient gating · policy divergence · large language models · reinforcement learning · weight divergence

The pith

The lm_head gradient norm lower-bounds policy divergence, so gating gradients on its surges lets rollout batches be reused safely for up to 2.93 times better sample efficiency in RLVR.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

In RLVR for large language models, each rollout batch is expensive, so reusing it for multiple gradient updates seems attractive. Yet reuse quickly amplifies policy shift and collapses performance. The paper identifies that this collapse coincides with a sharp surge in lm_head weight changes while intermediate layers stay stable. It proves harmful gradients concentrate at the lm_head and that the norm of those gradients lower-bounds the policy divergence. From this signal the authors build a lightweight gate that intercepts bad gradients in real time, letting each batch drive several updates without the usual degradation.

Core claim

The authors establish the Disproportionate Weight Divergence phenomenon in which performance degradation is synchronized with a sharp surge in lm_head weight change. They prove that harmful gradients concentrate at the lm_head while intermediate layers are structurally attenuated, and that the lm_head gradient norm lower-bounds the policy divergence. Guided by these facts they introduce Dynamic Gradient Gating, which monitors the lm_head gradient norm in real time and intercepts gradients before they corrupt the optimizer.

What carries the argument

Dynamic Gradient Gating, a monitor that intercepts gradients whenever the lm_head gradient norm surges, thereby blocking updates that would produce large policy divergence.
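
The gating idea can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function name `gated_reuse`, the rule "stop when the norm exceeds `tau` times the first-step norm", and the choice to always apply the first update are assumptions made here. Only the existence of a threshold-like hyperparameter τ and a maximum reuse count K is taken from the figure captions.

```python
from typing import Sequence


def gated_reuse(lm_head_grad_norms: Sequence[float], tau: float, max_reuse: int) -> int:
    """Return how many updates on one rollout batch pass the gate.

    lm_head_grad_norms[k] is the lm_head gradient norm measured before the
    (k+1)-th update on the same batch.
    ASSUMPTION: the gate fires when the norm exceeds tau times the norm of
    the first update on that batch. The paper's exact rule may differ.
    """
    if max_reuse < 1:
        raise ValueError("max_reuse must be at least 1")
    applied = 0
    baseline = None
    for norm in lm_head_grad_norms[:max_reuse]:
        if baseline is None:
            baseline = norm  # ASSUMPTION: the first update is always applied
        elif norm > tau * baseline:
            break  # surge detected: drop this gradient and stop reusing the batch
        applied += 1
    return applied
```

With `max_reuse=1` the function can return at most one update per batch, which mirrors the statement that K = 1 reduces to the single-use baseline.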

If this is right

  • Each rollout batch can be used for multiple gradient steps while matching or exceeding the performance of single-use training.
  • Sample efficiency reaches up to 2.93 times the baseline across math, ALFWorld, WebShop, and search-augmented QA tasks.
  • Wall-clock training time improves by up to 2.14 times because fewer total rollouts are required.
  • The lm_head norm provides a real-time, model-agnostic indicator that can be checked before every optimizer step.
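
One plausible way to compute a sample-efficiency ratio like the one quoted above is sketched below. The definition is an assumption made for illustration: the target is the baseline's converged score (mean of its last five checkpoints, as the Figure 2 caption describes), and efficiency is the ratio of rollout batches each method needs to first reach that target. The function names and the one-point-per-rollout-batch convention are hypothetical.

```python
def batches_to_reach(curve, target):
    """1-based count of rollout batches at which curve first reaches target, else None."""
    for count, value in enumerate(curve, start=1):
        if value >= target:
            return count
    return None


def sample_efficiency(baseline_curve, method_curve, last_n=5):
    """Ratio of rollout batches needed by the baseline vs. the method.

    ASSUMPTION: each curve entry is the evaluation score after one more
    rollout batch, and the target is the mean of the baseline's last
    `last_n` entries.
    """
    tail = baseline_curve[-last_n:]
    target = sum(tail) / len(tail)
    baseline_batches = batches_to_reach(baseline_curve, target)
    method_batches = batches_to_reach(method_curve, target)
    if method_batches is None:
        return None  # the method never reaches the baseline's converged score
    return baseline_batches / method_batches
```

A returned value of 2.5 would mean the method reached the baseline's converged score using 2.5 times fewer rollout batches.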

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar gradient-norm monitoring at the output layer may stabilize policy updates in other reinforcement-learning settings beyond language models.
  • Combining the gate with replay buffers or importance sampling could further reduce the number of environment interactions needed.
  • The structural attenuation of gradients in intermediate layers suggests that output-layer stability is a general requirement for coherent long-horizon policies.

Load-bearing premise

That a surge in the lm_head gradient norm reliably marks the start of harmful policy shift and that blocking gradients on this signal alone prevents degradation without creating new failure modes.

What would settle it

An experiment on a new LLM-task pair in which the lm_head gradient norm fails to rise at the onset of degradation or in which gating on the norm still produces performance collapse.

Figures

Figures reproduced from arXiv: 2605.19425 by Lefei Zhang, Qi Gu, Sen Zhang, Xunliang Cai, Yaorui Shi, Yuchun Miao, Yuqi Zhang.

Figure 1
Figure 1: Illustration of the Disproportionate Weight Divergence (DWD) phenomenon on Qwen3-4B-Instruct across various tasks. Relative weight change is defined as ∥W_t − W_{t−Δ}∥_F / ∥W_ref∥_F (W_ref: initial pretrained weight; Δ: profiling interval, set to 10 RL steps); red dashed lines mark the onset of performance degradation for GRPO w/ Naive Reuse. Observation: Sample reuse accelerates early convergence, followed by a s…
Figure 2
Figure 2: Performance comparison across two LLMs and four tasks. GRPO w/ Naive Reuse uses a fixed number of updates per batch, equal to DGG’s maximum reuse. Gray dashed lines mark the converged performance of GRPO (Single-Use Rollout), averaged over the last five checkpoints, used as the reference for sample efficiency. Observation: DGG eliminates the instability of naive sample reuse, substantially improving sample …
Figure 3
Figure 3: Comparison of different monitoring signals on Qwen2.5-7B-Instruct (Math500). The red dashed line marks the onset of training collapse for GRPO w/ Naive Reuse. Observation: While KL divergence, clip ratio, and global gradient norm lack clear collapse signals, the lm_head gradient norm provides a sharp spike at collapse, serving as a reliable indicator that empirically validates our Structural Gradient Asymm…
Figure 5
Figure 5: Sensitivity analysis on DGG’s hyperparameters τ and K, conducted on Qwen3-4B-Instruct (Math500). Note that K = 1 reduces to the single-use baseline. Observation: DGG achieves better performance than the single-use baseline across most settings.
[…] with our observation in Section 4 that the DWD phenomenon is broadly present across diverse LLM architectures and tasks, providing a principled foundation for DGG’…
Figure 7
Figure 7: Illustration of the Disproportionate Weight Divergence (DWD) phenomenon on Qwen2.5-7B-Instruct across four tasks. Relative weight change is defined as ∥W_t − W_{t−Δ}∥_F / ∥W_ref∥_F (W_ref: initial pretrained weight; Δ: profiling interval, set to 10 RL steps); red dashed lines mark the onset of performance degradation for GRPO w/ Naive Reuse. Observation: Sample reuse accelerates early convergence, followed by a sev…
Figure 8
Figure 8: Illustration of the Disproportionate Weight Divergence (DWD) phenomenon on math task across different LLMs. Relative weight change is defined as ∥W_t − W_{t−Δ}∥_F / ∥W_ref∥_F (W_ref: initial pretrained weight; Δ: profiling interval, set to 10 RL steps); red dashed lines mark the onset of performance degradation for GRPO w/ Naive Reuse. Observation: Sample reuse accelerates early convergence, followed by a severe pe…
Figure 9
Figure 9: Stability of DGG across three random seeds on Qwen3-4B-Instruct and Qwen2.5-7B-Instruct (Math500). Solid lines show the mean accuracy and shaded regions denote one standard deviation across seeds. Observation: DGG consistently outperforms the single-use baseline and avoids Naive Reuse’s collapse across all seeds, confirming the reproducibility of its gains.
Figure 10
Figure 10: Relative weight change throughout the full RL training process on Qwen3-4B-Instruct and Qwen2.5-7B-Instruct (Math500), under the GRPO w/o Reuse regime. Observation: Without sample reuse, the relative weight change of all components, including the lm_head, grows smoothly throughout training, confirming that the abrupt lm_head surge observed under sample reuse is caused by reuse itself rather than by the incr…
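
The relative weight change used in the DWD figure captions, ∥W_t − W_{t−Δ}∥_F / ∥W_ref∥_F, can be computed as in the sketch below. It uses nested Python lists as stand-ins for weight matrices; the function names are chosen here for illustration, and a real profiler would apply the same formula to each layer's tensor.

```python
import math


def frobenius_norm(matrix):
    """Frobenius norm of a matrix given as a list of rows."""
    return math.sqrt(sum(x * x for row in matrix for x in row))


def relative_weight_change(w_now, w_prev, w_ref):
    """||W_t - W_{t-Delta}||_F / ||W_ref||_F for same-shaped matrices.

    w_now  : weights at step t
    w_prev : weights at step t - Delta (the previous profiling point)
    w_ref  : initial pretrained weights
    """
    diff = [
        [a - b for a, b in zip(row_now, row_prev)]
        for row_now, row_prev in zip(w_now, w_prev)
    ]
    return frobenius_norm(diff) / frobenius_norm(w_ref)
```

Tracking this quantity separately for the lm_head and for intermediate layers is what would expose the disproportionate surge the captions describe.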
Original abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has become the dominant paradigm for advanced reasoning in Large Language Models (LLMs), but rollout samples are expensive to obtain, making sample efficiency a critical bottleneck. A natural remedy is to reuse each rollout batch for multiple gradient updates, a standard practice in classical RL. Yet in RLVR, this amplifies policy shift, leading to severe performance degradation. Detecting the onset of degradation early enough to stop reuse remains an open and challenging problem. We close this gap by identifying the Disproportionate Weight Divergence (DWD) phenomenon: performance degradation is synchronized with a sharp surge in the lm_head weight change, while intermediate layers remain stable. Empirically, we verify that DWD emerges consistently across diverse LLMs and tasks. Theoretically, we prove that (i) harmful gradients concentrate at the lm_head while intermediate layers are structurally attenuated, and (ii) the lm_head gradient norm lower-bounds the policy divergence. These results establish the lm_head gradient norm as a principled, real-time signal of catastrophic policy shift. Guided by this insight, we propose Dynamic Gradient Gating (DGG), a lightweight intervention that monitors the lm_head gradient norm in real time and intercepts harmful gradients before they corrupt the optimizer. DGG consistently matches or exceeds the standard single-use baseline, achieving up to 2.93× sample efficiency and 2.14× wall-clock speedup across math, ALFWorld, WebShop, and search-augmented QA tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper identifies the Disproportionate Weight Divergence (DWD) phenomenon in RLVR for LLMs, where performance degradation during reuse of rollout batches correlates with surges in lm_head weight changes while intermediate layers remain stable. It provides theoretical proofs that harmful gradients concentrate at the lm_head and that the lm_head gradient norm lower-bounds policy divergence. Based on this, it introduces Dynamic Gradient Gating (DGG), a method that monitors the lm_head gradient norm in real time to intercept harmful gradients, enabling safe reuse of batches and achieving up to 2.93× sample efficiency and 2.14× wall-clock speedup across math, ALFWorld, WebShop, and search-augmented QA tasks while matching or exceeding the single-use baseline.

Significance. If the lower-bound proof is tight and extends rigorously to the multi-reuse regime without introducing new instabilities, this could meaningfully advance sample efficiency in RLVR by allowing controlled reuse of costly rollouts. The empirical verification of DWD consistency across diverse LLMs and tasks strengthens the practical case, and the lightweight nature of DGG is attractive for deployment. Credit is due for attempting a theoretically grounded gating signal rather than purely heuristic thresholds.

major comments (2)
  1. [Theoretical Analysis (proof of lower bound)] The central theoretical claim that the lm_head gradient norm lower-bounds policy divergence must be shown to apply to cumulative divergence after multiple reuses on the same batch. If the derivation (likely via Jacobian of softmax or Pinsker inequality) is for a single gradient step, an inductive step or telescoping argument is required; otherwise the real-time gating signal may miss degradation precisely when reuse is most harmful.
  2. [§3 (Theoretical Results)] The manuscript should clarify whether the lower bound is derived independently or effectively restates the observed DWD correlation. If the proof begins from the empirical surge in lm_head norm, the claim that this norm is a 'principled' signal risks circularity and weakens the justification for using it as a gating criterion.
minor comments (2)
  1. [Theoretical section] Define the precise divergence measure (KL, total variation, or other) used in the lower-bound statement and state the assumptions on the policy and reward model explicitly.
  2. [Experiments] Provide ablation results showing performance when gating is disabled versus when thresholds are tuned post-hoc on the same runs used for the reported speedups.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of the theoretical analysis that we address below with clarifications and planned revisions to strengthen the presentation.

Point-by-point responses
  1. Referee: [Theoretical Analysis (proof of lower bound)] The central theoretical claim that the lm_head gradient norm lower-bounds policy divergence must be shown to apply to cumulative divergence after multiple reuses on the same batch. If the derivation (likely via Jacobian of softmax or Pinsker inequality) is for a single gradient step, an inductive step or telescoping argument is required; otherwise the real-time gating signal may miss degradation precisely when reuse is most harmful.

    Authors: We appreciate this observation on extending the bound to the multi-reuse setting. The existing proof derives the per-update lower bound directly from the Jacobian of the softmax and an application of Pinsker's inequality to the one-step KL divergence between policies. Because DGG evaluates the lm_head gradient norm after every individual gradient step and gates before the optimizer applies the update, the per-step bound is enforced sequentially. To make the cumulative control explicit, we will add a telescoping argument in the revised §3 showing that the total policy divergence after k reuses is upper-bounded by the sum of the per-step norms; DGG therefore prevents accumulation by terminating reuse as soon as any single step would violate the threshold. This addition will be included in the next version. revision: yes

  2. Referee: [§3 (Theoretical Results)] The manuscript should clarify whether the lower bound is derived independently or effectively restates the observed DWD correlation. If the proof begins from the empirical surge in lm_head norm, the claim that this norm is a 'principled' signal risks circularity and weakens the justification for using it as a gating criterion.

    Authors: The lower bound is obtained from first principles by analyzing gradient flow through the transformer architecture: the lm_head receives unattenuated gradients from the output logits while intermediate layers are structurally damped by residual connections and layer norms. The derivation does not invoke the empirical DWD observations; those observations serve only as subsequent verification that the predicted concentration occurs in practice. We will revise the opening paragraph of §3 to state the logical order explicitly—architectural analysis and proof first, followed by empirical confirmation of DWD—to eliminate any appearance of circularity. revision: partial
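
For readers following the first response, the two standard tools it names can be stated as below. These are textbook statements, not the paper's theorem. The symbols are notation assumed here: $z$ for the logits, $p = \mathrm{softmax}(z)$, $a$ for the sampled token, $e_a$ for its one-hot vector, and $\pi_{\text{old}}$, $\pi_{\text{new}}$ for the policies before and after one update.

```latex
% Softmax Jacobian and the resulting gradient of a token log-probability
\[
\frac{\partial p_i}{\partial z_j} = p_i\,(\delta_{ij} - p_j),
\qquad
\nabla_z \log p_a = e_a - p .
\]
% Pinsker's inequality: total variation is controlled by KL divergence
\[
\mathrm{TV}\!\left(\pi_{\text{new}}, \pi_{\text{old}}\right)
\;\le\;
\sqrt{\tfrac{1}{2}\,\mathrm{KL}\!\left(\pi_{\text{new}} \,\|\, \pi_{\text{old}}\right)} .
\]
```

How the paper combines these into its lower bound is not shown on this page, so the chain from the lm_head gradient norm to the divergence is left unstated here.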

Circularity Check

0 steps flagged

No significant circularity in theoretical bound or DWD identification.

full rationale

The paper reports an empirical observation of the Disproportionate Weight Divergence phenomenon and separately states a theoretical proof that the lm_head gradient norm lower-bounds policy divergence via gradient concentration and structural attenuation arguments. No equations or text in the provided sections reduce the bound to a fitted parameter, self-citation chain, or restatement of the observation itself. The proof is presented as first-principles reasoning (Jacobian/inequality style) independent of the reuse-regime data. The central claim therefore retains independent mathematical content and does not collapse to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the unverified consistency of DWD across models and the assumption that lm_head gradient norm is a sufficient statistic for harmful policy shift. No explicit free parameters or invented entities are named in the abstract.



Applying this identity to each rank-1 term in Eq. (58):

\[
\|E_i h_{L,i}^{\top}\|_F^2 = \|E_i\|_2^2 \cdot \|h_{L,i}\|_2^2. \tag{59}
\]

Substituting the explicit form of $E_i$:

\[
\|E_i\|_2^2 = r_i^2 \hat{A}_i^2 \, \|e_{a_i} - \pi_\theta(\cdot \mid h_{L,i})\|_2^2. \tag{60}
\]

Combining these:

\[
\|G_{\mathrm{lm}}\|_F^2 \le \frac{1}{T} \sum_{i=1}^{T} r_i^2 \hat{A}_i^2 \, \|e_{a_i} - \pi_\theta(\cdot \mid h_{L,i})\|_2^2 \, \|h_{L,i}\|_2^2. \tag{61}
\]

Step 3: Absorbing representation factors into $c_{\max}$. By the definition of $c_{\max}$, the pe...
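
The rank-1 identity behind Eq. (59), that the squared Frobenius norm of an outer product equals the product of the squared Euclidean norms of its factors, is easy to check numerically. The sketch below does so in plain Python; the vectors `u` and `v` are arbitrary stand-ins for $E_i$ and $h_{L,i}$, and the helper names are chosen here for illustration.

```python
def outer(u, v):
    """Outer product u v^T as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]


def frobenius_sq(matrix):
    """Squared Frobenius norm of a matrix given as a list of rows."""
    return sum(x * x for row in matrix for x in row)


def l2_sq(vector):
    """Squared Euclidean norm of a vector."""
    return sum(x * x for x in vector)


# Arbitrary example vectors standing in for E_i and h_{L,i}.
u = [1.0, -2.0, 0.5]
v = [3.0, 4.0]

lhs = frobenius_sq(outer(u, v))  # ||u v^T||_F^2
rhs = l2_sq(u) * l2_sq(v)        # ||u||_2^2 * ||v||_2^2
```

Both sides agree because every entry of the outer product is a product $u_i v_j$, so the sum of squares factorizes over $i$ and $j$.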