When to Stop Reusing: Dynamic Gradient Gating for Sample-Efficient RLVR
Pith reviewed 2026-05-20 07:27 UTC · model grok-4.3
The pith
The lm_head gradient norm lower-bounds policy divergence, so gating gradients whenever that norm surges lets rollout batches be reused safely, for up to 2.93× sample efficiency in RLVR.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish the Disproportionate Weight Divergence (DWD) phenomenon, in which performance degradation is synchronized with a sharp surge in lm_head weight change. They prove that harmful gradients concentrate at the lm_head while intermediate layers are structurally attenuated, and that the lm_head gradient norm lower-bounds the policy divergence. Guided by these results, they introduce Dynamic Gradient Gating (DGG), which monitors the lm_head gradient norm in real time and intercepts gradients before they corrupt the optimizer.
What carries the argument
Dynamic Gradient Gating, a monitor that intercepts gradients whenever the lm_head gradient norm surges, thereby blocking updates that would produce large policy divergence.
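The gating idea can be illustrated with a minimal sketch. The surge rule used here (compare the current norm against a fixed multiple of a running average) is an assumption made for illustration; the material above does not say how DGG defines a surge. The names `SurgeGate` and `reuse_batch` and the default values are likewise hypothetical, and the norms are supplied as plain floats rather than read from a real model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SurgeGate:
    """Hypothetical surge detector; not the paper's actual threshold rule."""
    ratio: float = 3.0            # assumed surge multiplier
    decay: float = 0.9            # assumed decay of the running average
    ema: Optional[float] = None   # running average of accepted norms

    def allow(self, norm: float) -> bool:
        """Return True if the update may proceed, False on a surge."""
        if self.ema is None:
            self.ema = norm       # first observation seeds the average
            return True
        if norm > self.ratio * self.ema:
            return False          # surge: the gradient must not reach the optimizer
        # Only accepted norms update the running average.
        self.ema = self.decay * self.ema + (1.0 - self.decay) * norm
        return True


def reuse_batch(lm_head_grad_norms, gate, max_reuse):
    """Count the gradient steps one rollout batch supports before the gate fires."""
    applied = 0
    for norm in lm_head_grad_norms[:max_reuse]:
        if not gate.allow(norm):
            break                 # stop reusing this batch
        applied += 1              # stands in for optimizer.step()
    return applied


# Toy norms: three ordinary steps, then a surge on the fourth.
steps = reuse_batch([1.0, 1.1, 0.9, 5.0, 1.0], SurgeGate(), max_reuse=4)
print(steps)
```

The check runs before the stand-in for the optimizer step, which mirrors the ordering described above: the gradient is inspected first and only then applied or discarded.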
If this is right
- Each rollout batch can be used for multiple gradient steps while matching or exceeding the performance of single-use training.
- Sample efficiency reaches up to 2.93 times the baseline across math, ALFWorld, WebShop, and search-augmented QA tasks.
- Wall-clock training time improves by up to 2.14 times because fewer total rollouts are required.
- The lm_head norm provides a real-time, model-agnostic indicator that can be checked before every optimizer step.
Where Pith is reading between the lines
- Similar gradient-norm monitoring at the output layer may stabilize policy updates in other reinforcement-learning settings beyond language models.
- Combining the gate with replay buffers or importance sampling could further reduce the number of environment interactions needed.
- The structural attenuation of gradients in intermediate layers suggests that output-layer stability is a general requirement for coherent long-horizon policies.
Load-bearing premise
That a surge in the lm_head gradient norm reliably marks the start of harmful policy shift and that blocking gradients on this signal alone prevents degradation without creating new failure modes.
What would settle it
An experiment on a new LLM-task pair in which the lm_head gradient norm fails to rise at the onset of degradation or in which gating on the norm still produces performance collapse.
Original abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has become the dominant paradigm for advanced reasoning in Large Language Models (LLMs), but rollout samples are expensive to obtain, making sample efficiency a critical bottleneck. A natural remedy is to reuse each rollout batch for multiple gradient updates, a standard practice in classical RL. Yet in RLVR, this amplifies policy shift, leading to severe performance degradation. Detecting the onset of degradation early enough to stop reuse remains an open and challenging problem. We close this gap by identifying the Disproportionate Weight Divergence (DWD) phenomenon: performance degradation is synchronized with a sharp surge in the lm_head weight change, while intermediate layers remain stable. Empirically, we verify that DWD emerges consistently across diverse LLMs and tasks. Theoretically, we prove that (i) harmful gradients concentrate at the lm_head while intermediate layers are structurally attenuated, and (ii) the lm_head gradient norm lower-bounds the policy divergence. These results establish the lm_head gradient norm as a principled, real-time signal of catastrophic policy shift. Guided by this insight, we propose Dynamic Gradient Gating (DGG), a lightweight intervention that monitors the lm_head gradient norm in real time and intercepts harmful gradients before they corrupt the optimizer. DGG consistently matches or exceeds the standard single-use baseline, achieving up to 2.93× sample efficiency and 2.14× wall-clock speedup across math, ALFWorld, WebShop, and search-augmented QA tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper identifies the Disproportionate Weight Divergence (DWD) phenomenon in RLVR for LLMs, where performance degradation during reuse of rollout batches correlates with surges in lm_head weight changes while intermediate layers remain stable. It provides theoretical proofs that harmful gradients concentrate at the lm_head and that the lm_head gradient norm lower-bounds policy divergence. Based on this, it introduces Dynamic Gradient Gating (DGG), a method that monitors the lm_head gradient norm in real time to intercept harmful gradients, enabling safe reuse of batches and achieving up to 2.93× sample efficiency and 2.14× wall-clock speedup across math, ALFWorld, WebShop, and search-augmented QA tasks while matching or exceeding the single-use baseline.
Significance. If the lower-bound proof is tight and extends rigorously to the multi-reuse regime without introducing new instabilities, this could meaningfully advance sample efficiency in RLVR by allowing controlled reuse of costly rollouts. The empirical verification of DWD consistency across diverse LLMs and tasks strengthens the practical case, and the lightweight nature of DGG is attractive for deployment. Credit is due for attempting a theoretically grounded gating signal rather than purely heuristic thresholds.
major comments (2)
- [Theoretical Analysis (proof of lower bound)] The central theoretical claim that the lm_head gradient norm lower-bounds policy divergence must be shown to apply to cumulative divergence after multiple reuses on the same batch. If the derivation (likely via Jacobian of softmax or Pinsker inequality) is for a single gradient step, an inductive step or telescoping argument is required; otherwise the real-time gating signal may miss degradation precisely when reuse is most harmful.
- [§3 (Theoretical Results)] The manuscript should clarify whether the lower bound is derived independently or effectively restates the observed DWD correlation. If the proof begins from the empirical surge in lm_head norm, the claim that this norm is a 'principled' signal risks circularity and weakens the justification for using it as a gating criterion.
minor comments (2)
- [Theoretical section] Define the precise divergence measure (KL, total variation, or other) used in the lower-bound statement and state the assumptions on the policy and reward model explicitly.
- [Experiments] Provide ablation results showing performance when gating is disabled versus when thresholds are tuned post-hoc on the same runs used for the reported speedups.
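The first minor comment matters because the candidate divergence measures differ in scale and are ordered by known inequalities. The sketch below compares KL, total variation, and Pearson χ² on a toy pair of distributions. The distributions, the function names, and the argument order (updated policy first, behavior policy second) are illustrative assumptions, not values or conventions taken from the paper.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def total_variation(p, q):
    """TV(p, q) = 0.5 * sum |p - q|."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def pearson_chi2(p, q):
    """Pearson chi-squared divergence: sum (p - q)^2 / q."""
    return sum((pi - qi) ** 2 / qi for pi, qi in zip(p, q))


# Toy next-token distributions over three tokens (assumed values).
behavior = [0.5, 0.3, 0.2]
updated = [0.7, 0.2, 0.1]

tv = total_variation(updated, behavior)
kl_val = kl(updated, behavior)
chi2 = pearson_chi2(updated, behavior)

print(f"TV={tv:.4f}  KL={kl_val:.4f}  chi2={chi2:.4f}")
# Pinsker's inequality: TV <= sqrt(KL / 2).
print(tv <= math.sqrt(kl_val / 2))
# Standard ordering: KL <= chi2.
print(kl_val <= chi2)
```

Because the three quantities can differ by a sizable factor even on this small example, a threshold calibrated for one measure does not transfer directly to another.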
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of the theoretical analysis that we address below with clarifications and planned revisions to strengthen the presentation.
- Referee: [Theoretical Analysis (proof of lower bound)] The central theoretical claim that the lm_head gradient norm lower-bounds policy divergence must be shown to apply to cumulative divergence after multiple reuses on the same batch. If the derivation (likely via Jacobian of softmax or Pinsker inequality) is for a single gradient step, an inductive step or telescoping argument is required; otherwise the real-time gating signal may miss degradation precisely when reuse is most harmful.
  Authors: We appreciate this observation on extending the bound to the multi-reuse setting. The existing proof derives the per-update lower bound directly from the Jacobian of the softmax and an application of Pinsker's inequality to the one-step KL divergence between policies. Because DGG evaluates the lm_head gradient norm after every individual gradient step and gates before the optimizer applies the update, the per-step bound is enforced sequentially. To make the cumulative control explicit, we will add a telescoping argument in the revised §3 showing that the total policy divergence after k reuses is upper-bounded by the sum of the per-step norms; DGG therefore prevents accumulation by terminating reuse as soon as any single step would violate the threshold. This addition will be included in the next version.
  Revision: yes
- Referee: [§3 (Theoretical Results)] The manuscript should clarify whether the lower bound is derived independently or effectively restates the observed DWD correlation. If the proof begins from the empirical surge in lm_head norm, the claim that this norm is a 'principled' signal risks circularity and weakens the justification for using it as a gating criterion.
  Authors: The lower bound is obtained from first principles by analyzing gradient flow through the transformer architecture: the lm_head receives unattenuated gradients from the output logits while intermediate layers are structurally damped by residual connections and layer norms. The derivation does not invoke the empirical DWD observations; those observations serve only as subsequent verification that the predicted concentration occurs in practice. We will revise the opening paragraph of §3 to state the logical order explicitly (architectural analysis and proof first, followed by empirical confirmation of DWD) to eliminate any appearance of circularity.
  Revision: partial
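For orientation, one standard form a telescoping argument of the kind mentioned in the first response could take is sketched below. It is not the authors' derivation: it assumes the divergence is total variation, writes the policy after the j-th reuse step as \pi_{\theta_j} (notation introduced here for illustration), and uses only the triangle inequality and Pinsker's inequality.

```latex
% Triangle inequality for total variation, then Pinsker's inequality per step.
\[
\mathrm{TV}\bigl(\pi_{\theta_0}, \pi_{\theta_k}\bigr)
\;\le\; \sum_{j=1}^{k} \mathrm{TV}\bigl(\pi_{\theta_{j-1}}, \pi_{\theta_j}\bigr)
\;\le\; \sum_{j=1}^{k} \sqrt{\tfrac{1}{2}\,
        \mathrm{KL}\bigl(\pi_{\theta_{j-1}} \,\|\, \pi_{\theta_j}\bigr)}
\]
```

This bounds cumulative drift by per-step KL terms only. Relating each per-step term to the lm_head gradient norm would need a further step that the material above does not supply.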
Circularity Check
No significant circularity in theoretical bound or DWD identification.
The paper reports an empirical observation of the Disproportionate Weight Divergence phenomenon and separately states a theoretical proof that the lm_head gradient norm lower-bounds policy divergence via gradient concentration and structural attenuation arguments. No equations or text in the provided sections reduce the bound to a fitted parameter, self-citation chain, or restatement of the observation itself. The proof is presented as first-principles reasoning (Jacobian/inequality style) independent of the reuse-regime data. The central claim therefore retains independent mathematical content and does not collapse to its inputs by construction.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: the lm_head gradient norm lower-bounds the empirical Pearson χ² divergence between the updating and behavior policies (Theorem 2)
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: harmful gradients concentrate at the lm_head while intermediate layers are structurally attenuated (Theorem 1)
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
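Applying this identity to each rank-1 term in Eq. (58):
$$\|E_i h_{L,i}^{\top}\|_F^2 = \|E_i\|_2^2 \cdot \|h_{L,i}\|_2^2. \tag{59}$$
Substituting the explicit form of $E_i$:
$$\|E_i\|_2^2 = r_i^2 \hat{A}_i^2 \, \|e_{a_i} - \pi_\theta(\cdot \mid h_{L,i})\|_2^2. \tag{60}$$
Combining these:
$$\|G_{\mathrm{lm}}\|_F^2 \le \frac{1}{T} \sum_{i=1}^{T} r_i^2 \hat{A}_i^2 \, \|e_{a_i} - \pi_\theta(\cdot \mid h_{L,i})\|_2^2 \, \|h_{L,i}\|_2^2. \tag{61}$$
Step 3: Absorbing representation factors into $c_{\max}$. By the definition of $c_{\max}$, the pe...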
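The rank-1 identity behind Eq. (59), that the squared Frobenius norm of an outer product equals the product of the squared Euclidean norms of its two factors, can be checked numerically. The sketch below builds one per-token term in the shape suggested by Eq. (60). All numeric values and helper names are illustrative assumptions, not quantities from the paper.

```python
import math


def outer(u, v):
    """Outer product u v^T as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]


def frobenius_sq(matrix):
    """Squared Frobenius norm: sum of squared entries."""
    return sum(x * x for row in matrix for x in row)


def norm_sq(vec):
    """Squared Euclidean norm."""
    return sum(x * x for x in vec)


# Assumed toy values for a single token i.
probs = [0.7, 0.2, 0.1]          # pi_theta(. | h_{L,i}) over three tokens
action = 1                       # index a_i of the sampled token
scale = 1.2 * -0.5               # r_i * A_hat_i (ratio times advantage)
hidden = [0.5, -1.0, 2.0, 0.25]  # final hidden state h_{L,i}

# E_i = r_i * A_hat_i * (e_{a_i} - pi_theta), following the form in Eq. (60).
e_vec = [scale * ((1.0 if k == action else 0.0) - p) for k, p in enumerate(probs)]

lhs = frobenius_sq(outer(e_vec, hidden))  # ||E_i h^T||_F^2
rhs = norm_sq(e_vec) * norm_sq(hidden)    # ||E_i||_2^2 * ||h||_2^2
print(math.isclose(lhs, rhs))
```

The factorization holds for any pair of vectors, which is why the per-token bound separates a policy-dependent factor from a representation-dependent factor.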