Don't Let Gains FADE: Breaking Down Policy Gradient Weights in RL
Pith reviewed 2026-07-03 20:57 UTC · model grok-4.3
The pith
Decomposing any advantage into sign and difficulty axes reveals shifting preferences that a dynamic scheduler can exploit during LLM RL post-training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Any advantage decomposes into positive and negative gradient mass along the sign axis, where imbalance collapses either entropy or weight geometry, and along the difficulty axis, where hard-problem focus sharpens signal at the expense of sample size. These trade-offs shift during training: exploration favors balance and hard focus while exploitation favors suppression and medium focus. The resulting self-adapting advantage, which schedules gradient weight according to observed dynamics, produces earlier peaks in pass@1 and a superior accuracy-diversity curve.
What carries the argument
The two-axis decomposition of advantage functions into sign balance (positive versus negative mass) and difficulty focus (hard versus medium problems), which tracks the required shift between exploration and exploitation phases.
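As a minimal sketch of what "gradient mass" along the two axes could mean, the Python below assumes binary rewards and a mean-centered, group-relative advantage (a GRPO-style baseline without standard-deviation normalization). Neither assumption is taken from the paper, and the names `centered_advantage`, `gradient_mass`, and `reweight`, along with the parameters `w_pos`, `w_neg`, and `gamma`, are hypothetical illustrations rather than FADE's actual formula.

```python
import numpy as np

def centered_advantage(rewards):
    """Mean-centered advantage per prompt (assumed baseline, not from the paper).

    rewards: array of shape (num_prompts, rollouts_per_prompt) with 0/1 entries.
    Returns the advantage matrix and the per-prompt pass rate.
    """
    rewards = np.asarray(rewards, dtype=float)
    pass_rate = rewards.mean(axis=1, keepdims=True)
    return rewards - pass_rate, pass_rate.squeeze(1)

def gradient_mass(adv):
    """Sum of advantage magnitudes on the positive and negative side, per prompt."""
    pos = np.clip(adv, 0.0, None).sum(axis=1)
    neg = np.clip(-adv, 0.0, None).sum(axis=1)
    return pos, neg

def reweight(adv, pass_rate, w_pos=1.0, w_neg=1.0, gamma=0.0):
    """Hypothetical two-axis reweighting.

    Sign axis: scale positive and negative advantages by w_pos and w_neg.
    Difficulty axis: scale each prompt by (1 - pass_rate) ** gamma, so a
    larger gamma shifts weight toward harder (low pass rate) prompts.
    """
    sign_w = np.where(adv > 0, w_pos, w_neg)
    diff_w = (1.0 - pass_rate) ** gamma
    return adv * sign_w * diff_w[:, None]

# Three prompts with four rollouts each: easy, hard, and medium.
rewards = np.array([[1, 1, 1, 0],
                    [1, 0, 0, 0],
                    [1, 1, 0, 0]])
adv, pass_rate = centered_advantage(rewards)
pos, neg = gradient_mass(adv)

# Suppress negative mass by half and tilt weight toward hard prompts.
shaped = reweight(adv, pass_rate, w_neg=0.5, gamma=1.0)
pos_s, neg_s = gradient_mass(shaped)
```

Under these assumptions the mean-centered baseline is exactly sign-balanced (positive and negative mass are equal for every prompt) and its per-prompt mass is n·p·(1−p) for n rollouts at pass rate p, which peaks on medium-difficulty prompts. Moving along either axis therefore requires an explicit reweighting such as the hypothetical one shown.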
If this is right
- FADE reaches peak pass@1 20k steps earlier than the best static baseline at the 7B scale.
- FADE reaches peak pass@1 2k steps earlier than the best static baseline at the 32B scale.
- FADE produces the best accuracy-diversity trade-off across all pass@k on LiveCodeBench and AIME.
- Automatic scheduling of gradient mass along the two axes reduces both training instability and diversity collapse compared with fixed advantages.
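The pass@1 and pass@k figures above depend on how pass@k is estimated. This page does not say which estimator the paper uses; the sketch below assumes the widely used unbiased estimator from the Codex evaluation paper (Chen et al., 2021), where n samples are drawn per problem and c of them are correct. The function name `pass_at_k` is illustrative.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: 1 - C(n-c, k) / C(n, k).

    n: number of samples drawn, c: number of correct samples, k: budget.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Every size-k subset contains at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 10 samples of which 3 are correct, pass@1 is the plain success rate.
p1 = pass_at_k(10, 3, 1)
p5 = pass_at_k(10, 3, 5)
```

A benchmark-level score is the mean of this quantity over problems. Comparing it across several values of k is one common way to read an accuracy-diversity trade-off: a policy that has collapsed onto a few solutions gains little as k grows.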
Where Pith is reading between the lines
- The same two-axis decomposition could be used to diagnose training problems in RL settings outside language-model post-training.
- If the axes capture the main trade-offs, closed-form schedules derived from training progress might eventually replace dynamic reading of statistics.
- Testing the method on non-reasoning tasks would show whether the exploration-to-exploitation shift is specific to reasoning benchmarks.
Load-bearing premise
That sign balance and difficulty focus are the dominant drivers of instability and diversity collapse, so that automatically scheduling gradient mass along them will reliably improve outcomes without new failure modes.
What would settle it
Training runs that replace the dynamic scheduler with a single fixed weighting equal to the average values discovered by FADE and obtain matching or higher pass rates plus diversity on LiveCodeBench and AIME at both 7B and 32B scales.
Reinforcement learning post-training dramatically improves LLM reasoning, but suffers from training instability and diversity collapse. Advantage functions offer an appealing fix: they reshape the training objective, reweight which rollouts drive learning, and are trivial to implement. Yet a proliferation of methods makes it unclear which advantage to use and when. We cut through the confusion with a unifying framework that decomposes any advantage into its positive and negative gradient mass along two orthogonal axes. On the sign axis, imbalanced updates collapse either entropy or weight geometry. On the difficulty axis, hard-problem focus sharpens signal but costs sample size. Both trade-offs shift during training: exploration favors balance and hard focus; exploitation favors suppression and medium focus. This motivates FADE (Focal Advantage with Dynamic Entropy), a self-adapting advantage that reads training dynamics to schedule the gradient weight automatically. FADE reaches peak pass@1 20k steps earlier than the best static baseline at the 7B scale and 2k steps earlier at the 32B scale, while achieving the best accuracy-diversity trade-off across all pass@k on LiveCodeBench and AIME.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that any advantage function can be decomposed into positive/negative gradient mass along two orthogonal axes (sign balance and difficulty focus), with training dynamics naturally shifting from balanced/hard-focus (exploration) to suppressed/medium-focus (exploitation). This decomposition motivates FADE, a self-adapting advantage that dynamically schedules gradient weights based on observed dynamics, yielding faster peak pass@1 (20k steps earlier at 7B scale, 2k at 32B) and superior accuracy-diversity trade-offs across pass@k on LiveCodeBench and AIME compared to static baselines.
Significance. If the orthogonality holds and the dynamic scheduler delivers robust gains without new instabilities, the work supplies a generalizable, dynamics-driven alternative to fixed advantage heuristics in LLM RL post-training, potentially improving sample efficiency and stability in a setting where diversity collapse is a known failure mode.
major comments (2)
- [Unifying framework (decomposition into axes)] The unifying framework asserts that the sign-balance and difficulty-focus axes are orthogonal, enabling independent control of the two trade-offs via the dynamic scheduler; however, no verification (e.g., correlation analysis between axes or ablation of joint effects) is supplied, so if the axes are correlated in practice the claimed independent scheduling and explanatory power for the reported speed-ups would not hold.
- [Experiments and results] The central empirical claims (earlier peak pass@1 and best accuracy-diversity Pareto front) rest on comparisons to static baselines, yet the manuscript supplies no definition of those baselines, number of seeds, statistical tests, or ablation removing the dynamic component, leaving open whether gains are attributable to the proposed mechanism.
minor comments (1)
- [Abstract] The abstract states gains on 'two model scales and two benchmarks' but does not name the exact model families or confirm whether AIME is the full benchmark or a subset.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the unifying framework and experimental details. We address each major comment below and will revise the manuscript accordingly to strengthen the claims.
- Referee: [Unifying framework (decomposition into axes)] The unifying framework asserts that the sign-balance and difficulty-focus axes are orthogonal, enabling independent control of the two trade-offs via the dynamic scheduler; however, no verification (e.g., correlation analysis between axes or ablation of joint effects) is supplied, so if the axes are correlated in practice the claimed independent scheduling and explanatory power for the reported speed-ups would not hold.
  Authors: We agree that empirical verification of orthogonality would strengthen the presentation. The axes are constructed to be orthogonal by definition in the decomposition (sign-balance operates on the polarity of gradient contributions while difficulty-focus operates on the magnitude distribution of advantages, yielding independent control parameters). To confirm this holds in practice and rule out unintended correlations during training, we will add a correlation analysis between the two axes across training trajectories as well as an ablation isolating joint effects in the revised manuscript.
  Revision: yes
- Referee: [Experiments and results] The central empirical claims (earlier peak pass@1 and best accuracy-diversity Pareto front) rest on comparisons to static baselines, yet the manuscript supplies no definition of those baselines, number of seeds, statistical tests, or ablation removing the dynamic component, leaving open whether gains are attributable to the proposed mechanism.
  Authors: We acknowledge these omissions reduce clarity. In the revision we will explicitly define all static baselines, report the number of seeds used (three independent runs), include statistical tests comparing peak performance and Pareto fronts, and add an ablation that disables the dynamic scheduler while retaining the underlying advantage decomposition. These additions will directly attribute the observed speed-ups and trade-off improvements to the proposed mechanism.
  Revision: yes
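The correlation analysis promised in the first response could, in its simplest form, be a Pearson correlation between one logged statistic per axis across training steps. The sketch below is an assumption about how such a check might look, not the authors' procedure: `axis_correlation` is a hypothetical helper, and the inputs stand in for whatever per-step sign-balance and difficulty-focus statistics the training run records. The arrays in the example are synthetic and chosen only to exercise the function.

```python
import numpy as np

def axis_correlation(sign_balance, difficulty_focus):
    """Pearson correlation between two per-step statistics (hypothetical check).

    sign_balance, difficulty_focus: 1-D sequences of equal length, one value
    per logged training step. A value near 0 is consistent with the two axes
    varying independently; a value near +1 or -1 suggests they move together.
    """
    x = np.asarray(sign_balance, dtype=float)
    y = np.asarray(difficulty_focus, dtype=float)
    if x.ndim != 1 or x.shape != y.shape or x.size < 2:
        raise ValueError("inputs must be 1-D, equal length, with >= 2 steps")
    return float(np.corrcoef(x, y)[0, 1])

# Synthetic illustration only: an uncorrelated pair and a perfectly coupled pair.
r_independent = axis_correlation([1, -1, 1, -1], [1, 1, -1, -1])
r_coupled = axis_correlation([0, 1, 2, 3], [1, 3, 5, 7])
```

A low linear correlation would not by itself establish that the axes can be controlled independently, which is why the referee also asks for an ablation of joint effects.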
Circularity Check
No circularity: framework is observational and empirical
The paper introduces a unifying framework decomposing advantages into positive/negative gradient mass along sign and difficulty axes, motivated explicitly by observed training dynamics shifting from exploration to exploitation. No equations, derivations, or fitted parameters appear in the provided text; FADE is presented as a self-adapting scheduler reading those dynamics rather than reducing to any input by construction. No self-citations are load-bearing for the core claims, and performance results are reported as empirical outcomes on LiveCodeBench and AIME. The derivation chain is self-contained against external benchmarks with no reduction to inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Policy gradient methods can be improved by reshaping advantages along sign and difficulty axes
invented entities (1)
- FADE (Focal Advantage with Dynamic Entropy): no independent evidence