Don't Let Gains FADE: Breaking Down Policy Gradient Weights in RL

Francis Bach; Gabriel Synnaeve; Juliette Decugis; Sean O'Brien; Taco Cohen

arxiv: 2607.01490 · v1 · pith:ZOO2NWSMnew · submitted 2026-07-01 · 💻 cs.LG · cs.AI

Juliette Decugis, Sean O'Brien, Francis Bach, Gabriel Synnaeve, Taco Cohen

Pith reviewed 2026-07-03 20:57 UTC · model grok-4.3

keywords: reinforcement learning · LLM post-training · advantage functions · policy gradients · training stability · diversity collapse · FADE

The pith

Decomposing any advantage into sign and difficulty axes reveals shifting preferences that a dynamic scheduler can exploit during LLM RL post-training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that advantage functions reshape which rollouts drive learning in RL by controlling positive and negative gradient mass along two axes. On the sign axis, balance between positive and negative updates prevents collapse of entropy or weight geometry. On the difficulty axis, focus on hard problems sharpens the learning signal but reduces effective sample size. These preferences change as training moves from exploration, which needs balance and hard focus, to exploitation, which favors suppression of negatives and medium focus. This decomposition directly motivates an automatic scheduler that reads the current dynamics to adjust the advantage weights on the fly.
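As a concrete reading of "positive and negative gradient mass", the sketch below assumes binary rewards and a mean-centered, group-relative advantage (reward minus the prompt's pass rate). The paper's own definitions may differ, and the function name `gradient_mass` is hypothetical. Under that assumption the two masses are equal for every prompt and are largest at medium difficulty.

```python
import numpy as np

def gradient_mass(rewards):
    """Split one prompt's mean-centered advantages into positive and negative mass.

    rewards: 1-D array of 0/1 outcomes for the rollouts of a single prompt.
    Returns (pass_rate, positive_mass, negative_mass).
    Assumption: advantage = reward - group mean (not the paper's definition).
    """
    rewards = np.asarray(rewards, dtype=float)
    pass_rate = rewards.mean()
    adv = rewards - pass_rate
    positive_mass = adv[adv > 0].sum()
    negative_mass = -adv[adv < 0].sum()
    return pass_rate, positive_mass, negative_mass

# Eight rollouts per prompt: a hard, a medium, and an easy prompt.
for n_correct in (1, 4, 7):
    r = np.array([1] * n_correct + [0] * (8 - n_correct))
    p, pos, neg = gradient_mass(r)
    print(f"pass_rate={p:.3f} positive={pos:.3f} negative={neg:.3f}")
```

Because this baseline advantage is balanced by construction, any sign imbalance in this toy setting would have to come from an explicit reweighting of one side.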

Core claim

Any advantage decomposes into positive and negative gradient mass along the sign axis, where imbalance collapses either entropy or weight geometry, and along the difficulty axis, where hard-problem focus sharpens signal at the expense of sample size. These trade-offs shift during training: exploration favors balance and hard focus while exploitation favors suppression and medium focus. The resulting self-adapting advantage, which schedules gradient weight according to observed dynamics, produces earlier peaks in pass@1 and a superior accuracy-diversity curve.
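The sample-size cost of hard-problem focus can be illustrated with Kish's effective sample size, ESS = (sum of w)^2 / (sum of w^2). The paper may measure sample size differently; the focal-style weight `(1 - pass_rate) ** gamma` and the exponent `gamma` are hypothetical choices made only for this example.

```python
import numpy as np

def effective_sample_size(weights):
    """Kish effective sample size: (sum w)^2 / sum w^2."""
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / (w ** 2).sum()

# Hypothetical batch: per-prompt pass rates spread evenly over (0, 1).
pass_rates = np.linspace(0.05, 0.95, 10)

for gamma in (0.0, 1.0, 4.0):
    # Larger gamma concentrates weight on low-pass-rate (hard) prompts.
    w = (1.0 - pass_rates) ** gamma
    print(f"gamma={gamma}: ESS={effective_sample_size(w):.2f} of {len(w)}")
```

With `gamma = 0` every prompt counts equally and the ESS equals the batch size; raising `gamma` sharpens the focus on hard prompts and shrinks the ESS.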

What carries the argument

The two-axis decomposition of advantage functions into sign balance (positive versus negative mass) and difficulty focus (hard versus medium problems), which tracks the required shift between exploration and exploitation phases.
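One way to picture the two axes as separate knobs is sketched below. It is not FADE and not the paper's parameterization: `neg_scale` (sign axis) and `gamma` (difficulty axis) are hypothetical parameters, and the mean-centered base advantage with binary rewards is an assumption.

```python
import numpy as np

def two_axis_advantage(rewards, neg_scale=1.0, gamma=0.0):
    """Reweight a mean-centered advantage along two hypothetical knobs.

    rewards:   (num_prompts, num_rollouts) array of 0/1 outcomes.
    neg_scale: multiplier on negative advantages (sign axis);
               values below 1 suppress negative updates.
    gamma:     focal-style exponent (difficulty axis); larger values
               shift weight toward prompts with low pass rates.
    """
    rewards = np.asarray(rewards, dtype=float)
    pass_rate = rewards.mean(axis=1, keepdims=True)
    adv = rewards - pass_rate
    adv = np.where(adv < 0, neg_scale * adv, adv)  # sign axis
    return adv * (1.0 - pass_rate) ** gamma        # difficulty axis

rewards = np.array([[1, 0, 0, 0],   # hard prompt, pass rate 0.25
                    [1, 1, 1, 0]])  # easy prompt, pass rate 0.75
print(two_axis_advantage(rewards, neg_scale=0.5, gamma=2.0))
```

In this sketch the two knobs multiply independently. Whether they also act independently on training dynamics is an empirical question the code does not address.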

If this is right

FADE reaches peak pass@1 20k steps earlier than the best static baseline at the 7B scale.
FADE reaches peak pass@1 2k steps earlier than the best static baseline at the 32B scale.
FADE produces the best accuracy-diversity trade-off across all pass@k on LiveCodeBench and AIME.
Automatic scheduling of gradient mass along the two axes reduces both training instability and diversity collapse compared with fixed advantages.
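For reference, pass@k is commonly computed with the unbiased estimator from "Evaluating Large Language Models Trained on Code" (Chen et al., 2021): pass@k = 1 - C(n - c, k) / C(n, k), from n samples of which c are correct. Whether the paper uses exactly this estimator is an assumption; the sketch shows the usual numerically stable form.

```python
import numpy as np

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples with c correct.

    Equals 1 - C(n - c, k) / C(n, k), evaluated as a running product
    to avoid large binomial coefficients.
    """
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=10, c=3, k=5))
```

Comparing pass@1 with pass@k at larger k is what exposes the accuracy-diversity trade-off: a policy that collapses onto one answer per problem gains little from extra samples.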

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same two-axis decomposition could be used to diagnose training problems in RL settings outside language-model post-training.
If the axes capture the main trade-offs, closed-form schedules derived from training progress might eventually replace dynamic reading of statistics.
Testing the method on non-reasoning tasks would show whether the exploration-to-exploitation shift is specific to reasoning benchmarks.

Load-bearing premise

That sign balance and difficulty focus are the dominant drivers of instability and diversity collapse, so that automatically scheduling gradient mass along them will reliably improve outcomes without new failure modes.

What would settle it

Training runs that replace the dynamic scheduler with a single fixed weighting equal to the average values discovered by FADE and obtain matching or higher pass rates plus diversity on LiveCodeBench and AIME at both 7B and 32B scales.
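The control run described above could be configured by averaging the logged schedule. The log format and the parameter names `neg_scale` and `gamma` are hypothetical; the abstract does not say which quantities FADE schedules, and the numbers below are made up.

```python
import numpy as np

def fixed_baseline_from_schedule(schedule_log):
    """Average each scheduled quantity over training steps.

    schedule_log: dict mapping a parameter name to its per-step values
                  as recorded during a dynamic run (hypothetical format).
    Returns a dict of constants to use in a static control run.
    """
    return {name: float(np.mean(values)) for name, values in schedule_log.items()}

# Made-up log of a dynamic run, for illustration only.
log = {
    "neg_scale": [1.0, 1.0, 0.8, 0.5, 0.2],
    "gamma": [2.0, 2.0, 1.5, 1.0, 0.5],
}
print(fixed_baseline_from_schedule(log))
```

If a static run with these averaged constants matched the dynamic run, the gains would be attributable to the weight values rather than to their timing.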

Original abstract

Reinforcement learning post-training dramatically improves LLM reasoning, but suffers from training instability and diversity collapse. Advantage functions offer an appealing fix: they reshape the training objective, reweight which rollouts drive learning, and are trivial to implement. Yet a proliferation of methods makes it unclear which advantage to use and when. We cut through the confusion with a unifying framework that decomposes any advantage into its positive and negative gradient mass along two orthogonal axes. On the sign axis, imbalanced updates collapse either entropy or weight geometry. On the difficulty axis, hard-problem focus sharpens signal but costs sample size. Both trade-offs shift during training: exploration favors balance and hard focus; exploitation favors suppression and medium focus. This motivates FADE (Focal Advantage with Dynamic Entropy), a self-adapting advantage that reads training dynamics to schedule the gradient weight automatically. FADE reaches peak pass@1 20k steps earlier than the best static baseline at the 7B scale and 2k steps earlier at the 32B scale, while achieving the best accuracy-diversity trade-off across all pass@k on LiveCodeBench and AIME.
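The abstract does not specify which statistics FADE reads or how it maps them to weights. The toy below is therefore purely illustrative and is not the paper's scheduler: it assumes that falling policy entropy marks the move from exploration to exploitation, and every name and threshold (`toy_schedule`, `entropy_floor`, `neg_scale_late`, `hard_target`) is hypothetical.

```python
def toy_schedule(entropy, entropy_start, entropy_floor,
                 neg_scale_late=0.2, hard_target=0.1):
    """Map current policy entropy to two illustrative knob settings.

    Returns (neg_scale, target_pass_rate):
      neg_scale        1.0 while exploring, shrinking to neg_scale_late;
      target_pass_rate pass rate receiving the most weight, moving from
                       hard_target toward 0.5 (medium difficulty).
    All thresholds and end points are hypothetical.
    """
    span = entropy_start - entropy_floor
    if span <= 0:
        raise ValueError("entropy_start must exceed entropy_floor")
    # phase = 0 at the start of training, 1 once entropy reaches the floor.
    phase = min(1.0, max(0.0, (entropy_start - entropy) / span))
    neg_scale = 1.0 + phase * (neg_scale_late - 1.0)
    target_pass_rate = hard_target + phase * (0.5 - hard_target)
    return neg_scale, target_pass_rate

for h in (2.0, 1.25, 0.5):
    print(h, toy_schedule(h, entropy_start=2.0, entropy_floor=0.5))
```

The linear interpolation only mirrors the stated direction of the shift (balance and hard focus early, suppression and medium focus late); any real schedule would need to be read from the paper itself.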

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

FADE gives a usable dynamic scheduler from a sign-difficulty decomposition of advantages, but the claimed orthogonality is unverified and experiment details are thin.

The main thing here is a two-axis decomposition of any advantage into sign balance and difficulty focus, which then drives a dynamic scheduler called FADE that adjusts gradient weights as training shifts from exploration to exploitation.

The decomposition organizes existing advantage methods under one view and supplies a concrete reason for moving away from static choices. FADE reads training dynamics to set the weights automatically. The reported results at 7B and 32B scales show earlier peaks on pass@1 and a better accuracy-diversity trade-off on LiveCodeBench and AIME, which matters for people actually running these post-training runs.

The framework is straightforward and the motivation from observed dynamics is clear. If the numbers hold, the practical payoff is real because it targets instability and collapse without adding much overhead.

The soft spot is the orthogonality claim. The stress-test note is right to flag that if sign imbalance and hard-problem focus tend to move together, the scheduler loses its independent control and the speed-ups could come from a different mechanism. The abstract also gives no variance numbers, baseline definitions, or ablation of the dynamic part, so it is difficult to judge robustness from what is shown.

This is for researchers doing RL post-training on LLMs who want a low-cost lever for stability. A reader already working at these scales will find the method easy to try.

Send it for peer review. The idea is concrete, the scale is relevant, and the open questions on the axes are fixable with the right experiments.

Referee Report

2 major / 1 minor

Summary. The paper claims that any advantage function can be decomposed into positive/negative gradient mass along two orthogonal axes (sign balance and difficulty focus), with training dynamics naturally shifting from balanced/hard-focus (exploration) to suppressed/medium-focus (exploitation). This decomposition motivates FADE, a self-adapting advantage that dynamically schedules gradient weights based on observed dynamics, yielding faster peak pass@1 (20k steps earlier at 7B scale, 2k at 32B) and superior accuracy-diversity trade-offs across pass@k on LiveCodeBench and AIME compared to static baselines.

Significance. If the orthogonality holds and the dynamic scheduler delivers robust gains without new instabilities, the work supplies a generalizable, dynamics-driven alternative to fixed advantage heuristics in LLM RL post-training, potentially improving sample efficiency and stability in a setting where diversity collapse is a known failure mode.

major comments (2)

[Unifying framework (decomposition into axes)] The unifying framework asserts that the sign-balance and difficulty-focus axes are orthogonal, enabling independent control of the two trade-offs via the dynamic scheduler; however, no verification (e.g., correlation analysis between axes or ablation of joint effects) is supplied, so if the axes are correlated in practice the claimed independent scheduling and explanatory power for the reported speed-ups would not hold.
[Experiments and results] The central empirical claims (earlier peak pass@1 and best accuracy-diversity Pareto front) rest on comparisons to static baselines, yet the manuscript supplies no definition of those baselines, number of seeds, statistical tests, or ablation removing the dynamic component, leaving open whether gains are attributable to the proposed mechanism.
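The correlation analysis requested in the first major comment could look like the following. The two per-step statistics, `sign_imbalance` and `difficulty_focus`, are hypothetical summaries that a training run would need to log; the values here are synthetic.

```python
import numpy as np

def axis_correlation(sign_imbalance, difficulty_focus):
    """Pearson correlation between two per-step axis statistics."""
    x = np.asarray(sign_imbalance, dtype=float)
    y = np.asarray(difficulty_focus, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])

# Synthetic trajectories: one pair independent, one pair strongly coupled.
rng = np.random.default_rng(0)
steps = 500
sign_imbalance = rng.normal(size=steps)
independent = rng.normal(size=steps)
coupled = 0.8 * sign_imbalance + 0.2 * rng.normal(size=steps)

print("independent axes:", axis_correlation(sign_imbalance, independent))
print("coupled axes:    ", axis_correlation(sign_imbalance, coupled))
```

A correlation near zero is necessary but not sufficient for independent control, so an ablation that varies one axis while holding the other fixed would still be needed.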

minor comments (1)

[Abstract] The abstract reports gains at two model scales (7B and 32B) and on two benchmarks (LiveCodeBench and AIME) but does not name the exact model families or confirm whether AIME is the full benchmark or a subset.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the unifying framework and experimental details. We address each major comment below and will revise the manuscript accordingly to strengthen the claims.

Referee: [Unifying framework (decomposition into axes)] The unifying framework asserts that the sign-balance and difficulty-focus axes are orthogonal, enabling independent control of the two trade-offs via the dynamic scheduler; however, no verification (e.g., correlation analysis between axes or ablation of joint effects) is supplied, so if the axes are correlated in practice the claimed independent scheduling and explanatory power for the reported speed-ups would not hold.

Authors: We agree that empirical verification of orthogonality would strengthen the presentation. The axes are constructed to be orthogonal by definition in the decomposition (sign-balance operates on the polarity of gradient contributions while difficulty-focus operates on the magnitude distribution of advantages, yielding independent control parameters). To confirm this holds in practice and rule out unintended correlations during training, we will add a correlation analysis between the two axes across training trajectories as well as an ablation isolating joint effects in the revised manuscript.

revision: yes

Referee: [Experiments and results] The central empirical claims (earlier peak pass@1 and best accuracy-diversity Pareto front) rest on comparisons to static baselines, yet the manuscript supplies no definition of those baselines, number of seeds, statistical tests, or ablation removing the dynamic component, leaving open whether gains are attributable to the proposed mechanism.

Authors: We acknowledge these omissions reduce clarity. In the revision we will explicitly define all static baselines, report the number of seeds used (three independent runs), include statistical tests comparing peak performance and Pareto fronts, and add an ablation that disables the dynamic scheduler while retaining the underlying advantage decomposition. These additions will make it possible to attribute the observed speed-ups and trade-off improvements directly to the proposed mechanism.

revision: yes
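With three seeds per method, as the rebuttal states, the resolution of any significance test is limited. The exact permutation test below is one possible choice, not necessarily the one the authors intend, and the steps-to-peak numbers are invented for illustration.

```python
from itertools import combinations

def permutation_p_value(a, b):
    """One-sided exact permutation test that mean(a) < mean(b).

    Enumerates every way to split the pooled values into groups of
    len(a) and len(b), and returns the fraction of splits whose mean
    difference is at least as extreme as the observed one.
    """
    pooled = list(a) + list(b)
    observed = sum(a) / len(a) - sum(b) / len(b)
    count = total = 0
    for idx in combinations(range(len(pooled)), len(a)):
        chosen = set(idx)
        ga = [pooled[i] for i in chosen]
        gb = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        diff = sum(ga) / len(ga) - sum(gb) / len(gb)
        if diff <= observed + 1e-12:
            count += 1
        total += 1
    return count / total

# Invented steps-to-peak for three seeds of each method.
dynamic_peak_steps = [40_000, 42_000, 41_000]
static_peak_steps = [60_000, 61_000, 59_000]
print(permutation_p_value(dynamic_peak_steps, static_peak_steps))
```

With three seeds per group there are only 20 possible splits, so the smallest attainable one-sided p-value is 1/20 = 0.05, even when the two methods are perfectly separated. More seeds would be needed for a stronger statement.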

Circularity Check

0 steps flagged

No circularity: framework is observational and empirical

The paper introduces a unifying framework decomposing advantages into positive/negative gradient mass along sign and difficulty axes, motivated explicitly by observed training dynamics shifting from exploration to exploitation. No equations, derivations, or fitted parameters appear in the provided text; FADE is presented as a self-adapting scheduler reading those dynamics rather than reducing to any input by construction. No self-citations are load-bearing for the core claims, and performance results are reported as empirical outcomes on LiveCodeBench and AIME. The derivation chain is self-contained against external benchmarks with no reduction to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Abstract-only review yields minimal ledger entries; the framework rests on standard RL policy-gradient assumptions and introduces FADE as a new scheduling rule without explicit free parameters or external validation shown.

axioms (1)

domain assumption: Policy gradient methods can be improved by reshaping advantages along sign and difficulty axes.
Central organizing claim of the unifying framework.

invented entities (1)

FADE (Focal Advantage with Dynamic Entropy) · no independent evidence
purpose: Self-adapting advantage that schedules gradient weights from training dynamics
New method introduced to address the identified trade-offs; no independent evidence supplied in abstract.

Reference graph

Works this paper leans on

76 extracted references · 48 canonical work pages · 23 internal anchors

[1]

Opencodereasoning-ii: A simple test time scaling approach via self-critique

Wasi Uddin Ahmad, Somshubra Majumdar, Aleksander Ficek, Sean Narenthiran, Mehrzad Samadi, Jocelyn Huang, Siddhartha Jain, Vahid Noroozi, and Boris Ginsburg. Opencodereasoning-ii: A simple test time scaling approach via self-critique. arXiv preprint arXiv:2507.09075, 2025

work page arXiv 2025
[2]

Back to basics: Revisiting reinforce-style optimization for learning from human feedback in llms

Arash Ahmadian, Chris Cremer, Matthias Gall \'e , Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet \"U st \"u n, and Sara Hooker. Back to basics: Revisiting reinforce-style optimization for learning from human feedback in llms. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages ...

2024
[3]

Variational best-of-n alignment

Afra Amini, Tim Vieira, Elliott Ash, and Ryan Cotterell. Variational best-of-n alignment. arXiv preprint arXiv:2407.06057, 2024

work page arXiv 2024
[4]

What matters in on-policy reinforcement learning? a large-scale empirical study

Marcin Andrychowicz, Anton Raichuk, Piotr Sta \'n czyk, Manu Orsini, Sertan Girgin, Rapha \"e l Marinier, L \'e onard Hussenot, Matthieu Geist, Olivier Pietquin, Marcin Michalski, et al. What matters in on-policy reinforcement learning? a large-scale empirical study. In ICLR 2021-Ninth International Conference on Learning Representations, 2021

2021
[5]

Asymmetric reinforce for off-policy reinforcement learning: Balancing positive and negative rewards

Charles Arnal, Ga \"e tan Narozniak, Vivien Cabannes, Yunhao Tang, Julia Kempe, and Remi Munos. Asymmetric reinforce for off-policy reinforcement learning: Balancing positive and negative rewards. Advances in Neural Information Processing Systems, 38: 0 9640--9664, 2026

2026
[6]

Advantage updating

Leemon C Baird. Advantage updating. Technical report, Wright Laboratory, 1993

1993
[7]

Why pass@ k optimization can degrade pass@ 1: Prompt interference in llm post-training

Anas Barakat, Souradip Chakraborty, Khushbu Pahwa, and Amrit Singh Bedi. Why pass@ k optimization can degrade pass@ 1: Prompt interference in llm post-training. arXiv preprint arXiv:2602.21189, 2026

work page arXiv 2026
[8]

Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V Le, Christopher R \'e , and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling. arXiv preprint arXiv:2407.21787, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[9]

On predictability of reinforcement learning dynamics for large language models

Yuchen Cai, Ding Cao, Xin Xu, Zijun Yao, Yuqing Huang, Zhenyu Tan, Benyi Zhang, Guangzhong Sun, Guiquan Liu, and Junfeng Fang. On predictability of reinforcement learning dynamics for large language models. arXiv preprint arXiv:2510.00553, 2025

work page arXiv 2025
[10]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[11]

Pass@k training for adaptively balancing exploration and exploitation of large reasoning models

Zhipeng Chen, Xiaobo Qin, Youbin Wu, Yue Ling, Qinghao Ye, Wayne Xin Zhao, and Guang Shi. Pass@ k training for adaptively balancing exploration and exploitation of large reasoning models. arXiv preprint arXiv:2508.10751, 2025

work page arXiv 2025
[12]

The Cancellation Hypothesis in Critic-Free RL: From Outcome Rewards to Token Credits

Tianhao Cheng, Zeyu Huang, Zihan Qiu, Yu Cheng, Edoardo Ponti, Yinghui Xu, Ivan Titov, and Zenglin Xu. The cancellation hypothesis in critic-free rl: From outcome rewards to token credits. arXiv preprint arXiv:2605.08666, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[13]

Inference-aware fine-tuning for best-of-n sampling in large language models, 2025

Yinlam Chow, Guy Tennenholtz, Izzeddin Gur, Vincent Zhuang, Bo Dai, Sridhar Thiagarajan, Craig Boutilier, Rishabh Agarwal, Aviral Kumar, and Aleksandra Faust. Inference-aware fine-tuning for best-of-n sampling in large language models. arXiv preprint arXiv:2412.15287, 2024

work page arXiv 2024
[14]

The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, et al. The entropy mechanism of reinforcement learning for reasoning language models. arXiv preprint arXiv:2505.22617, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[15]

Distributional reinforcement learning with quantile regression

Will Dabney, Mark Rowland, Marc Bellemare, and R \'e mi Munos. Distributional reinforcement learning with quantile regression. In Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018

2018
[16]

What is the objective of reasoning with reinforcement learning? arXiv preprint arXiv:2510.13651, 2025

Damek Davis and Benjamin Recht. What is the objective of reasoning with reinforcement learning? arXiv preprint arXiv:2510.13651, 2025

work page arXiv 2025
[17]

FAIR CodeGen team , :, Jade Copet, Quentin Carbonneaux, Gal Cohen, Jonas Gehring, Jacob Kahn, Jannik Kossen, Felix Kreuk, Emily McMilin, Michel Meyer, Yuxiang Wei, David Zhang, Kunhao Zheng, Jordi Armengol-Estapé, Pedram Bashiri, Maximilian Beck, Pierre Chambon, Abhishek Charnalia, Chris Cummins, Juliette Decugis, Zacharias V. Fisches, François Fleuret, F...

work page arXiv 2025
[18]

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z

Yulu Gan and Phillip Isola. Neural thickets: Diverse task experts are dense around pretrained weights. arXiv preprint arXiv:2603.12228, 2026

work page arXiv 2026
[19]

The peril of preference: Why grpo fails on ordinal rewards

Anisha Garg and Ganesh Venkatesh. The peril of preference: Why grpo fails on ordinal rewards. arXiv preprint arXiv:2511.04439, 2025

work page arXiv 2025
[20]

Rlef: Grounding code llms in execution feedback with reinforcement learning

Jonas Gehring, Kunhao Zheng, Jade Copet, Vegard Mella, Taco Cohen, and Gabriel Synnaeve. Rlef: Grounding code llms in execution feedback with reinforcement learning. In International Conference on Machine Learning, pages 19034--19055. PMLR, 2025

2025
[21]

Variance reduction techniques for gradient estimates in reinforcement learning

Evan Greensmith, Peter L Bartlett, and Jonathan Baxter. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5 0 (Nov): 0 1471--1530, 2004

2004
[22]

Deepseek-r1 incentivizes reasoning in llms through reinforcement learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature, 645 0 (8081): 0 633--638, 2025

2025
[23]

Rewarding the unlikely: Lifting grpo beyond distribution sharpening

Andre Wang He, Daniel Fried, and Sean Welleck. Rewarding the unlikely: Lifting grpo beyond distribution sharpening. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 25559--25571, 2025

2025
[24]

Scaling laws for single-agent reinforcement learning

Jacob Hilton, Jie Tang, and John Schulman. Scaling laws for single-agent reinforcement learning. arXiv preprint arXiv:2301.13442, 2023

work page arXiv 2023
[25]

Self-improvement in language models: The sharpening mechanism

Audrey Huang, Adam Block, Dylan Foster, Dhruv Rohatgi, Cyril Zhang, Max Simchowitz, Jordan Ash, and Akshay Krishnamurthy. Self-improvement in language models: The sharpening mechanism. In International Conference on Learning Representations, volume 2025, pages 76687--76739, 2025

2025
[26]

Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et al. Qwen2. 5-coder technical report. arXiv preprint arXiv:2409.12186, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[27]

Risk-sensitive rl for alleviating exploration dilemmas in large language models

Yuhua Jiang, Jiawei Huang, Yufeng Yuan, Xin Mao, Yu Yue, Qianchuan Zhao, and Lin Yan. Risk-sensitive rl for alleviating exploration dilemmas in large language models. arXiv preprint arXiv:2509.24261, 2025

work page arXiv 2025
[28]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2001
[29]

Reasoning with Sampling: Your Base Model is Smarter Than You Think

Aayush Karan and Yilun Du. Reasoning with sampling: Your base model is smarter than you think. arXiv preprint arXiv:2510.14901, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[30]

The Art of Scaling Reinforcement Learning Compute for LLMs

Devvrit Khatri, Lovish Madaan, Rishabh Tiwari, Rachit Bansal, Sai Surya Duvvuri, Manzil Zaheer, Inderjit S Dhillon, David Brandfonbrener, and Rishabh Agarwal. The art of scaling reinforcement learning compute for llms. arXiv preprint arXiv:2510.13786, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[31]

Mc-grpo: Median-centered group relative policy optimization for small-rollout reinforcement learning

Youngeun Kim. Mc-grpo: Median-centered group relative policy optimization for small-rollout reinforcement learning. arXiv preprint arXiv:2601.22582, 2026

work page arXiv 2026
[32]

Kimi Team , Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1. 5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[33]

Buy 4 reinforce samples, get a baseline for free! arXiv preprint arXiv:1901.10280, 2019

Wouter Kool, Herke van Hoof, and Max Welling. Buy 4 reinforce samples, get a baseline for free! arXiv preprint arXiv:1901.10280, 2019

work page arXiv 1901
[34]

Implicit under-parameterization inhibits data-efficient deep reinforcement learning

Aviral Kumar, Rishabh Agarwal, Dibya Ghosh, and Sergey Levine. Implicit under-parameterization inhibits data-efficient deep reinforcement learning. arXiv preprint arXiv:2010.14498, 2020

work page arXiv 2010
[35]

Learning shrinks the hard tail: Training-dependent inference scaling in a solvable linear model

Noam Levi. Learning shrinks the hard tail: Training-dependent inference scaling in a solvable linear model. arXiv preprint, 2026

2026
[36]

Taco: Topics in algorithmic code generation dataset

Rongao Li, Jie Fu, Bo-Wen Zhang, Tao Huang, Zhihong Sun, Chen Lyu, Guang Liu, Zhi Jin, and Ge Li. Taco: Topics in algorithmic code generation dataset. arXiv preprint arXiv:2312.14852, 2023

work page arXiv 2023
[37]

Competition-level code generation with alphacode

Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, R \'e mi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. Competition-level code generation with alphacode. Science, 378 0 (6624): 0 1092--1097, 2022

2022
[38]

Let's verify step by step

Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. In The Twelfth International Conference on Learning Representations, 2024

2024
[39]

Focal loss for dense object detection

Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Doll \'a r. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980--2988, 2017

2017
[40]

Boosting LLM Reasoning via Human-Inspired Reward Shaping

Wenze Lin, Zhen Yang, Xitai Jiang, Pony Ma, and Gao Huang. Thickening-to-thinning: Reward shaping via human-inspired learning dynamics for llm reasoning. arXiv preprint arXiv:2602.04265, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[42]

Understanding R1-Zero-Like Training: A Critical Perspective

Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective. arXiv preprint arXiv:2503.20783, 2025 b

work page internal anchor Pith review Pith/arXiv arXiv 2025
[43]

Simulation-based optimization of markov reward processes

Peter Marbach and John N Tsitsiklis. Simulation-based optimization of markov reward processes. IEEE Transactions on Automatic Control, 46 0 (2): 0 191--209, 2001

2001
[44]

Steps toward artificial intelligence

Marvin Minsky. Steps toward artificial intelligence. Proceedings of the IRE, 49 0 (1): 0 8--30, 1961

1961
[45]

Variational inference for monte carlo objectives

Andriy Mnih and Danilo Rezende. Variational inference for monte carlo objectives. In International Conference on Machine Learning, pages 2188--2196. PMLR, 2016

2016
[46]

No representation, no trust: Connecting representation, collapse, and trust issues in ppo

Skander Moalla, Andrea Miele, Daniil Pyatko, Razvan Pascanu, and Caglar Gulcehre. No representation, no trust: Connecting representation, collapse, and trust issues in ppo. Advances in Neural Information Processing Systems, 37: 0 69652--69699, 2024

2024
[47]

Aimo-2 winning solution: Building state- of-the-art mathematical reasoning models with openmathreasoning dataset

Ivan Moshkov, Darragh Hanley, Ivan Sorokin, Shubham Toshniwal, Christof Henkel, Benedikt Schifferer, Wei Du, and Igor Gitman. Aimo-2 winning solution: Building state-of-the-art mathematical reasoning models with openmathreasoning dataset. arXiv preprint arXiv:2504.16891, 2025

work page arXiv 2025
[48]

Asynchronous rlhf: Faster and more efficient off-policy rl for language models

Michael Noukhovitch, Shengyi Huang, Sophie Xhonneux, Arian Hosseini, Rishabh Agarwal, and Aaron Courville. Asynchronous rlhf: Faster and more efficient off-policy rl for language models. In International Conference on Learning Representations, volume 2025, pages 4003--4029, 2025

2025
[49]

Learning to reason with LLM s

OpenAI . Learning to reason with LLM s. https://openai.com/index/learning-to-reason-with-llms/, September 2024

2024
[50]

Clip-low increases entropy and clip-high decreases entropy in reinforcement learning of large language models

Jaesung R Park, Junsu Kim, Gyeongman Kim, Jinyoung Jo, Sean Choi, Jaewoong Cho, and Ernest K Ryu. Clip-low increases entropy and clip-high decreases entropy in reinforcement learning of large language models. arXiv preprint arXiv:2509.26114, 2025

work page arXiv 2025
[51]

Beyond the Sampled Token: Preserving Candidate Support in RLVR

Ruotian Peng, Yi Ren, Zhouliang Yu, Weiyang Liu, and Yandong Wen. Simko: Simple pass@ k policy optimization. arXiv preprint arXiv:2510.14807, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[52]

F-GRPO: Don't Let Your Policy Learn the Obvious and Forget the Rare

Daniil Plyusov, Alexey Gorbatovski, Boris Shaposhnikov, Viacheslav Sinii, Alexey Malakhov, and Daniil Gavrilov. F-grpo: Don't let your policy learn the obvious and forget the rare. arXiv preprint arXiv:2602.06717, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[53]

Pope: Learning to reason on hard problems via privileged on-policy exploration, 2026

Yuxiao Qu, Amrith Setlur, Virginia Smith, Ruslan Salakhutdinov, and Aviral Kumar. Pope: Learning to reason on hard problems via privileged on-policy exploration. arXiv preprint arXiv:2601.18779, 2026

work page arXiv 2026
[54]

How do large language monkeys get their power (laws)? In International Conference on Machine Learning, pages 53132--53176

Rylan Schaeffer, Joshua Kazdan, John Hughes, Jordan Juravsky, Sara Price, Aengus Lynch, Erik Jones, Robert Kirk, Azalia Mirhoseini, and Sanmi Koyejo. How do large language monkeys get their power (laws)? In International Conference on Machine Learning, pages 53132--53176. PMLR, 2025

2025
[55]

High-Dimensional Continuous Control Using Generalized Advantage Estimation

John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[56]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[57]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[58]

Actor-critic policy optimization in partially observable multiagent environments

Sriram Srinivasan, Marc Lanctot, Vinicius Zambaldi, Julien P \'e rolat, Karl Tuyls, R \'e mi Munos, and Michael Bowling. Actor-critic policy optimization in partially observable multiagent environments. Advances in Neural Information Processing Systems, 31, 2018

2018
[59]

On the optimization dynamics of RLVR: Gradient gap and step size thresholds

Joe Suk and Yaqi Duan. On the optimization dynamics of rlvr: Gradient gap and step size thresholds. arXiv preprint arXiv:2510.08539, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[60]

Learning to predict by the methods of temporal differences

Richard S Sutton. Learning to predict by the methods of temporal differences. Machine learning, 3 0 (1): 0 9--44, 1988

1988
[61]

Sutton, David McAllester, Satinder Singh, and Yishay Mansour

Richard S. Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Proceedings of the 13th International Conference on Neural Information Processing Systems (NIPS), pages 1057--1063, Cambridge, MA, USA, 1999. MIT Press

[62]

Maximum likelihood reinforcement learning

Fahim Tajwar, Guanning Zeng, Yueer Zhou, Yuda Song, Daman Arora, Yiding Jiang, Jeff Schneider, Ruslan Salakhutdinov, Haiwen Feng, and Andrea Zanette. Maximum likelihood reinforcement learning. arXiv preprint arXiv:2602.02710, 2026

[63]

Scaling Behaviors of LLM Reinforcement Learning Post-Training: An Empirical Study in Mathematical Reasoning

Zelin Tan, Hejia Geng, Xiaohang Yu, Mulei Zhang, Guancheng Wan, Yifan Zhou, Qiang He, Xiangyuan Xue, Heng Zhou, Yutao Fan, et al. Scaling behaviors of LLM reinforcement learning post-training: An empirical study in mathematical reasoning. arXiv preprint arXiv:2509.25300, 2025

[64]

Optimizing language models for inference time objectives using reinforcement learning

Yunhao Tang, Kunhao Zheng, Gabriel Synnaeve, and Rémi Munos. Optimizing language models for inference time objectives using reinforcement learning. arXiv preprint arXiv:2503.19595, 2025

[65]

Mai-thinking-1: Building a hill-climbing machine

The Microsoft AI Team. Mai-thinking-1: Building a hill-climbing machine. Technical report, Microsoft AI, 2026. https://microsoft.ai/pdf/mai-thinking-1.pdf

[66]

Advantage shaping as surrogate reward maximization: Unifying pass@k policy gradients

Christos Thrampoulidis, Sadegh Mahdavi, and Wenlong Deng. Advantage shaping as surrogate reward maximization: Unifying pass@k policy gradients. arXiv preprint arXiv:2510.23049, 2025

[67]

Pass@K Policy Optimization: Solving Harder Reinforcement Learning Problems

Christian Walder and Deep Karkhanis. Pass@k policy optimization: Solving harder reinforcement learning problems. arXiv preprint arXiv:2505.15201, 2025

[68]

Self-consistency improves chain of thought reasoning in language models

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations, 2023

[69]

Dueling network architectures for deep reinforcement learning

Ziyu Wang, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, and Nando de Freitas. Dueling network architectures for deep reinforcement learning. In International Conference on Machine Learning, pages 1995--2003. PMLR, 2016

[70]

Simple statistical gradient-following algorithms for connectionist reinforcement learning

Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3--4):229--256, 1992. doi:10.1007/BF00992696

[71]

Efficient reinforcement learning with large language model priors

Xue Yan, Yan Song, Xidong Feng, Mengyue Yang, Haifeng Zhang, Haitham Bou Ammar, and Jun Wang. Efficient reinforcement learning with large language model priors. In International Conference on Learning Representations, volume 2025, pages 48691--48715, 2025

[72]

Your group-relative advantage is biased

Fengkai Yang, Zherui Chen, Xiaohan Wang, Xiaodong Lu, Jiajun Chai, Guojun Yin, Wei Lin, Shuai Ma, Fuzhen Zhuang, Deqing Wang, et al. Your group-relative advantage is biased. arXiv preprint arXiv:2601.08521, 2026

[73]

On the Implicit Reward Overfitting and the Low-rank Dynamics in RLVR

Hao Ye, Jisheng Dang, Junfeng Fang, Bimei Wang, Yizhou Zhang, Ning Lv, Wencan Zhang, Hong Peng, Bin Hu, and Tat-Seng Chua. On the implicit reward overfitting and the low-rank dynamics in RLVR. arXiv preprint arXiv:2605.06523, 2026

[74]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025

[75]

From Reasoning to Agentic: Credit Assignment in Reinforcement Learning for Large Language Models

Chenchen Zhang. From reasoning to agentic: Credit assignment in reinforcement learning for large language models. arXiv preprint arXiv:2604.09459, 2026

[76]

AEM: Adaptive Entropy Modulation for Multi-Turn Agentic Reinforcement Learning

Haotian Zhao, Songlin Zhou, Yuxin Zhang, Stephen S-T Yau, Wenyu Zhang, Lun Tian, Tianshu Zhu, Yifeng Huang, Yucheng Zeng, Jingnan Gu, et al. AEM: Adaptive entropy modulation for multi-turn agentic reinforcement learning. arXiv preprint arXiv:2605.00425, 2026

[77]

The surprising effectiveness of negative reinforcement in LLM reasoning

Xinyu Zhu, Mengzhou Xia, Zhepei Wei, Wei-Lin Chen, Danqi Chen, and Yu Meng. The surprising effectiveness of negative reinforcement in LLM reasoning. arXiv preprint arXiv:2506.01347, 2025

