KLip-PPO: A per-sample KL perspective on PPO-Clip
Pith reviewed 2026-06-26 08:29 UTC · model grok-4.3
The pith
PPO-Clip's gradient is exactly reproduced by a per-sample KL surrogate whose coefficient depends on the importance ratio and advantage.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The gradient of the clipped surrogate is reproduced exactly by a Kullback-Leibler surrogate whose coefficient varies per sample, with closed-form dependence on the importance ratio and the advantage. The identity holds at every minibatch step and across the entire inner loop, and on five MuJoCo continuous-control benchmarks the two losses produce indistinguishable training curves. The reformulation exposes a structural feature of the clipped surrogate that the min notation hides.
What carries the argument
The per-sample KL coefficient, a closed-form function of the importance ratio and advantage that makes the KL surrogate gradient identical to the clipped surrogate gradient at every update.
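As a hedged sketch of how such a coefficient can arise, the case analysis below starts from the per-sample form quoted near the end of this page, writing w_t for the importance ratio and \hat{A}_t for the advantage. Three things are assumptions made for illustration rather than statements taken from the paper: the reading of the index set I_kill as "samples on which the clip is active", the explicit clip range \epsilon, and the treatment of \beta_t as a constant under differentiation.

```latex
% Assumption: I_kill collects the samples on which clipping is active.
\begin{align*}
\nabla_{\theta'} \min\!\big(w_t \hat{A}_t,\ \mathrm{clip}(w_t, 1-\epsilon, 1+\epsilon)\,\hat{A}_t\big)
&= \begin{cases}
0, & (\hat{A}_t > 0,\ w_t > 1+\epsilon)\ \text{or}\ (\hat{A}_t < 0,\ w_t < 1-\epsilon),\\
w_t \hat{A}_t\, \nabla_{\theta'} \log \pi_{\theta'}(a_t \mid s_t), & \text{otherwise},
\end{cases}\\[4pt]
\nabla_{\theta'} \big( w_t \hat{A}_t + \beta_t \log \pi_{\theta'}(a_t \mid s_t) \big)
&= (w_t \hat{A}_t + \beta_t)\, \nabla_{\theta'} \log \pi_{\theta'}(a_t \mid s_t)
\qquad (\beta_t \text{ held fixed}),\\[4pt]
\beta_t &= -\, w_t \hat{A}_t \cdot \mathbf{1}\big[t \in I_{\mathrm{kill}}\big].
\end{align*}
```

The second line uses \nabla_{\theta'} w_t = w_t \nabla_{\theta'} \log \pi_{\theta'}(a_t \mid s_t). With this \beta_t, the bracket (w_t \hat{A}_t + \beta_t) is zero exactly on the clipped samples and equals w_t \hat{A}_t elsewhere, which reproduces the two cases of the first line.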
If this is right
- The min notation in the clipped surrogate hides an implicit per-sample penalty that is a step function at the trust region boundary.
- The shape of this per-sample coefficient is the natural design axis for generalising the PPO algorithm.
- The two loss formulations produce identical policy updates and training dynamics on standard benchmarks.
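The step-function shape mentioned in the first bullet can be illustrated with a small Python sketch. The function name `beta`, the clip range `eps=0.2`, and the rule for when a sample counts as clipped are assumptions chosen for illustration; the page does not spell them out.

```python
def beta(w, adv, eps=0.2):
    """Per-sample coefficient: -w * adv where the clip is active, else 0.

    Assumption: the clip is 'active' when the ratio has left the trust
    region in the direction the advantage pushes it.
    """
    killed = (adv > 0 and w > 1 + eps) or (adv < 0 and w < 1 - eps)
    return -w * adv if killed else 0.0


# Sweep the ratio across the upper boundary 1 + eps for a positive advantage.
ratios = [1.0, 1.1, 1.19, 1.21, 1.3]
coeffs = [beta(w, adv=1.0) for w in ratios]
for w, b in zip(ratios, coeffs):
    print(f"w={w:.2f}  beta={b:+.2f}")
```

In this sketch the coefficient is zero everywhere inside the trust region and jumps to -w * adv immediately outside it, with no transition zone in between.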
Where Pith is reading between the lines
- This equivalence suggests designing new PPO variants by selecting different functional forms for the per-sample coefficient.
- It may account for the similar performance often seen between clipped and KL versions of PPO in practice.
- The identity could be checked on discrete-action tasks or non-MuJoCo environments to test its scope.
Load-bearing premise
The derivation assumes the standard PPO advantage estimator and the usual definition of the importance ratio.
What would settle it
If the gradient of the per-sample KL loss with the derived coefficient fails to match the gradient of the clipped loss on any minibatch, the claimed identity is false.
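A minimal way to run this kind of check on individual samples is sketched below in plain Python. It compares finite-difference derivatives of the two per-sample surrogates with respect to the new log-probability; by the chain rule both parameter gradients are this scalar times the same score vector. The function names, the clip range, the sample values, and the choice to freeze beta during differentiation are all assumptions for illustration, not the paper's code.

```python
import math

EPS_CLIP = 0.2  # assumed clip range


def clip_surrogate(logp, logp_old, adv):
    """Per-sample PPO-Clip surrogate as a function of the new log-prob."""
    w = math.exp(logp - logp_old)
    w_clipped = min(max(w, 1 - EPS_CLIP), 1 + EPS_CLIP)
    return min(w * adv, w_clipped * adv)


def beta(logp, logp_old, adv):
    """Assumed coefficient: -w * adv on clipped samples, 0 otherwise."""
    w = math.exp(logp - logp_old)
    killed = (adv > 0 and w > 1 + EPS_CLIP) or (adv < 0 and w < 1 - EPS_CLIP)
    return -w * adv if killed else 0.0


def kl_form_surrogate(logp, logp_old, adv, beta_const):
    """Per-sample surrogate w * adv + beta * log pi, with beta held fixed."""
    w = math.exp(logp - logp_old)
    return w * adv + beta_const * logp


def central_diff(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)


def gradient_gap(logp, logp_old, adv):
    b = beta(logp, logp_old, adv)  # frozen before differentiating
    g_clip = central_diff(lambda x: clip_surrogate(x, logp_old, adv), logp)
    g_kl = central_diff(lambda x: kl_form_surrogate(x, logp_old, adv, b), logp)
    return abs(g_clip - g_kl)


# (new log-prob, old log-prob, advantage); all chosen away from the clip kinks.
samples = [
    (-1.0, -1.0, 1.0),   # inside the trust region
    (-0.7, -1.0, 1.0),   # ratio too high, positive advantage: clipped
    (-1.4, -1.0, -2.0),  # ratio too low, negative advantage: clipped
    (-0.9, -1.0, -0.5),  # inside the trust region, negative advantage
    (-1.3, -1.0, 0.5),   # ratio too low but positive advantage: not clipped
]
max_gap = max(gradient_gap(*s) for s in samples)
print(f"largest per-sample gradient gap: {max_gap:.2e}")
```

A gap well above finite-difference noise on any such sample would count against the identity under these assumptions; samples sitting exactly on the boundary are excluded because the clipped surrogate is not differentiable there.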
Original abstract
Proximal Policy Optimization (PPO) is the standard policy-gradient algorithm for on-policy reinforcement learning. The literature presents it in two forms, a clipped surrogate that bounds the importance ratio between successive policies and a Kullback-Leibler penalty between them. These forms are treated as separate algorithms with their own gradients, their own hyperparameters, and their own reference implementations, and a sizeable body of empirical work compares them. We show that the gradient of the clipped surrogate is reproduced exactly by a Kullback-Leibler surrogate whose coefficient varies per sample, with closed-form dependence on the importance ratio and the advantage. The identity holds at every minibatch step and across the entire inner loop, and on five MuJoCo continuous-control benchmarks the two losses produce indistinguishable training curves. The reformulation exposes a structural feature of the clipped surrogate that the min notation hides. PPO-Clip's implicit per-sample penalty is a step function at the boundary of the trust region, and the shape of this coefficient is the natural design axis for generalising the algorithm. We sketch the resulting follow-up directions in the discussion.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that the gradient of the PPO-Clip surrogate is exactly reproduced by a per-sample KL-penalized surrogate whose coefficient beta is a closed-form function of the importance ratio r and advantage A; this identity is asserted to hold at every minibatch step, and the resulting losses produce indistinguishable training curves on five MuJoCo continuous-control benchmarks. The reformulation is presented as exposing the clipped surrogate's implicit step-function penalty at the trust-region boundary.
Significance. If the claimed algebraic identity holds under the definitions and estimators actually used in the MuJoCo experiments, the work supplies a useful structural reinterpretation of PPO-Clip that could guide the design of generalized trust-region penalties. The empirical indistinguishability on standard benchmarks is consistent with the claim, though the absence of released code or step-by-step derivation limits immediate verification and reuse.
major comments (2)
- [Abstract, §3] Abstract and §3: The central claim of an 'exact identity' between the clipped-surrogate gradient and the per-sample KL gradient is asserted without any displayed algebraic steps or intermediate expressions for beta(r, A). Because this identity is the load-bearing result, the derivation (including the precise form of the KL term) must be supplied before the claim can be assessed.
- [§4] §4 (MuJoCo experiments): The reported indistinguishable curves are obtained with Gaussian policies, for which the conventional KL surrogate is the analytic divergence between state-conditional Gaussians. This analytic KL gradient does not factor as a scalar multiple of the action-dependent score function grad log pi(a|s), so no scalar beta can equate the vectors in general; the identity therefore cannot hold for the standard analytic KL unless the paper instead employs the Monte Carlo estimator r log r throughout the comparison.
minor comments (1)
- [§3] Notation for the per-sample coefficient should be introduced with an explicit equation number rather than inline prose.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We respond to each major comment below. Where the comments indicate that additional material is needed for clarity, we have revised the manuscript accordingly.
-
Referee: [Abstract, §3] Abstract and §3: The central claim of an 'exact identity' between the clipped-surrogate gradient and the per-sample KL gradient is asserted without any displayed algebraic steps or intermediate expressions for beta(r, A). Because this identity is the load-bearing result, the derivation (including the precise form of the KL term) must be supplied before the claim can be assessed.
Authors: We agree that the algebraic derivation of the identity and the closed-form expression for beta(r, A) should be presented explicitly in the main text. In the revised manuscript, we have added a dedicated subsection in §3 that starts from the gradient of the PPO-Clip surrogate and shows how it can be rewritten as the gradient of a per-sample KL term, with beta derived as a function of r and A. The KL term used is the Monte Carlo estimator, consistent with the importance-sampling gradient. revision: yes
-
Referee: [§4] §4 (MuJoCo experiments): The reported indistinguishable curves are obtained with Gaussian policies, for which the conventional KL surrogate is the analytic divergence between state-conditional Gaussians. This analytic KL gradient does not factor as a scalar multiple of the action-dependent score function grad log pi(a|s), so no scalar beta can equate the vectors in general; the identity therefore cannot hold for the standard analytic KL unless the paper instead employs the Monte Carlo estimator r log r throughout the comparison.
Authors: We thank the referee for highlighting this important distinction. The per-sample KL surrogate in our work uses the Monte Carlo estimator of the form r log r (specifically, the term that aligns with the policy-gradient structure) rather than the analytic KL divergence. This choice is what allows the gradients to be equated via the closed-form beta. We have clarified this in the revised §4 and added a note in the methods section explaining why the sample-based estimator is needed for the identity to hold. The experiments remain valid because PPO-Clip and KLip-PPO use consistent estimators. revision: yes
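The referee's point about the analytic Gaussian KL can be made concrete with a one-dimensional example. The sketch below is an illustration under stated assumptions, not the paper's setup: a scalar Gaussian policy parameterised by (mu, log sigma), the KL direction KL(old || new), and hypothetical function names. It shows that the analytic KL gradient does not depend on the sampled action while the score function does, so the two vectors are generally not parallel.

```python
def analytic_kl_grad(mu, sigma, mu_old, sigma_old):
    """Gradient of KL(N(mu_old, sigma_old^2) || N(mu, sigma^2))
    with respect to (mu, log sigma). No action appears anywhere."""
    d_mu = (mu - mu_old) / sigma**2
    d_log_sigma = 1.0 - (sigma_old**2 + (mu_old - mu) ** 2) / sigma**2
    return (d_mu, d_log_sigma)


def score(a, mu, sigma):
    """Gradient of log N(a; mu, sigma^2) with respect to (mu, log sigma)."""
    z2 = (a - mu) ** 2 / sigma**2
    return ((a - mu) / sigma**2, z2 - 1.0)


def cross(u, v):
    """2-D cross product; zero if and only if u and v are parallel."""
    return u[0] * v[1] - u[1] * v[0]


g_kl = analytic_kl_grad(mu=0.5, sigma=1.0, mu_old=0.0, sigma_old=1.0)
for action in (2.0, -1.0):
    s = score(action, mu=0.5, sigma=1.0)
    print(f"a={action:+.1f}  score={s}  cross(g_kl, score)={cross(g_kl, s)}")
```

A nonzero cross product means no scalar beta can turn the score vector into the analytic KL gradient for that sample, which is consistent with the rebuttal's switch to a sample-based estimator.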
Circularity Check
Algebraic identity derived directly from surrogate definitions; no circularity.
The central claim is an explicit algebraic equivalence: the gradient of the PPO-Clip surrogate equals the gradient of a KL surrogate scaled by a per-sample coefficient beta(r, A) whose closed form is obtained by matching the two gradient expressions. This matching is performed by direct differentiation of the given loss definitions (importance ratio r, advantage A, clip operator) without any fitted parameters, data-dependent tuning, or load-bearing self-citations. The MuJoCo curves are presented only as empirical corroboration after the identity is stated. No step reduces to a renaming, a fitted input relabeled as prediction, or an ansatz imported via prior work by the same authors. The derivation is therefore self-contained against the paper's own equations.
Axiom & Free-Parameter Ledger
axioms (1)
- [standard math] Standard definitions of the PPO clipped surrogate objective and the KL penalty objective, as used in reference implementations.
Reference graph
Works this paper leans on
-
[1]
What matters in on-policy reinforcement learning? a large-scale empirical study
Marcin Andrychowicz, Anton Raichuk, Piotr Stańczyk, Manu Orsini, Sertan Girgin, Raphael Marinier, Léonard Hussenot, Matthieu Geist, Olivier Pietquin, Marcin Michalski, Sylvain Gelly, and Olivier Bachem. What matters in on-policy reinforcement learning? a large-scale empirical study. In International Conference on Learning Representations (ICLR), 2021
2021
-
[2]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025
2025
-
[3]
Madeleine Dwyer, Adam Sobey, and Adriane Chapman. It’s not you, it’s clipping: A soft trust-region via probability smoothing for LLM RL. arXiv preprint arXiv:2509.21282, 2025
-
[4]
Implementation matters in deep RL: A case study on PPO and TRPO
Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Firdaus Janoos, Larry Rudolph, and Aleksander Madry. Implementation matters in deep RL: A case study on PPO and TRPO. In International Conference on Learning Representations (ICLR), 2020
2020
-
[5]
Nicolas Heess, Dhruva TB, Srinivasan Sriram, Jay Lemmon, Josh Merel, Greg Wayne, Yuval Tassa, Tom Erez, Ziyu Wang, S. M. Ali Eslami, Martin Riedmiller, and David Silver. Emergence of locomotion behaviours in rich environments. arXiv preprint arXiv:1707.02286, 2017
2017
-
[6]
Revisiting design choices in proximal policy optimization
Chloe Ching-Yun Hsu, Celestine Mendler-Dünner, and Moritz Hardt. Revisiting design choices in proximal policy optimization. arXiv preprint arXiv:2009.10897, 2020.
-
[7]
Shengyi Huang, Rousslan Fernand Julien Dossa, Chang Ye, Jeff Braga, Dipam Chakraborty, Kinal Mehta, and João G. M. Araújo. CleanRL: High-quality single-file implementations of deep reinforcement learning algorithms. Journal of Machine Learning Research, 23(274):1–18, 2022
2022
-
[8]
A closer look at deep policy gradients
Andrew Ilyas, Logan Engstrom, Shibani Santurkar, Dimitris Tsipras, Firdaus Janoos, Larry Rudolph, and Aleksander Madry. A closer look at deep policy gradients. In International Conference on Learning Representations (ICLR), 2020
2020
-
[9]
Approximately optimal approximate reinforcement learning
Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In Proceedings of the 19th International Conference on Machine Learning (ICML), pages 267–274, 2002
2002
-
[10]
Kezhao Liu, Jason Klein Liu, Mingtao Chen, and Yiming Liu. Rethinking KL regularization in RLHF: From value estimation to gradient optimization. arXiv preprint arXiv:2510.01555, 2025
-
[11]
Training language models to follow instructions with human feedback
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022
2022
-
[12]
Stable-baselines3: Reliable reinforcement learning implementations
Antonin Raffin, Ashley Hill, Adam Gleave, Anssi Kanervisto, Maximilian Ernestus, and Noah Dormann. Stable-baselines3: Reliable reinforcement learning implementations. Journal of Machine Learning Research, 22(268):1–8, 2021
2021
-
[13]
Trust region policy optimization
John Schulman, Sergey Levine, Pieter Abbeel, Michael I. Jordan, and Philipp Moritz. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pages 1889–1897, 2015
2015
-
[14]
High-dimensional continuous control using generalized advantage estimation
John Schulman, Philipp Moritz, Sergey Levine, Michael I. Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. In International Conference on Learning Representations (ICLR), 2016
2016
-
[15]
Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017
2017
-
[16]
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024
2024
-
[17]
You may not need ratio clipping in PPO
Mingfei Sun, Vitaly Kurin, Guoqing Liu, Sam Devlin, Tao Qin, Katja Hofmann, and Shimon Whiteson. You may not need ratio clipping in PPO. arXiv preprint arXiv:2202.00079, 2022
-
[18]
Reinforcement Learning: An Introduction
Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2nd edition, 2018
2018
-
[19]
Policy gradient methods for reinforcement learning with function approximation
Richard S. Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems (NIPS), 2000
2000
-
[20]
MuJoCo: A physics engine for model-based control
Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5026–5033, 2012
2012
-
[21]
Trust region-guided proximal policy optimization
Yuhui Wang, Hao He, Xiaoyang Tan, and Yaozhong Gan. Trust region-guided proximal policy optimization. In Advances in Neural Information Processing Systems (NeurIPS), 2019
2019
-
[22]
Truly proximal policy optimization
Yuhui Wang, Hao He, and Xiaoyang Tan. Truly proximal policy optimization. In Conference on Uncertainty in Artificial Intelligence (UAI), 2020
2020
-
[23]
Simple policy optimization
Zhengpeng Xie, Qiang Zhang, Fan Yang, Marco Hutter, and Renjing Xu. Simple policy optimization. In Proceedings of the 42nd International Conference on Machine Learning (ICML), pages 68813–68824, 2025
2025
-
[24]
On the design of KL-regularized policy gradient algorithms for LLM reasoning
Yifan Zhang, Yifeng Liu, Huizhuo Yuan, Yang Yuan, Quanquan Gu, and Andrew Chi-Chih Yao. On the design of KL-regularized policy gradient algorithms for LLM reasoning. In International Conference on Learning Representations (ICLR), 2026.
2026
A Weight-space dual: PPO-Clip as a Φ penalty
The per-sample $\beta_t$ identity of the main text places PPO-Clip in distribution ...
- the min formulation of Schulman et al. [15]: $L^{\mathrm{CLIP}} = \mathbb{E}_t\left[\min\left(w_t \hat{A}_t,\ \mathrm{clip}(w_t)\,\hat{A}_t\right)\right]$
- the weight-space form of Theorem 2: $L^{\mathrm{CLIP}} = \mathbb{E}_t\left[w_t \hat{A}_t - \Phi(w_t, \hat{A}_t)\right]$, with penalty acting on $|w_t - 1|$ in importance-weight space
- the per-sample KL form of the main theorem, which matches PPO-Clip in gradient: $\nabla_{\theta'} L^{\mathrm{CLIP}} = \nabla_{\theta'} \mathbb{E}_t\left[w_t \hat{A}_t + \beta_t \log \pi_{\theta'}(a_t \mid s_t)\right]$, $\beta_t = -w_t \hat{A}_t \cdot \mathbf{1}\left[(i, t) \in I_{\mathrm{kill}}\right]$, with penalty acting in distribution space.
The first two forms are equal as functions and differ only in surface notation; the third matches them in per-sample gradient and places PPO-Clip inside the PPO-KL family
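The claim that the min form and the weight-space form are equal as functions can be checked numerically once a concrete penalty is fixed. The excerpt does not define Φ, so the sketch below uses a hypothetical one obtained from the elementary identity min(x, y) = x - max(0, x - y); the clip range and all function names are likewise assumptions, and the paper's Theorem 2 may state Φ differently.

```python
def clip(w, eps=0.2):
    """Clamp the importance weight to [1 - eps, 1 + eps] (eps is assumed)."""
    return min(max(w, 1.0 - eps), 1.0 + eps)


def min_form(w, adv, eps=0.2):
    """Per-sample clipped surrogate in the min notation."""
    return min(w * adv, clip(w, eps) * adv)


def phi(w, adv, eps=0.2):
    """Hypothetical penalty: nonzero only when |w - 1| > eps and the
    advantage pushes the weight further out of the trust region."""
    return max(0.0, (w - clip(w, eps)) * adv)


def weight_space_form(w, adv, eps=0.2):
    """Per-sample surrogate written as w * adv minus a penalty."""
    return w * adv - phi(w, adv, eps)


# Compare the two forms on a grid of weights and advantages.
grid = [(k / 10.0, adv) for k in range(1, 31) for adv in (-2.0, -0.5, 0.5, 2.0)]
max_diff = max(abs(min_form(w, a) - weight_space_form(w, a)) for w, a in grid)
print(f"largest difference between the two forms: {max_diff:.2e}")
```

With this choice the two forms agree up to floating-point rounding on the whole grid, and the penalty vanishes inside the trust region, matching the description of a penalty acting on |w_t - 1|.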