
arxiv: 2606.23932 · v1 · pith:SVM75E4Gnew · submitted 2026-06-22 · 💻 cs.LG

KLip-PPO: A per-sample KL perspective on PPO-Clip

Pith reviewed 2026-06-26 08:29 UTC · model grok-4.3

classification 💻 cs.LG
keywords proximal policy optimization · ppo-clip · kl penalty · importance ratio · policy gradient · surrogate objective · reinforcement learning

The pith

PPO-Clip's gradient is exactly reproduced by a per-sample KL surrogate whose coefficient depends on the importance ratio and advantage.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that the clipped surrogate objective in Proximal Policy Optimization can be rewritten as a Kullback-Leibler penalty with a sample-specific coefficient. This coefficient has a closed-form expression based on the importance sampling ratio and the advantage value. The equivalence holds exactly at each minibatch step and across the full inner optimization loop. Training curves on five continuous control benchmarks are indistinguishable between the two formulations. This perspective shows that the clipping operation implicitly applies a step-function penalty outside the trust region.

Core claim

The gradient of the clipped surrogate is reproduced exactly by a Kullback-Leibler surrogate whose coefficient varies per sample, with closed-form dependence on the importance ratio and the advantage. The identity holds at every minibatch step and across the entire inner loop, and on five MuJoCo continuous-control benchmarks the two losses produce indistinguishable training curves. The reformulation exposes a structural feature of the clipped surrogate that the min notation hides.

What carries the argument

The per-sample KL coefficient, a closed-form function of the importance ratio and advantage that makes the KL surrogate gradient identical to the clipped surrogate gradient at every update.
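
The coefficient described here can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the helper names (`in_kill_region`, `beta_per_sample`, `clip_score_multiplier`, `kl_form_score_multiplier`) are hypothetical, and the closed form used, β = −wÂ on the region where w > 1+ε with Â > 0 or w < 1−ε with Â < 0 and zero elsewhere, is read off the captions of Figures 8 and 9. The sketch also assumes β is held constant when differentiating. Each multiplier function returns the scalar that sits in front of ∇ log π(a|s) in the per-sample gradient.

```python
def in_kill_region(w, adv, eps=0.2):
    """True where the clipped surrogate has zero gradient (I_kill in the figure captions)."""
    return (w > 1.0 + eps and adv > 0.0) or (w < 1.0 - eps and adv < 0.0)


def beta_per_sample(w, adv, eps=0.2):
    """Assumed closed form: beta = -w * A on the kill region, 0 elsewhere."""
    return -w * adv if in_kill_region(w, adv, eps) else 0.0


def clip_score_multiplier(w, adv, eps=0.2):
    """Coefficient on grad log pi in the gradient of min(w*A, clip(w)*A).

    The derivative with respect to w is A where the unclipped branch is active
    and 0 on the kill region; the chain rule then uses grad w = w * grad log pi.
    """
    return 0.0 if in_kill_region(w, adv, eps) else w * adv


def kl_form_score_multiplier(w, adv, eps=0.2):
    """Coefficient on grad log pi in the gradient of w*A + beta*log pi, beta held fixed."""
    return w * adv + beta_per_sample(w, adv, eps)


# Hypothetical (w, A) samples covering both kill corners and the pass region.
samples = [(1.5, 1.0), (1.5, -1.0), (0.5, -2.0), (0.5, 2.0), (1.05, 0.3)]
for w, adv in samples:
    print(w, adv, beta_per_sample(w, adv),
          clip_score_multiplier(w, adv), kl_form_score_multiplier(w, adv))
```

On the kill region the two terms of the KL-form multiplier cancel, which is the step-function behaviour the summary attributes to clipping; elsewhere β is zero and both multipliers reduce to wÂ.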

If this is right

  • The min notation in the clipped surrogate hides an implicit per-sample penalty that is a step function at the trust region boundary.
  • The shape of this per-sample coefficient is the natural design axis for generalising the PPO algorithm.
  • The two loss formulations produce identical policy updates and training dynamics on standard benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This equivalence suggests designing new PPO variants by selecting different functional forms for the per-sample coefficient.
  • It may account for the similar performance often seen between clipped and KL versions of PPO in practice.
  • The identity could be checked on discrete-action tasks or non-MuJoCo environments to test its scope.

Load-bearing premise

The derivation assumes the standard PPO advantage estimator and the usual definition of the importance ratio.

What would settle it

If the gradient of the per-sample KL loss with the derived coefficient fails to match the gradient of the clipped loss on any minibatch, the claimed identity is false.
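
A check of this kind can be sketched numerically. The snippet below is a toy illustration under stated assumptions, not the paper's experiment: the policy is a hypothetical unit-variance 1-D Gaussian with a single mean parameter `theta`, the minibatch of (action, advantage) pairs is made up, the per-sample surrogate is taken to be wÂ + β log π with β = −wÂ on the kill region (the form quoted in the reference-graph excerpt of the main theorem), and β is frozen at the current parameters before differentiating. Gradients are estimated by central finite differences, so the comparison only holds up to numerical tolerance.

```python
import math

EPS_CLIP = 0.2  # assumed clip range


def log_prob(a, mean):
    """Log-density of a unit-variance Gaussian policy N(mean, 1)."""
    return -0.5 * (a - mean) ** 2 - 0.5 * math.log(2.0 * math.pi)


def ratio(a, theta, theta_old):
    """Importance ratio w = pi_theta(a) / pi_theta_old(a)."""
    return math.exp(log_prob(a, theta) - log_prob(a, theta_old))


def clip_surrogate(theta, theta_old, batch):
    """Minibatch mean of min(w*A, clip(w, 1-eps, 1+eps)*A)."""
    total = 0.0
    for a, adv in batch:
        w = ratio(a, theta, theta_old)
        w_c = min(max(w, 1.0 - EPS_CLIP), 1.0 + EPS_CLIP)
        total += min(w * adv, w_c * adv)
    return total / len(batch)


def frozen_betas(theta, theta_old, batch):
    """Per-sample coefficients evaluated once at theta and then held constant."""
    betas = []
    for a, adv in batch:
        w = ratio(a, theta, theta_old)
        kill = (w > 1.0 + EPS_CLIP and adv > 0.0) or (w < 1.0 - EPS_CLIP and adv < 0.0)
        betas.append(-w * adv if kill else 0.0)
    return betas


def kl_form_surrogate(theta, theta_old, batch, betas):
    """Minibatch mean of w*A + beta*log pi_theta(a), with beta treated as data."""
    total = 0.0
    for (a, adv), beta in zip(batch, betas):
        total += ratio(a, theta, theta_old) * adv + beta * log_prob(a, theta)
    return total / len(batch)


def central_diff(f, x, h=1e-5):
    """Central finite-difference estimate of df/dx."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


theta_old, theta = 0.0, 0.3
# Made-up (action, advantage) pairs; ratios are chosen to sit away from the clip kinks.
batch = [(2.0, 1.0), (2.0, -1.0), (-2.0, -0.5), (-2.0, 0.7), (0.1, 0.3), (1.0, 2.0)]
betas = frozen_betas(theta, theta_old, batch)

grad_clip = central_diff(lambda t: clip_surrogate(t, theta_old, batch), theta)
grad_kl = central_diff(lambda t: kl_form_surrogate(t, theta_old, batch, betas), theta)
print(grad_clip, grad_kl)
```

A mismatch beyond finite-difference error on a minibatch like this would be the kind of counterexample the criterion asks for. Samples whose ratio lands exactly on 1 ± ε sit on a kink of the clipped objective and would need separate treatment.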

Figures

Figures reproduced from arXiv: 2606.23932 by Riccardo Colletti, Robin Holzinger.

Figure 1: Summary of the per-sample equivalence. In the green rows the PPO-Clip gradient is …
Figure 2: Episode return on the five MuJoCo tasks (mean over …
Figure 3: The clipping function $w_c = \mathrm{clip}(w, 1-\epsilon, 1+\epsilon)$. Inside the band $[1-\epsilon, 1+\epsilon]$ the clipped weight equals the true importance ratio and the gradient flows normally; outside the band the clipped weight is constant and the gradient with respect to $\theta'$ vanishes.
Figure 4: The PPO-Clip surrogate, split by the sign of … (Panels: positive advantage, $\hat{A} > 0$, and negative advantage, $\hat{A} < 0$; horizontal axis $w$, vertical axis objective.)
Figure 5: PPO-Clip. The outer loop collects rollouts and fits the critic; the inner loop takes …
Figure 6: PPO-KL. The objective is the unclipped importance-weighted advantage plus a KL penalty …
Figure 7: Per-sample effective gradient multiplier as a function of the importance ratio …
Figure 8: Morphing from scalar to per-sample $\beta$, shown for the case $\hat{A} > 0$. (a) A constant scalar $\beta$ produces the smooth curve $\hat{A} + \beta/w$, which can never coincide with the PPO-Clip step. (b) Letting $\beta$ adapt in time but stay scalar shifts the curve along its family but cannot reshape it. (c) Letting $\beta$ become per-sample, with $\beta_t = -w_t \hat{A}_t$ exactly on $I_{\mathrm{kill}}$ and zero elsewhere, snaps the curve onto the PPO-Clip step function …
Figure 9: The $\beta_t$ landscape in $(w, \hat{A})$ space. Vertical dashed lines mark the trust region $[1-\epsilon, 1+\epsilon]$. Two corners (top-right $w > 1+\epsilon$, $\hat{A} > 0$ and bottom-left $w < 1-\epsilon$, $\hat{A} < 0$) carry the per-sample coefficient $\beta_t = -w\hat{A}$, with shading indicating $|\beta_t|$. Everywhere else $\beta_t = 0$. The support and value of $\beta_t$ reproduce the PPO-Clip gradient sample by sample.
Figure 10: The per-sample gradient under the two surrogates. The same sample feeds both per-sample …
Figure 11: PPO-Clip and per-sample PPO-KL on each task, mean over …
Figure 12: Final return (mean ± standard error over 5 seeds) on CartPole-v1, HalfCheetah-v4, and Hopper-v4 as a function of each trust-region knob: clip $\epsilon$ for PPO-Clip, fixed $\beta$ for PPO-KL, and the KL target for adaptive PPO-KL.
Figure 13: For PPO-Clip, the fraction of each minibatch that falls in $I_{\mathrm{kill}}$ and $I_{\mathrm{pass}}$ over training. This is the empirical view of the partition on which the identity rests: the penalty $\beta_t$ is non-zero only on $I_{\mathrm{kill}}$, and its reach grows with task difficulty, exceeding half the batch on Humanoid-v4. (Axes: environment step, batch fraction; series: kill, pass.)
Figure 14: Per-sample coefficient $\beta_t$ over training for the per-sample variant. The median is zero because $\beta_t$ vanishes on every sample outside the kill region; the shaded bands show how far the active coefficient $-w_t \hat{A}_t$ reaches on the samples in $I_{\mathrm{kill}}$, widening on the high-dimensional tasks.
Original abstract

Proximal Policy Optimization (PPO) is the standard policy-gradient algorithm for on-policy reinforcement learning. The literature presents it in two forms, a clipped surrogate that bounds the importance ratio between successive policies and a Kullback-Leibler penalty between them. These forms are treated as separate algorithms with their own gradients, their own hyperparameters, and their own reference implementations, and a sizeable body of empirical work compares them. We show that the gradient of the clipped surrogate is reproduced exactly by a Kullback-Leibler surrogate whose coefficient varies per sample, with closed-form dependence on the importance ratio and the advantage. The identity holds at every minibatch step and across the entire inner loop, and on five MuJoCo continuous-control benchmarks the two losses produce indistinguishable training curves. The reformulation exposes a structural feature of the clipped surrogate that the min notation hides. PPO-Clip's implicit per-sample penalty is a step function at the boundary of the trust region, and the shape of this coefficient is the natural design axis for generalising the algorithm. We sketch the resulting follow-up directions in the discussion.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript claims that the gradient of the PPO-Clip surrogate is exactly reproduced by a per-sample KL-penalized surrogate whose coefficient beta is a closed-form function of the importance ratio r and advantage A; this identity is asserted to hold at every minibatch step, and the resulting losses produce indistinguishable training curves on five MuJoCo continuous-control benchmarks. The reformulation is presented as exposing the clipped surrogate's implicit step-function penalty at the trust-region boundary.

Significance. If the claimed algebraic identity holds under the definitions and estimators actually used in the MuJoCo experiments, the work supplies a useful structural reinterpretation of PPO-Clip that could guide the design of generalized trust-region penalties. The empirical indistinguishability on standard benchmarks is consistent with the claim, though the absence of released code or step-by-step derivation limits immediate verification and reuse.

major comments (2)
  1. [Abstract, §3] Abstract and §3: The central claim of an 'exact identity' between the clipped-surrogate gradient and the per-sample KL gradient is asserted without any displayed algebraic steps or intermediate expressions for beta(r, A). Because this identity is the load-bearing result, the derivation (including the precise form of the KL term) must be supplied before the claim can be assessed.
  2. [§4] §4 (MuJoCo experiments): The reported indistinguishable curves are obtained with Gaussian policies, for which the conventional KL surrogate is the analytic divergence between state-conditional Gaussians. This analytic KL gradient does not factor as a scalar multiple of the action-dependent score function grad log pi(a|s), so no scalar beta can equate the vectors in general; the identity therefore cannot hold for the standard analytic KL unless the paper instead employs the Monte Carlo estimator r log r throughout the comparison.
minor comments (1)
  1. [§3] Notation for the per-sample coefficient should be introduced with an explicit equation number rather than inline prose.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We respond to each major comment below. Where the comments indicate that additional material is needed for clarity, we have revised the manuscript accordingly.

point-by-point responses
  1. Referee: [Abstract, §3] Abstract and §3: The central claim of an 'exact identity' between the clipped-surrogate gradient and the per-sample KL gradient is asserted without any displayed algebraic steps or intermediate expressions for beta(r, A). Because this identity is the load-bearing result, the derivation (including the precise form of the KL term) must be supplied before the claim can be assessed.

    Authors: We agree that the algebraic derivation of the identity and the closed-form expression for beta(r, A) should be presented explicitly in the main text. The revised manuscript adds a dedicated subsection in §3 that walks through the steps: it starts from the gradient of the PPO-Clip surrogate and shows how that gradient can be rewritten as the gradient of a per-sample KL term, with beta derived as a function of r and A. The KL term used is the Monte Carlo estimator consistent with the importance sampling gradient. revision: yes

  2. Referee: [§4] §4 (MuJoCo experiments): The reported indistinguishable curves are obtained with Gaussian policies, for which the conventional KL surrogate is the analytic divergence between state-conditional Gaussians. This analytic KL gradient does not factor as a scalar multiple of the action-dependent score function grad log pi(a|s), so no scalar beta can equate the vectors in general; the identity therefore cannot hold for the standard analytic KL unless the paper instead employs the Monte Carlo estimator r log r throughout the comparison.

    Authors: We thank the referee for highlighting this important distinction. The per-sample KL surrogate in our work uses a Monte Carlo estimator of the r log r form (specifically, the term that aligns with the policy gradient structure) rather than the analytic KL divergence. This choice lets the gradients be equated via the closed-form beta. We have clarified this in the revised §4 and added a note in the methods section explaining why the sample-based estimator is needed for the identity to hold. The experiments remain valid because PPO-Clip and KLip-PPO use consistent estimators. revision: yes
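
The referee's point about the analytic KL can be illustrated with a small numeric sketch. It rests on assumptions that are not taken from the paper: a state-independent 1-D Gaussian policy parameterised by `(mu, log_sigma)`, the divergence direction KL(old ‖ new), and hypothetical helper names. The gradient of the analytic KL is a single vector that does not depend on the sampled action, whereas the score ∇ log π(a|s) changes with the action, so a scalar multiple of the score can match the KL gradient only when the two vectors are parallel. In two dimensions, a non-zero cross product rules that out.

```python
import math


def analytic_kl_grad(mu_old, sigma_old, mu, log_sigma):
    """Gradient of KL(N(mu_old, sigma_old^2) || N(mu, sigma^2)) w.r.t. (mu, log_sigma)."""
    var = math.exp(2.0 * log_sigma)
    d_mu = (mu - mu_old) / var
    d_log_sigma = 1.0 - (sigma_old ** 2 + (mu - mu_old) ** 2) / var
    return d_mu, d_log_sigma


def score(a, mu, log_sigma):
    """Score function: gradient of log N(a; mu, sigma^2) w.r.t. (mu, log_sigma)."""
    var = math.exp(2.0 * log_sigma)
    return (a - mu) / var, (a - mu) ** 2 / var - 1.0


def cross_2d(u, v):
    """Scalar cross product; zero exactly when u and v are parallel."""
    return u[0] * v[1] - u[1] * v[0]


# Hypothetical parameters: old policy N(0, 1), new policy N(0.3, 1).
g_kl = analytic_kl_grad(mu_old=0.0, sigma_old=1.0, mu=0.3, log_sigma=0.0)
crosses = [cross_2d(g_kl, score(a, mu=0.3, log_sigma=0.0)) for a in (1.0, -0.5)]
print(g_kl, crosses)
```

For these sampled actions the cross products are non-zero, so no per-sample scalar rescales the score onto the analytic KL gradient. This is consistent with the authors' statement that the identity is formulated with a sample-based KL term instead.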

Circularity Check

0 steps flagged

Algebraic identity derived directly from surrogate definitions; no circularity.

full rationale

The central claim is an explicit algebraic equivalence: the gradient of the PPO-Clip surrogate equals the gradient of a KL surrogate scaled by a per-sample coefficient β(r, A) whose closed form is obtained by matching the two gradient expressions. This matching is performed by direct differentiation of the given loss definitions (importance ratio r, advantage A, clip operator) without any fitted parameters, data-dependent tuning, or load-bearing self-citations. The MuJoCo curves are presented only as empirical corroboration after the identity is stated. No step reduces to a renaming, a fitted input relabeled as prediction, or an ansatz imported via prior work by the same authors. The derivation is therefore self-contained against the paper's own equations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the standard definitions of the PPO clipped surrogate and the KL penalty; no additional free parameters or invented entities are introduced in the abstract.

axioms (1)
  • [standard math] Standard definitions of the PPO clipped surrogate objective and the KL penalty objective, as used in reference implementations.
    The equivalence is stated to hold for these conventional definitions.



Reference graph

Works this paper leans on

27 extracted references · 8 canonical work pages · 4 internal anchors

  1. [1]

    What matters in on-policy reinforcement learning? a large-scale empirical study

    Marcin Andrychowicz, Anton Raichuk, Piotr Stańczyk, Manu Orsini, Sertan Girgin, Raphael Marinier, Léonard Hussenot, Matthieu Geist, Olivier Pietquin, Marcin Michalski, Sylvain Gelly, and Olivier Bachem. What matters in on-policy reinforcement learning? a large-scale empirical study. In International Conference on Learning Representations (ICLR), 2021

  2. [2]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  3. [3]

    It’s not you, it’s clipping: A soft trust-region via probability smoothing for LLM RL

    Madeleine Dwyer, Adam Sobey, and Adriane Chapman. It’s not you, it’s clipping: A soft trust-region via probability smoothing for LLM RL. arXiv preprint arXiv:2509.21282, 2025

  4. [4]

    Implementation matters in deep RL: A case study on PPO and TRPO

    Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Firdaus Janoos, Larry Rudolph, and Aleksander Madry. Implementation matters in deep RL: A case study on PPO and TRPO. In International Conference on Learning Representations (ICLR), 2020

  5. [5]

    Nicolas Heess, Dhruva TB, Srinivasan Sriram, Jay Lemmon, Josh Merel, Greg Wayne, Yuval Tassa, Tom Erez, Ziyu Wang, S. M. Ali Eslami, Martin Riedmiller, and David Silver. Emergence of locomotion behaviours in rich environments. arXiv preprint arXiv:1707.02286, 2017

  6. [6]

    Revisiting design choices in proximal policy optimization

    Chloe Ching-Yun Hsu, Celestine Mendler-Dünner, and Moritz Hardt. Revisiting design choices in proximal policy optimization. arXiv preprint arXiv:2009.10897, 2020

  7. [7]

    Shengyi Huang, Rousslan Fernand Julien Dossa, Chang Ye, Jeff Braga, Dipam Chakraborty, Kinal Mehta, and João G. M. Araújo. CleanRL: High-quality single-file implementations of deep reinforcement learning algorithms. Journal of Machine Learning Research, 23(274):1–18, 2022

  8. [8]

    A closer look at deep policy gradients

    Andrew Ilyas, Logan Engstrom, Shibani Santurkar, Dimitris Tsipras, Firdaus Janoos, Larry Rudolph, and Aleksander Madry. A closer look at deep policy gradients. In International Conference on Learning Representations (ICLR), 2020

  9. [9]

    Approximately optimal approximate reinforcement learning

    Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In Proceedings of the 19th International Conference on Machine Learning (ICML), pages 267–274, 2002

  10. [10]

    Rethinking KL regularization in RLHF: From value estimation to gradient optimization

    Kezhao Liu, Jason Klein Liu, Mingtao Chen, and Yiming Liu. Rethinking KL regularization in RLHF: From value estimation to gradient optimization. arXiv preprint arXiv:2510.01555, 2025

  11. [11]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022

  12. [12]

    Stable-baselines3: Reliable reinforcement learning implementations

    Antonin Raffin, Ashley Hill, Adam Gleave, Anssi Kanervisto, Maximilian Ernestus, and Noah Dormann. Stable-baselines3: Reliable reinforcement learning implementations. Journal of Machine Learning Research, 22(268):1–8, 2021

  13. [13]

    Trust region policy optimization

    John Schulman, Sergey Levine, Pieter Abbeel, Michael I. Jordan, and Philipp Moritz. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pages 1889–1897, 2015

  14. [14]

    High-dimensional continuous control using generalized advantage estimation

    John Schulman, Philipp Moritz, Sergey Levine, Michael I. Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. In International Conference on Learning Representations (ICLR), 2016

  15. [15]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

  16. [16]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  17. [17]

    You may not need ratio clipping in PPO

    Mingfei Sun, Vitaly Kurin, Guoqing Liu, Sam Devlin, Tao Qin, Katja Hofmann, and Shimon Whiteson. You may not need ratio clipping in PPO. arXiv preprint arXiv:2202.00079, 2022

  18. [18]

    Reinforcement Learning: An Introduction

    Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2nd edition, 2018

  19. [19]

    Policy gradient methods for reinforcement learning with function approximation

    Richard S. Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems (NIPS), 2000

  20. [20]

    MuJoCo: A physics engine for model-based control

    Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5026–5033, 2012

  21. [21]

    Trust region-guided proximal policy optimization

    Yuhui Wang, Hao He, Xiaoyang Tan, and Yaozhong Gan. Trust region-guided proximal policy optimization. In Advances in Neural Information Processing Systems (NeurIPS), 2019

  22. [22]

    Truly proximal policy optimization

    Yuhui Wang, Hao He, and Xiaoyang Tan. Truly proximal policy optimization. In Conference on Uncertainty in Artificial Intelligence (UAI), 2020

  23. [23]

    Simple policy optimization

    Zhengpeng Xie, Qiang Zhang, Fan Yang, Marco Hutter, and Renjing Xu. Simple policy optimization. In Proceedings of the 42nd International Conference on Machine Learning (ICML), pages 68813–68824, 2025

  24. [24]

    On the design of KL-regularized policy gradient algorithms for LLM reasoning

    Yifan Zhang, Yifeng Liu, Huizhuo Yuan, Yang Yuan, Quanquan Gu, and Andrew Chi-Chih Yao. On the design of KL-regularized policy gradient algorithms for LLM reasoning. In International Conference on Learning Representations (ICLR), 2026.

    A Weight-space dual: PPO-Clip as a $\Phi$ penalty. The per-sample $\beta_t$ identity of the main text places PPO-Clip in distribution ...

  25. [25]

    the min formulation of Schulman et al. [15]: $L^{\mathrm{CLIP}} = \mathbb{E}_t\left[\min\left(w_t \hat{A}_t,\ \mathrm{clip}(w_t, 1-\epsilon, 1+\epsilon)\,\hat{A}_t\right)\right]$

  26. [26]

    the weight-space form of Theorem 2: $L^{\mathrm{CLIP}} = \mathbb{E}_t\left[w_t \hat{A}_t - \Phi(w_t, \hat{A}_t)\right]$, with penalty acting on $|w_t - 1|$ in importance-weight space

  27. [27]

    the per-sample KL form of the main theorem, which matches PPO-Clip in gradient: $\nabla_{\theta'} L^{\mathrm{CLIP}} = \nabla_{\theta'} \mathbb{E}_t\left[w_t \hat{A}_t + \beta_t \log \pi_{\theta'}(a_t \mid s_t)\right]$, $\beta_t = -w_t \hat{A}_t \cdot \mathbb{1}[(i, t) \in I_{\mathrm{kill}}]$, with penalty acting in distribution space. The first two forms are equal as functions and differ only in surface notation; the third matches them in per-sample gradient and places PPO-Clip inside the PPO-KL family