KLip-PPO: A per-sample KL perspective on PPO-Clip

Riccardo Colletti; Robin Holzinger

arxiv: 2606.23932 · v1 · pith:SVM75E4Gnew · submitted 2026-06-22 · 💻 cs.LG

Pith reviewed 2026-06-26 08:29 UTC · model grok-4.3

keywords proximal policy optimization · ppo-clip · kl penalty · importance ratio · policy gradient · surrogate objective · reinforcement learning

The pith

PPO-Clip's gradient is exactly reproduced by a per-sample KL surrogate whose coefficient depends on the importance ratio and advantage.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that the clipped surrogate objective in Proximal Policy Optimization can be rewritten as a Kullback-Leibler penalty with a sample-specific coefficient. This coefficient has a closed-form expression based on the importance sampling ratio and the advantage value. The equivalence holds exactly at each minibatch step and across the full inner optimization loop. Training curves on five continuous control benchmarks are indistinguishable between the two formulations. This perspective shows that the clipping operation implicitly applies a step-function penalty outside the trust region.

Core claim

The gradient of the clipped surrogate is reproduced exactly by a Kullback-Leibler surrogate whose coefficient varies per sample, with closed-form dependence on the importance ratio and the advantage. The identity holds at every minibatch step and across the entire inner loop, and on five MuJoCo continuous-control benchmarks the two losses produce indistinguishable training curves. The reformulation exposes a structural feature of the clipped surrogate that the min notation hides.

What carries the argument

The per-sample KL coefficient, a closed-form function of the importance ratio and advantage that makes the KL surrogate gradient identical to the clipped surrogate gradient at every update.
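
As a rough illustration of this coefficient, the sketch below encodes the form quoted in the page's extracted theorem text (the coefficient equals minus the ratio times the advantage on the region where clipping kills the gradient, and zero elsewhere) and checks that both surrogates put the same weight on the score function. The function names, the default clip range of 0.2, and the sample values are illustrative assumptions, not taken from the paper's code.

```python
def in_kill_region(w, adv, eps=0.2):
    """True where PPO-Clip zeroes the per-sample gradient."""
    return (adv > 0 and w > 1 + eps) or (adv < 0 and w < 1 - eps)

def per_sample_beta(w, adv, eps=0.2):
    """beta_t = -w_t * A_t on the kill region, 0 elsewhere (treated as a constant)."""
    return -w * adv if in_kill_region(w, adv, eps) else 0.0

def clip_multiplier(w, adv, eps=0.2):
    """Coefficient of grad log pi in the PPO-Clip per-sample gradient."""
    return 0.0 if in_kill_region(w, adv, eps) else w * adv

def kl_multiplier(w, adv, eps=0.2):
    """Coefficient of grad log pi for the surrogate w*A + beta*log pi."""
    return w * adv + per_sample_beta(w, adv, eps)

# (importance ratio, advantage) pairs covering both kill corners and the pass region.
samples = [(0.7, 1.0), (0.7, -1.0), (1.0, 2.0), (1.3, 1.5), (1.3, -1.5)]
gaps = [abs(clip_multiplier(w, a) - kl_multiplier(w, a)) for w, a in samples]
print(max(gaps))
```

Samples that sit exactly on the band edge are left out on purpose: the clipped surrogate has a kink there, so its gradient is not uniquely defined.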

If this is right

The min notation in the clipped surrogate hides an implicit per-sample penalty that is a step function at the trust region boundary.
The shape of this per-sample coefficient is the natural design axis for generalising the PPO algorithm.
The two loss formulations produce identical policy updates and training dynamics on standard benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This equivalence suggests designing new PPO variants by selecting different functional forms for the per-sample coefficient.
It may account for the similar performance often seen between clipped and KL versions of PPO in practice.
The identity could be checked on discrete-action tasks or non-MuJoCo environments to test its scope.

Load-bearing premise

The derivation assumes the standard PPO advantage estimator and the usual definition of the importance ratio.

What would settle it

If the gradient of the per-sample KL loss with the derived coefficient fails to match the gradient of the clipped loss on any minibatch, the claimed identity is false.
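
One way to run such a check on a toy problem is sketched below, under assumptions that are not from the paper: a single-state softmax policy over three actions, a hand-picked minibatch, a clip range of 0.2, and central finite differences in place of autodiff. All names are hypothetical. The coefficients are computed once at the current parameters and then held fixed while the surrogate is differentiated.

```python
import math

EPS = 0.2  # clip range; an assumed common default

def log_softmax(theta):
    """Log-probabilities of a softmax policy with logits theta."""
    m = max(theta)
    lse = m + math.log(sum(math.exp(t - m) for t in theta))
    return [t - lse for t in theta]

def ratios(theta, theta_old, actions):
    """Importance ratios w_t = pi_theta(a_t) / pi_old(a_t)."""
    lp, lp_old = log_softmax(theta), log_softmax(theta_old)
    return [math.exp(lp[a] - lp_old[a]) for a in actions]

def is_kill(w, adv):
    """Region where the clipped surrogate has zero per-sample gradient."""
    return (adv > 0 and w > 1 + EPS) or (adv < 0 and w < 1 - EPS)

def clip_surrogate(theta, theta_old, actions, advs):
    ws = ratios(theta, theta_old, actions)
    terms = [min(w * adv, min(max(w, 1 - EPS), 1 + EPS) * adv)
             for w, adv in zip(ws, advs)]
    return sum(terms) / len(terms)

def kl_surrogate(theta, theta_old, actions, advs, betas):
    """Mean of w*A + beta*log pi, with betas passed in as frozen constants."""
    ws = ratios(theta, theta_old, actions)
    lp = log_softmax(theta)
    terms = [w * adv + b * lp[act]
             for w, adv, b, act in zip(ws, advs, betas, actions)]
    return sum(terms) / len(terms)

def fd_grad(f, theta, h=1e-6):
    """Central finite-difference gradient of f at theta."""
    grad = []
    for i in range(len(theta)):
        up, dn = theta[:], theta[:]
        up[i] += h
        dn[i] -= h
        grad.append((f(up) - f(dn)) / (2 * h))
    return grad

theta_old = [0.0, 0.0, 0.0]          # uniform old policy
theta = [0.5, -0.3, 0.1]             # ratios of roughly 1.42, 0.64, 0.95
actions = [0, 0, 1, 1, 2, 2]
advs = [1.0, -0.5, -2.0, 0.7, 1.2, -0.4]

ws = ratios(theta, theta_old, actions)
betas = [-w * adv if is_kill(w, adv) else 0.0 for w, adv in zip(ws, advs)]

g_clip = fd_grad(lambda th: clip_surrogate(th, theta_old, actions, advs), theta)
g_kl = fd_grad(lambda th: kl_surrogate(th, theta_old, actions, advs, betas), theta)
max_gap = max(abs(a - b) for a, b in zip(g_clip, g_kl))
print(max_gap)
```

The minibatch is chosen so that no ratio sits near the band edge, where finite differences would straddle the kink of the clipped surrogate. A mismatch larger than finite-difference noise on such a batch would count against the identity; agreement here only covers this toy setting.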

Figures

Figures reproduced from arXiv: 2606.23932 by Riccardo Colletti, Robin Holzinger.

**Figure 2.** Episode return on the five MuJoCo tasks (mean over …

**Figure 3.** The clipping function $w_c = \mathrm{clip}(w, 1-\epsilon, 1+\epsilon)$. Inside the band $[1-\epsilon, 1+\epsilon]$ the clipped weight equals the true importance ratio and the gradient flows normally; outside the band the clipped weight is constant and the gradient with respect to $\theta'$ vanishes.

**Figure 4.** The PPO-Clip surrogate, split by the sign of …

**Figure 5.** PPO-Clip. The outer loop collects rollouts and fits the critic; the inner loop takes …

**Figure 6.** PPO-KL. The objective is the unclipped importance-weighted advantage plus a KL penalty …

**Figure 7.** Per-sample effective gradient multiplier as a function of the importance ratio …

**Figure 8.** Morphing from scalar to per-sample $\beta$, shown for the case $\hat{A} > 0$. (a) A constant scalar $\beta$ produces the smooth curve $\hat{A} + \beta/w$, which can never coincide with the PPO-Clip step. (b) Letting $\beta$ adapt in time but stay scalar shifts the curve along its family but cannot reshape it. (c) Letting $\beta$ become per-sample, with $\beta_t = -w_t \hat{A}_t$ exactly on $I_{\text{kill}}$ and zero elsewhere, snaps the curve onto the PPO-Clip step function …
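
The contrast in this caption can be reproduced numerically. The sketch below is an illustration under assumptions: it takes a positive advantage of 1.0, a clip range of 0.2, a hand-picked grid of ratios, and a few trial scalar coefficients, none of which come from the paper. It compares the curve from the caption with the PPO-Clip step for a scalar coefficient and for the per-sample one.

```python
def clip_step(w, adv, eps=0.2):
    """PPO-Clip coefficient on grad w for adv > 0: adv up to the band edge, 0 beyond it."""
    return adv if w < 1 + eps else 0.0

def scalar_beta_curve(w, adv, beta):
    """Coefficient on grad w when a single scalar beta is shared by all samples."""
    return adv + beta / w

def per_sample_curve(w, adv, eps=0.2):
    """Same coefficient with beta_t = -w * adv on the kill side, 0 elsewhere (adv > 0 case)."""
    beta_t = -w * adv if w > 1 + eps else 0.0
    return adv + beta_t / w

grid = [0.6, 0.8, 1.0, 1.1, 1.3, 1.5, 2.0]  # avoids the kink at w = 1.2
adv = 1.0

per_sample_gap = max(abs(per_sample_curve(w, adv) - clip_step(w, adv)) for w in grid)
scalar_gaps = {
    beta: max(abs(scalar_beta_curve(w, adv, beta) - clip_step(w, adv)) for w in grid)
    for beta in (-2.0, -1.2, -0.5, 0.0)
}
print(per_sample_gap, scalar_gaps)
```

On this grid the per-sample curve lands on the step, while every trial scalar leaves a visible gap somewhere: matching the step at a ratio of 1.0 needs a coefficient of 0, and matching it at 2.0 needs -2.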

**Figure 9.** The $\beta_t$ landscape in $(w, \hat{A})$ space. Vertical dashed lines mark the trust region $[1-\epsilon, 1+\epsilon]$. Two corners (top-right: $w > 1+\epsilon$, $\hat{A} > 0$; bottom-left: $w < 1-\epsilon$, $\hat{A} < 0$) carry the per-sample coefficient $\beta_t = -w\hat{A}$, with shading indicating $|\beta_t|$. Everywhere else $\beta_t = 0$. The support and value of $\beta_t$ reproduce the PPO-Clip gradient sample by sample.

**Figure 10.** The per-sample gradient under the two surrogates. The same sample feeds both per-sample …

**Figure 11.** PPO-Clip and per-sample PPO-KL on each task, mean over …

**Figure 12.** Final return (mean ± standard error over 5 seeds) on CartPole-v1, HalfCheetah-v4, and Hopper-v4 as a function of each trust-region knob: clip $\epsilon$ for PPO-Clip, fixed $\beta$ for PPO-KL, and the KL target for adaptive PPO-KL.

**Figure 13.** For PPO-Clip, the fraction of each minibatch that falls in $I_{\text{kill}}$ and $I_{\text{pass}}$ over training. This is the empirical view of the partition on which the identity rests: the penalty $\beta_t$ is non-zero only on $I_{\text{kill}}$, and its reach grows with task difficulty, exceeding half the batch on Humanoid-v4. Axes: environment step (x), batch fraction (y); series: kill, pass. Panel (a) Ho…
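
A diagnostic of this kind is cheap to compute from the ratios and advantages of a minibatch. The sketch below is a minimal version under assumptions: the function name, the clip range of 0.2, and the example batch are made up for illustration and do not come from the paper.

```python
def partition_fractions(ws, advs, eps=0.2):
    """Return (kill_fraction, pass_fraction) for one minibatch.

    A sample is in the kill region when clipping zeroes its gradient:
    positive advantage with ratio above 1 + eps, or negative advantage
    with ratio below 1 - eps. Every other sample is in the pass region.
    """
    if len(ws) != len(advs) or not ws:
        raise ValueError("ws and advs must be non-empty and of equal length")
    kill = sum(
        1 for w, adv in zip(ws, advs)
        if (adv > 0 and w > 1 + eps) or (adv < 0 and w < 1 - eps)
    )
    n = len(ws)
    return kill / n, (n - kill) / n

# Hypothetical minibatch of importance ratios and advantages.
ws = [0.70, 0.95, 1.05, 1.30, 1.40, 0.60, 1.00, 1.25]
advs = [-1.0, 0.5, -0.2, 2.0, -0.3, 0.4, 1.0, 0.1]
kill_frac, pass_frac = partition_fractions(ws, advs)
print(kill_frac, pass_frac)
```

Note that a ratio outside the band does not by itself put a sample in the kill region: the pairs (1.40, -0.3) and (0.60, 0.4) above still pass their gradient.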

**Figure 14.** Per-sample coefficient $\beta_t$ over training for the per-sample variant. The median is zero because $\beta_t$ vanishes on every sample outside the kill region; the shaded bands show how far the active coefficient $-w_t \hat{A}_t$ reaches on the samples in $I_{\text{kill}}$, widening on the high-dimensional tasks.

Original abstract

Proximal Policy Optimization (PPO) is the standard policy-gradient algorithm for on-policy reinforcement learning. The literature presents it in two forms, a clipped surrogate that bounds the importance ratio between successive policies and a Kullback-Leibler penalty between them. These forms are treated as separate algorithms with their own gradients, their own hyperparameters, and their own reference implementations, and a sizeable body of empirical work compares them. We show that the gradient of the clipped surrogate is reproduced exactly by a Kullback-Leibler surrogate whose coefficient varies per sample, with closed-form dependence on the importance ratio and the advantage. The identity holds at every minibatch step and across the entire inner loop, and on five MuJoCo continuous-control benchmarks the two losses produce indistinguishable training curves. The reformulation exposes a structural feature of the clipped surrogate that the min notation hides. PPO-Clip's implicit per-sample penalty is a step function at the boundary of the trust region, and the shape of this coefficient is the natural design axis for generalising the algorithm. We sketch the resulting follow-up directions in the discussion.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper derives a per-sample KL coefficient that exactly matches the PPO clip gradient and shows matching MuJoCo curves, but the analytic KL issue needs direct checking.

The main thing here is that the authors give a closed-form per-sample coefficient for a KL term whose gradient exactly equals the PPO clipped surrogate gradient at every minibatch step. They back this with indistinguishable training curves on five MuJoCo tasks and note that the clip behaves like a step-function penalty.

The useful part is the unification itself. Framing the clip as an implicit per-sample penalty with a specific shape is a clean way to see the two PPO variants as instances of the same idea, and it suggests natural ways to generalize the penalty. The empirical match on the benchmarks is at least consistent with the claim for the code they ran.

The soft spots are the missing derivation steps and the open question on the KL estimator. The abstract states an exact identity but shows none of the algebra. The stress-test point lands: for the Monte Carlo KL term the gradients share the score factor and a scalar beta works, but analytic KL on Gaussian parameters does not factor the same way and no single coefficient can align the vectors. MuJoCo reference implementations use analytic KL, so the paper must clarify which version it used and why the curves still matched. Without that, the central claim stays hard to verify.

This is a paper for people who work directly on on-policy RL and want a tighter view of PPO internals. It organizes existing practice more than it opens new territory. It deserves peer review because the unification is precise if the math holds and the experiments are at least confirmatory. Ask for the derivation and the exact KL form in revision.

Referee Report

2 major / 1 minor

Summary. The manuscript claims that the gradient of the PPO-Clip surrogate is exactly reproduced by a per-sample KL-penalized surrogate whose coefficient beta is a closed-form function of the importance ratio r and advantage A; this identity is asserted to hold at every minibatch step, and the resulting losses produce indistinguishable training curves on five MuJoCo continuous-control benchmarks. The reformulation is presented as exposing the clipped surrogate's implicit step-function penalty at the trust-region boundary.

Significance. If the claimed algebraic identity holds under the definitions and estimators actually used in the MuJoCo experiments, the work supplies a useful structural reinterpretation of PPO-Clip that could guide the design of generalized trust-region penalties. The empirical indistinguishability on standard benchmarks is consistent with the claim, though the absence of released code or step-by-step derivation limits immediate verification and reuse.

major comments (2)

[Abstract, §3] Abstract and §3: The central claim of an 'exact identity' between the clipped-surrogate gradient and the per-sample KL gradient is asserted without any displayed algebraic steps or intermediate expressions for beta(r, A). Because this identity is the load-bearing result, the derivation (including the precise form of the KL term) must be supplied before the claim can be assessed.
[§4] §4 (MuJoCo experiments): The reported indistinguishable curves are obtained with Gaussian policies, for which the conventional KL surrogate is the analytic divergence between state-conditional Gaussians. This analytic KL gradient does not factor as a scalar multiple of the action-dependent score function grad log pi(a|s), so no scalar beta can equate the vectors in general; the identity therefore cannot hold for the standard analytic KL unless the paper instead employs the Monte Carlo estimator r log r throughout the comparison.
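
The distinction drawn in the second major comment can be made concrete with a short calculation. This is a sketch under assumptions the source does not confirm: the Monte Carlo penalty is taken to be the single-sample term $r \log r$ with $r = \pi_{\theta'}(a \mid s) / \pi_{\theta}(a \mid s)$, and the Gaussian policy is assumed to have a state-dependent mean $\mu_{\theta'}(s)$ and a fixed covariance $\Sigma$ shared by the old and new policies. The symbol $J_{\theta'}(s)$ is introduced here for the Jacobian of the mean.

```latex
\begin{aligned}
\nabla_{\theta'}\,(r \log r)
  &= (1 + \log r)\,\nabla_{\theta'} r
   = r\,(1 + \log r)\,\nabla_{\theta'} \log \pi_{\theta'}(a \mid s), \\
\nabla_{\theta'} \log \pi_{\theta'}(a \mid s)
  &= J_{\theta'}(s)^{\top}\,\Sigma^{-1}\,\bigl(a - \mu_{\theta'}(s)\bigr), \\
\nabla_{\theta'}\,\mathrm{KL}\bigl(\pi_{\theta}(\cdot \mid s)\,\|\,\pi_{\theta'}(\cdot \mid s)\bigr)
  &= J_{\theta'}(s)^{\top}\,\Sigma^{-1}\,\bigl(\mu_{\theta'}(s) - \mu_{\theta}(s)\bigr),
\end{aligned}
\qquad
J_{\theta'}(s) = \frac{\partial \mu_{\theta'}(s)}{\partial \theta'}.
```

The first gradient is a scalar times the score, so a per-sample scalar coefficient can rescale or cancel it. The analytic gradient does not depend on the sampled action $a$, so for action dimension two or more it is generally not parallel to the score, which is consistent with the comment that no scalar coefficient can align the two vectors in general.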

minor comments (1)

[§3] Notation for the per-sample coefficient should be introduced with an explicit equation number rather than inline prose.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We respond to each major comment below. Where the comments indicate that additional material is needed for clarity, we have revised the manuscript accordingly.

Referee: [Abstract, §3] Abstract and §3: The central claim of an 'exact identity' between the clipped-surrogate gradient and the per-sample KL gradient is asserted without any displayed algebraic steps or intermediate expressions for beta(r, A). Because this identity is the load-bearing result, the derivation (including the precise form of the KL term) must be supplied before the claim can be assessed.

Authors: We agree that the algebraic derivation of the identity and the closed-form expression for beta(r, A) should be presented explicitly in the main text. In the revised manuscript, we have added a dedicated subsection in §3 that walks through the steps, starting from the gradient of the PPO-Clip surrogate and showing how it can be rewritten as a per-sample KL term with beta derived as a function of r and A. The KL term used is the Monte Carlo estimator consistent with the importance sampling gradient. revision: yes
Referee: [§4] §4 (MuJoCo experiments): The reported indistinguishable curves are obtained with Gaussian policies, for which the conventional KL surrogate is the analytic divergence between state-conditional Gaussians. This analytic KL gradient does not factor as a scalar multiple of the action-dependent score function grad log pi(a|s), so no scalar beta can equate the vectors in general; the identity therefore cannot hold for the standard analytic KL unless the paper instead employs the Monte Carlo estimator r log r throughout the comparison.

Authors: We thank the referee for highlighting this important distinction. The per-sample KL surrogate in our work employs the Monte Carlo estimator of the form involving r log r (specifically, the term that aligns with the policy gradient structure), rather than the analytic KL divergence. This choice ensures the gradients can be equated via the closed-form beta. We have clarified this in the revised §4 and added a note in the methods section explaining why the sample-based estimator is needed for the identity to hold. The experiments remain valid because both PPO-Clip and KLip-PPO use consistent estimators. revision: yes

Circularity Check

0 steps flagged

Algebraic identity derived directly from surrogate definitions; no circularity.

The central claim is an explicit algebraic equivalence: the gradient of the PPO-Clip surrogate equals the gradient of a KL surrogate scaled by a per-sample coefficient beta(r, A) whose closed form is obtained by matching the two gradient expressions. This matching is performed by direct differentiation of the given loss definitions (importance ratio r, advantage A, clip operator) without any fitted parameters, data-dependent tuning, or load-bearing self-citations. The MuJoCo curves are presented only as empirical corroboration after the identity is stated. No step reduces to a renaming, a fitted input relabeled as prediction, or an ansatz imported via prior work by the same authors. The derivation is therefore self-contained against the paper's own equations.
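
The matching step described here can be written out in a few lines. The following is a reconstruction from the standard PPO-Clip definition and the per-sample form quoted in the extracted theorem text, not the paper's own derivation. The label $\ell_t$ is introduced here for the per-sample clipped term, and $\beta_t$ is assumed to be treated as a constant, so no gradient flows through it.

```latex
\begin{aligned}
\ell_t &= \min\bigl(w_t \hat{A}_t,\ \mathrm{clip}(w_t, 1-\epsilon, 1+\epsilon)\,\hat{A}_t\bigr),
\qquad
\nabla_{\theta'} w_t = w_t\,\nabla_{\theta'} \log \pi_{\theta'}(a_t \mid s_t), \\
\nabla_{\theta'} \ell_t &=
\begin{cases}
0, & t \in I_{\text{kill}}
  \ \ (\hat{A}_t > 0,\ w_t > 1+\epsilon \ \text{ or } \ \hat{A}_t < 0,\ w_t < 1-\epsilon), \\
w_t \hat{A}_t\,\nabla_{\theta'} \log \pi_{\theta'}(a_t \mid s_t), & \text{otherwise},
\end{cases} \\
\nabla_{\theta'}\bigl[w_t \hat{A}_t + \beta_t \log \pi_{\theta'}(a_t \mid s_t)\bigr]
&= \bigl(w_t \hat{A}_t + \beta_t\bigr)\,\nabla_{\theta'} \log \pi_{\theta'}(a_t \mid s_t), \\
\text{so the two agree for every } t \text{ with}\quad
\beta_t &= -\,w_t \hat{A}_t \cdot \mathbf{1}\bigl[t \in I_{\text{kill}}\bigr].
\end{aligned}
```

The case split holds away from the kinks at $w_t = 1 \pm \epsilon$, where the clipped term is not differentiable.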

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the standard definitions of the PPO clipped surrogate and the KL penalty; no additional free parameters or invented entities are introduced in the abstract.

axioms (1)

standard math: Standard definitions of the PPO clipped surrogate objective and the KL penalty objective as used in reference implementations.
The equivalence is stated to hold for these conventional definitions.

Reference graph

Works this paper leans on

27 extracted references · 8 canonical work pages · 4 internal anchors

[1] Marcin Andrychowicz, Anton Raichuk, Piotr Stańczyk, Manu Orsini, Sertan Girgin, Raphael Marinier, Léonard Hussenot, Matthieu Geist, Olivier Pietquin, Marcin Michalski, Sylvain Gelly, and Olivier Bachem. What matters in on-policy reinforcement learning? A large-scale empirical study. In International Conference on Learning Representations (ICLR), 2021.

[2] DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.

[3] Madeleine Dwyer, Adam Sobey, and Adriane Chapman. It’s not you, it’s clipping: A soft trust-region via probability smoothing for LLM RL. arXiv preprint arXiv:2509.21282, 2025.

[4] Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Firdaus Janoos, Larry Rudolph, and Aleksander Madry. Implementation matters in deep RL: A case study on PPO and TRPO. In International Conference on Learning Representations (ICLR), 2020.

[5] Nicolas Heess, Dhruva TB, Srinivasan Sriram, Jay Lemmon, Josh Merel, Greg Wayne, Yuval Tassa, Tom Erez, Ziyu Wang, S. M. Ali Eslami, Martin Riedmiller, and David Silver. Emergence of locomotion behaviours in rich environments. arXiv preprint arXiv:1707.02286, 2017.

[6] Chloe Ching-Yun Hsu, Celestine Mendler-Dünner, and Moritz Hardt. Revisiting design choices in proximal policy optimization. arXiv preprint arXiv:2009.10897, 2020.

[7] Shengyi Huang, Rousslan Fernand Julien Dossa, Chang Ye, Jeff Braga, Dipam Chakraborty, Kinal Mehta, and João G. M. Araújo. CleanRL: High-quality single-file implementations of deep reinforcement learning algorithms. Journal of Machine Learning Research, 23(274):1–18, 2022.

[8] Andrew Ilyas, Logan Engstrom, Shibani Santurkar, Dimitris Tsipras, Firdaus Janoos, Larry Rudolph, and Aleksander Madry. A closer look at deep policy gradients. In International Conference on Learning Representations (ICLR), 2020.

[9] Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In Proceedings of the 19th International Conference on Machine Learning (ICML), pages 267–274, 2002.

[10] Kezhao Liu, Jason Klein Liu, Mingtao Chen, and Yiming Liu. Rethinking KL regularization in RLHF: From value estimation to gradient optimization. arXiv preprint arXiv:2510.01555, 2025.

[11] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.

[12] Antonin Raffin, Ashley Hill, Adam Gleave, Anssi Kanervisto, Maximilian Ernestus, and Noah Dormann. Stable-Baselines3: Reliable reinforcement learning implementations. Journal of Machine Learning Research, 22(268):1–8, 2021.

[13] John Schulman, Sergey Levine, Pieter Abbeel, Michael I. Jordan, and Philipp Moritz. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pages 1889–1897, 2015.

[14] John Schulman, Philipp Moritz, Sergey Levine, Michael I. Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. In International Conference on Learning Representations (ICLR), 2016.

[15] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[16] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

[17] Mingfei Sun, Vitaly Kurin, Guoqing Liu, Sam Devlin, Tao Qin, Katja Hofmann, and Shimon Whiteson. You may not need ratio clipping in PPO. arXiv preprint arXiv:2202.00079, 2022.

[18] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2nd edition, 2018.

[19] Richard S. Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems (NIPS), 2000.

[20] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5026–5033, 2012.

[21] Yuhui Wang, Hao He, Xiaoyang Tan, and Yaozhong Gan. Trust region-guided proximal policy optimization. In Advances in Neural Information Processing Systems (NeurIPS), 2019.

[22] Yuhui Wang, Hao He, and Xiaoyang Tan. Truly proximal policy optimization. In Conference on Uncertainty in Artificial Intelligence (UAI), 2020.

[23] Zhengpeng Xie, Qiang Zhang, Fan Yang, Marco Hutter, and Renjing Xu. Simple policy optimization. In Proceedings of the 42nd International Conference on Machine Learning (ICML), pages 68813–68824, 2025.

[24] Yifan Zhang, Yifeng Liu, Huizhuo Yuan, Yang Yuan, Quanquan Gu, and Andrew Chi-Chih Yao. On the design of KL-regularized policy gradient algorithms for LLM reasoning. In International Conference on Learning Representations (ICLR), 2026.

A Weight-space dual: PPO-Clip as a $\Phi$ penalty. The per-sample $\beta_t$ identity of the main text places PPO-Clip in distribution ...
[25] The min formulation of Schulman et al. [15]: $L^{\mathrm{CLIP}} = \mathbb{E}_t\bigl[\min\bigl(w_t \hat{A}_t,\ \mathrm{clip}(w_t)\,\hat{A}_t\bigr)\bigr]$.

[26] The weight-space form of Theorem 2: $L^{\mathrm{CLIP}} = \mathbb{E}_t\bigl[w_t \hat{A}_t - \Phi(w_t, \hat{A}_t)\bigr]$, with penalty acting on $|w_t - 1|$ in importance-weight space.

[27] The per-sample KL form of the main theorem, which matches PPO-Clip in gradient: $\nabla_{\theta'} L^{\mathrm{CLIP}} = \nabla_{\theta'} \mathbb{E}_t\bigl[w_t \hat{A}_t + \beta_t \log \pi_{\theta'}(a_t \mid s_t)\bigr]$, $\beta_t = -w_t \hat{A}_t \cdot \mathbf{1}[(i, t) \in I_{\text{kill}}]$, with penalty acting in distribution space. The first two forms are equal as functions and differ only in surface notation; the third matches them in per-sample gradient and places PPO-Clip inside the PPO-KL family.
