
arxiv: 2407.15134 · v2 · submitted 2024-07-21 · 💻 cs.LG · cs.AI

Proximal Policy Distillation

Pith reviewed 2026-05-23 23:07 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords policy distillation · proximal policy optimization · reinforcement learning · sample efficiency · robustness · imperfect demonstrations · ATARI · Mujoco

The pith

Proximal Policy Distillation lets the student policy use its own collected rewards during training to achieve better sample efficiency and stronger final policies than standard distillation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Proximal Policy Distillation as a way to combine student-driven distillation with Proximal Policy Optimization. This setup lets the student draw on rewards from its own interactions instead of depending solely on the teacher signal. Tests across discrete and continuous control tasks show gains in sample efficiency and policy quality over two common baselines. The approach also holds up better when the teacher demonstrations contain errors. These outcomes point to a more practical route for transferring policies between networks of varying sizes.

Core claim

Proximal Policy Distillation integrates student-driven distillation with PPO so that the student can leverage additional rewards collected during its own rollouts; experiments on ATARI, Mujoco, and Procgen demonstrate that this produces higher sample efficiency and stronger student policies than student-distill or teacher-distill baselines, with added robustness when the teacher data is imperfect.

What carries the argument

Proximal Policy Distillation (PPD), a training loop that applies PPO updates to the student while performing distillation in a student-driven fashion.
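
A minimal sketch of what such a combined objective could look like, written in plain Python. It assumes the distillation term is a KL divergence from teacher to student action distributions, weighted by a fixed coefficient; the text here does not confirm either choice. The names `ppd_style_loss`, `kl_categorical`, `lam`, and `clip_eps` are hypothetical and are not taken from the paper or from sb3-distill.

```python
import math


def kl_categorical(p, q):
    """KL(p || q) for two discrete distributions given as probability lists.

    Assumes every entry of q is strictly positive.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def ppd_style_loss(ratios, advantages, teacher_probs, student_probs,
                   clip_eps=0.2, lam=1.0):
    """Illustrative batch loss: PPO clipped surrogate plus a distillation penalty.

    ratios:        pi_new(a|s) / pi_old(a|s) for each sampled step
    advantages:    advantage estimate for each step
    teacher_probs: teacher action distribution at each step
    student_probs: student action distribution at each step
    lam:           hypothetical distillation coefficient (an assumption)
    """
    n = len(ratios)
    surrogate = 0.0
    for r, adv in zip(ratios, advantages):
        clipped = max(1.0 - clip_eps, min(r, 1.0 + clip_eps))
        surrogate += min(r * adv, clipped * adv)
    surrogate /= n

    # Assumed form of the distillation term: mean KL(teacher || student)
    # over states visited by the student itself.
    distill = sum(kl_categorical(t, s)
                  for t, s in zip(teacher_probs, student_probs)) / n

    # PPO maximises the surrogate, so it enters the loss with a minus sign.
    return -surrogate + lam * distill


if __name__ == "__main__":
    teacher = [[0.9, 0.1], [0.2, 0.8]]
    student = [[0.6, 0.4], [0.5, 0.5]]
    print(ppd_style_loss([1.1, 0.9], [0.5, -0.3], teacher, student))
```

Setting `lam=0.0` recovers the plain PPO clipped surrogate, which makes the coefficient a natural knob for the kind of ablation the referee report asks for.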

If this is right

  • Student networks of different sizes can reach higher performance levels than the teacher when allowed to collect their own rewards.
  • Distillation remains effective even when teacher demonstrations contain errors or sub-optimal actions.
  • Fewer environment interactions are needed to reach a target policy quality compared with standard distillation.
  • The same procedure works for both discrete-action and continuous-control tasks without major changes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same student-driven loop could be paired with other on-policy algorithms that already use clipped or proximal updates.
  • Domains that rely on noisy human demonstrations, such as physical robot control, may see larger relative gains from this robustness property.
  • The released library makes it straightforward to test whether the efficiency advantage persists when the student starts from random weights rather than a pre-trained initialization.

Load-bearing premise

Combining student-driven distillation with PPO will let the student safely use its own rewards without introducing instability or forcing environment-specific retuning that cancels the reported gains.

What would settle it

Running the same set of distillation experiments on the same environments and networks and finding that PPD shows no consistent improvement in sample efficiency or final policy returns over the student-distill and teacher-distill baselines.

Figures

Figures reproduced from arXiv: 2407.15134 by Giacomo Spigler.

Figure 1: Training curves of PPD and student-distill in the setting of distillation onto a larger student
Figure 2: Performance of student models trained using PPD with different values of
Figure 3: Training curves for all Atari teachers, averaged over 5 random seeds. Shaded areas denote standard
Figure 4: Training curves for all Mujoco teachers, averaged over 5 random seeds. Shaded areas denote
Figure 5: Training curves for all Procgen teachers, averaged over 5 random seeds. Shaded areas denote
Figure 6: Same as Figure 1, but with distillation to a
Figure 7: Same as Figure 1, but with distillation to a
Figure 8: Same as Figure 1, showing distillation to a larger student, but also including teacher-distill. Note
Figure 9: Individual per-environment (Atari) training curves during distillation with PPD and student-distill
Figure 10: Individual per-environment (Mujoco) training curves during distillation with PPD and student
Figure 11: Individual per-environment (Procgen) training curves during distillation with PPD and student
Original abstract

We introduce Proximal Policy Distillation (PPD), a novel policy distillation method that integrates student-driven distillation and Proximal Policy Optimization (PPO) to increase sample efficiency and to leverage the additional rewards that the student policy collects during distillation. To assess the efficacy of our method, we compare PPD with two common alternatives, student-distill and teacher-distill, over a wide range of reinforcement learning environments that include discrete actions and continuous control (ATARI, Mujoco, and Procgen). For each environment and method, we perform distillation to a set of target student neural networks that are smaller, identical (self-distillation), or larger than the teacher network. Our findings indicate that PPD improves sample efficiency and produces better student policies compared to typical policy distillation approaches. Moreover, PPD demonstrates greater robustness than alternative methods when distilling policies from imperfect demonstrations. The code for the paper is released as part of a new Python library built on top of stable-baselines3 to facilitate policy distillation: 'sb3-distill'.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Proximal Policy Distillation (PPD), which augments standard policy distillation by adding a student-driven distillation term to the PPO surrogate objective. This is intended to improve sample efficiency by allowing the student to collect and use its own rewards during distillation. The method is evaluated against student-distill and teacher-distill baselines across ATARI, MuJoCo, and Procgen, for student networks that are smaller, equal, or larger than the teacher, with additional experiments on imperfect teacher demonstrations. The authors release the sb3-distill library built on stable-baselines3.

Significance. If the reported gains in sample efficiency and robustness hold under rigorous statistical controls, PPD would provide a practical way to combine on-policy improvement with distillation without requiring separate data collection phases. The open release of the sb3-distill library is a concrete contribution that lowers the barrier for follow-up work.

major comments (3)
  1. [§4] §4 (Experiments) and associated figures/tables: the central claims of improved sample efficiency and superior final policies rest on aggregate performance numbers, yet the manuscript provides no information on the number of random seeds, whether error bars represent standard error or deviation, or any statistical significance tests. Without these, it is impossible to assess whether the reported advantages over student-distill and teacher-distill are reliable across the three benchmark suites.
  2. [§3] §3 (Method) and Eq. (combined objective): the joint loss is the PPO clipped surrogate plus a distillation term evaluated on trajectories collected by the student itself. No analysis, bound, or ablation is given to show that the two gradient directions remain compatible or that the PPO clipping mechanism continues to guarantee improvement when the distillation coefficient is held fixed across environments. This directly bears on the weakest assumption identified in the review.
  3. [§4.3] §4.3 (Imperfect demonstrations): the robustness claim is load-bearing for the paper’s broader contribution, yet the procedure used to generate the imperfect teacher policies (e.g., training duration, noise level, or reward corruption) is not specified with sufficient detail to allow reproduction or to judge how “imperfect” the teachers actually are.
minor comments (2)
  1. [Abstract] The abstract states that distillation is performed “to a set of target student neural networks,” but the precise architecture sizes (layer widths, activation functions) and whether they are held constant across methods are not tabulated.
  2. [§4] Hyperparameter tables list a single set of values for the distillation coefficient, clip range, and learning rate; it is unclear whether these were tuned once on a validation environment or re-used verbatim for all ATARI, MuJoCo, and Procgen runs.
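
The statistical reporting requested in major comment 1 can be sketched as follows, using only the Python standard library. The example computes the standard error of the mean and a paired t statistic across seeds. The return values are invented for illustration and are not results from the paper; five seeds are used only to mirror the figure captions above, and the function names are hypothetical.

```python
import math
import statistics


def standard_error(values):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return statistics.stdev(values) / math.sqrt(len(values))


def paired_t_statistic(a, b):
    """t statistic for paired samples (same seeds, two methods)."""
    diffs = [x - y for x, y in zip(a, b)]
    return statistics.mean(diffs) / standard_error(diffs)


# Hypothetical final returns for 5 seeds; NOT results from the paper.
method_a = [1.10, 1.05, 1.20, 1.15, 1.00]
method_b = [1.00, 0.95, 1.05, 1.10, 0.90]

sem_a = standard_error(method_a)
t_stat = paired_t_statistic(method_a, method_b)
print(f"mean A = {statistics.mean(method_a):.3f} +/- {sem_a:.3f} (SEM)")
print(f"paired t = {t_stat:.2f} with {len(method_a) - 1} degrees of freedom")
```

Pairing by seed is only appropriate when both methods are run with matched seeds and environments; otherwise an unpaired test would be the safer choice.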

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to improve clarity, reproducibility, and analysis where needed.

  1. Referee: [§4] §4 (Experiments) and associated figures/tables: the central claims of improved sample efficiency and superior final policies rest on aggregate performance numbers, yet the manuscript provides no information on the number of random seeds, whether error bars represent standard error or deviation, or any statistical significance tests. Without these, it is impossible to assess whether the reported advantages over student-distill and teacher-distill are reliable across the three benchmark suites.

    Authors: We agree that details on random seeds, error bars, and statistical tests are necessary to support the claims. We will revise §4 and the associated figures to report the number of random seeds used, clarify that error bars represent standard error of the mean, and include statistical significance tests (e.g., paired t-tests) comparing PPD against the baselines. revision: yes

  2. Referee: [§3] §3 (Method) and Eq. (combined objective): the joint loss is the PPO clipped surrogate plus a distillation term evaluated on trajectories collected by the student itself. No analysis, bound, or ablation is given to show that the two gradient directions remain compatible or that the PPO clipping mechanism continues to guarantee improvement when the distillation coefficient is held fixed across environments. This directly bears on the weakest assumption identified in the review.

    Authors: The referee correctly identifies the absence of theoretical analysis or ablations on gradient compatibility and the effect of the fixed distillation coefficient on PPO's improvement guarantee. While the empirical results across environments support practical effectiveness, we will add a discussion in §3 along with an ablation on the distillation coefficient to address compatibility of the terms. revision: yes

  3. Referee: [§4.3] §4.3 (Imperfect demonstrations): the robustness claim is load-bearing for the paper’s broader contribution, yet the procedure used to generate the imperfect teacher policies (e.g., training duration, noise level, or reward corruption) is not specified with sufficient detail to allow reproduction or to judge how “imperfect” the teachers actually are.

    Authors: We agree that the current description of how imperfect teacher policies were generated is insufficient for reproducibility. We will expand §4.3 (and the appendix) with a complete specification of the procedure used to create the imperfect demonstrations. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical method with external benchmarks


The paper proposes PPD as an integration of student-driven distillation with PPO and evaluates it via direct comparisons against student-distill and teacher-distill baselines on external standard environments (ATARI, MuJoCo, Procgen) using multiple student network sizes. No equations, fitted parameters, or central claims are shown to reduce by construction to quantities defined by the method itself; results rest on experimental outcomes rather than self-referential definitions, self-citation chains, or renamed known patterns. This satisfies the default expectation of non-circularity for method papers whose claims are externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on standard RL assumptions and the existing PPO algorithm; no new free parameters or invented entities are introduced in the abstract.

axioms (1)
  • Domain assumption: environments are Markov Decision Processes where PPO can be applied to improve policies from collected rewards.
    The method description assumes standard RL dynamics and PPO applicability.



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Jump-Start Reinforcement Learning with Vision-Language-Action Regularization

    cs.LG 2026-04 unverdicted novelty 5.0

    VLAJS augments PPO with sparse annealed VLA guidance through directional regularization to cut required interactions by over 50% on manipulation tasks and enable zero-shot sim-to-real transfer.

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · cited by 1 Pith paper · 7 internal anchors

  1. [1]

    Imitating interactive intelligence

    Josh Abramson, Arun Ahuja, Iain Barr, Arthur Brussee, Federico Carnevale, Mary Cassin, Rachita Chhaparia, Stephen Clark, Bogdan Damoc, Andrew Dudzik, et al. Imitating interactive intelligence. arXiv preprint arXiv:2012.05672,

  2. [2]

    Beyond tabula rasa: Reincarnating reinforcement learning

    Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron Courville, and Marc G Bellemare. Beyond tabula rasa: Reincarnating reinforcement learning. arXiv preprint arXiv:2206.01626,

  3. [3]

    Solving Rubik's Cube with a Robot Hand

    Ilge Akkaya, Marcin Andrychowicz, Maciek Chociej, Mateusz Litwin, Bob McGrew, Arthur Petron, Alex Paino, Matthias Plappert, Glenn Powell, Raphael Ribas, et al. Solving rubik’s cube with a robot hand. arXiv preprint arXiv:1910.07113,

  4. [4]

    Dota 2 with Large Scale Deep Reinforcement Learning

    Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław Dębiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, et al. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680,

  5. [5]

    Leveraging Procedural Generation to Benchmark Reinforcement Learning, July 2020

    Karl Cobbe, Christopher Hesse, Jacob Hilton, and John Schulman. Leveraging procedural generation to benchmark reinforcement learning. arXiv preprint arXiv:1912.01588,

  6. [6]

    Born again neural networks

    Footnote 2: https://github.com/spiglerg/sb3_distill

    Published in Transactions on Machine Learning Research (06/2025)

    Tommaso Furlanello, Zachary Lipton, Michael Tschannen, Laurent Itti, and Anima Anandkumar. Born again neural networks. In International conference on machine learning, pp. 1607–1616,

  7. [7]

    Distillation Strategies for Proximal Policy Optimization

    Sam Green, Craig M Vineyard, and Cetin Kaya Koç. Distillation strategies for proximal policy optimization. arXiv preprint arXiv:1901.08128,

  8. [8]

    Population Based Training of Neural Networks

    Max Jaderberg, Valentin Dalibard, Simon Osindero, Wojciech M Czarnecki, Jeff Donahue, Ali Razavi, Oriol Vinyals, Tim Green, Iain Dunning, Karen Simonyan, et al. Population based training of neural networks. arXiv preprint arXiv:1711.09846,

  9. [9]

    Leveraging fully observable policies for learning under partial observability

    Hai Nguyen, Andrea Baisero, Dian Wang, Christopher Amato, and Robert Platt. Leveraging fully observable policies for learning under partial observability. arXiv preprint arXiv:2211.01991,

  10. [10]

    Learning Complex Dexterous Manipulation with Deep Reinforcement Learning and Demonstrations

    Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, Giulia Vezzani, John Schulman, Emanuel Todorov, and Sergey Levine. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. arXiv preprint arXiv:1709.10087,

  11. [11]

    A reduction of imitation learning and structured prediction to no-regret online learning

    Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 627–635. JMLR Workshop and Conference Proceedings,

  12. [12]

    Kickstarting Deep Reinforcement Learning

    Simon Schmitt, Jonathan J Hudson, Augustin Zidek, Simon Osindero, Carl Doersch, Wojciech M Czarnecki, Joel Z Leibo, Heinrich Kuttler, Andrew Zisserman, Karen Simonyan, et al. Kickstarting deep reinforcement learning. arXiv preprint arXiv:1803.03835,

  13. [13]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347,

  14. [14]

    Kazuma Tsuji, Ken’ichiro Tanaka, and Sebastian Pokutta

    URL https://zenodo.org/record/8127025. Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature, 575(7782):350–354,

  15. [15]

    PPD is shown in Algorithm 1, student-distill in Algorithm 2, and teacher-distill in Algorithm

    A Supplementary Methods

    A.1 Algorithm listings and baseline methods

    We include full algorithm listings for the three distillation methods compared in this work. PPD is shown in Algorithm 1, student-distill in Algorithm 2, and teacher-distill in Algorithm

  16. [16]

    for k = 1, 2,

    Algorithm 1: Proximal Policy Distillation
    Input: teacher policy π_teacher
    Initialize student policy π_θ and value function V_ϕ.
    for k = 1, 2, ... do
        Collect trajectories by running the student policy π_θ in the environment to fill a rollout buffer D_k = {(s_i, a_i, r_i, s′_i)} with n environment steps.
        Compute returns R̂_i and then advantage estimates Â_i.
        for epoch = 1, 2, n ep...
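
The "compute returns and then advantage estimates" step of the listing could be sketched as below. This assumes Generalized Advantage Estimation, which the listing does not name explicitly; the assumption rests on Table 3 listing both γ and λ. The defaults γ = 0.995 and λ = 0.9 are the teacher PPO values from that table, and whether the student uses the same values is not stated here. The function name and the `dones` convention are hypothetical.

```python
def gae_advantages(rewards, values, last_value, dones, gamma=0.995, lam=0.9):
    """Generalized Advantage Estimation over one rollout.

    rewards[t], values[t], dones[t] describe step t; dones[t] is True when
    the episode ended at step t. last_value is the value estimate for the
    state reached after the final step. Returns (advantages, returns).
    """
    n = len(rewards)
    advantages = [0.0] * n
    running = 0.0
    # Walk backwards so each step can reuse the accumulated future advantage.
    for t in reversed(range(n)):
        next_value = last_value if t == n - 1 else values[t + 1]
        not_done = 0.0 if dones[t] else 1.0
        # One-step temporal-difference error.
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        running = delta + gamma * lam * not_done * running
        advantages[t] = running
    # Value targets are the advantages added back onto the value estimates.
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


if __name__ == "__main__":
    adv, ret = gae_advantages(
        rewards=[1.0, 0.0, 1.0],
        values=[0.5, 0.4, 0.3],
        last_value=0.2,
        dones=[False, False, False],
    )
    print(adv, ret)
```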

  17. [17]

    • Mujoco (5 environments): Ant-v4, HalfCheetah-v4, Hopper-v4, Swimmer-v4, Humanoid-v4

    and Procgen (Cobbe et al., 2019):
    • Atari (11 environments): AtlantisNoFrameskip-v4, SeaquestNoFrameskip-v4, BeamRiderNoFrameskip-v4, EnduroNoFrameskip-v4, FreewayNoFrameskip-v4, MsPacmanNoFrameskip-v4, PongNoFrameskip-v4, QbertNoFrameskip-v4, ZaxxonNoFrameskip-v4, DemonAttackNoFrameskip-v4, CrazyClimberNoFrameskip-v4.
    • Mujoco (5 environments): Ant-v4, H...

  18. [18]

    Table 3: PPO hyperparameters used for training the teacher models

    Simple hyperparameter tuning was initially performed, starting from parameter values suggested from the stable-baselines3 model zoo (Raffin, 2020).

    Table 3: PPO hyperparameters used for training the teacher models.

    Hyperparameter   Value
    n_envs           18
    n_steps          256 (Atari, swimmer, hopper), 512 (Procgen, Mujoco)
    batch_size       512
    γ                0.995
    λ                0.9
    lr               3e-4
    n_epochs         4
    ent...

  19. [19]

    with convolutional filters {32, 32, 32} (8s4, 4s2, 3s1) and a fully connected layer of 128 units, resulting in ∼0.25x the number of parameters of the teacher. The larger networks for Atari and Procgen were IMPALA-CNNs with {32, 64, 64} convolutional filters and 1024 units in the fully connec...

  20. [20]

    is worse than both other models, suggesting significant overfitting. We then report individual scores for all environments and students in Table 5, and the individual, per-environment training trajectories for distillation to larger student networks for PPD and student-distill in Figures 9, 10, and 11, that is, the training curves that are combined to for...

  21. [21]

    is worse than both other models, suggesting significant overfitting. Table 5: The table extends Table 1 from the main text by reporting results for each environment separately, averaged over 5 random seeds. Atari games are reported as human-normalized scores. env teacher smaller same-size...
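
For readers unfamiliar with the term, a human-normalized score is commonly computed by rescaling a raw game score so that random play maps to 0 and a human reference maps to 1. The sketch below shows that common definition; the exact baselines used in the paper are not given in this excerpt, and the function name and numbers are hypothetical.

```python
def human_normalized_score(agent, random_baseline, human_baseline):
    """Return (agent - random) / (human - random).

    0.0 corresponds to random play and 1.0 to the human reference score.
    """
    return (agent - random_baseline) / (human_baseline - random_baseline)


# Hypothetical raw scores, for illustration only.
print(human_normalized_score(agent=550.0, random_baseline=50.0, human_baseline=1050.0))  # 0.5
```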

  22. [22]

    Table 6: Full results for each environment, extending Table 2 from the main text

    The results from this table are aggregated and shown in Table 2 from the main text. Table 6: Full results for each environment, extending Table 2 from the main text. We show the performance of student models, trained using the three distillation methods (PPD, student-distill, and teacher-distill) from ‘imperfect teachers’ that are artificially corrupted t...