pith. machine review for the scientific record.

arxiv: 1511.06295 · v2 · submitted 2015-11-19 · 💻 cs.LG

Recognition: unknown

Policy Distillation

Authors on Pith: no claims yet
classification 💻 cs.LG
keywords policy, agent, called, deep, distillation, learning, method, policies
read the original abstract

Policies for complex visual tasks have been successfully learned with deep reinforcement learning, using an approach called deep Q-networks (DQN), but relatively large (task-specific) networks and extensive training are needed to achieve good performance. In this work, we present a novel method called policy distillation that can be used to extract the policy of a reinforcement learning agent and train a new network that performs at the expert level while being dramatically smaller and more efficient. Furthermore, the same method can be used to consolidate multiple task-specific policies into a single policy. We demonstrate these claims using the Atari domain and show that the multi-task distilled agent outperforms the single-task teachers as well as a jointly-trained DQN agent.
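The core of the method the abstract describes is a supervised transfer step: the teacher DQN's Q-values are sharpened with a low softmax temperature into target action distributions, and a smaller student network is trained to match them under a KL-divergence loss. A minimal numpy sketch of that loss, with illustrative function names and a temperature value chosen only as an assumption for the example:

```python
import numpy as np

def softmax(x, tau=1.0):
    # Temperature-scaled softmax over the last axis, shifted for numerical stability.
    z = x / tau
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(teacher_q, student_logits, tau=0.01):
    """KL(teacher || student) distillation target.

    teacher_q:      (batch, actions) Q-values from the expert DQN.
    student_logits: (batch, actions) raw outputs of the smaller student net.
    tau:            low temperature that sharpens the teacher's Q-values
                    into near-deterministic action distributions.
    """
    p = softmax(teacher_q, tau)                 # soft teacher targets
    log_q = np.log(softmax(student_logits) + 1e-12)
    kl = (p * (np.log(p + 1e-12) - log_q)).sum(axis=-1)
    return float(kl.mean())
```

The loss is zero when the student's action distribution matches the sharpened teacher distribution, and positive otherwise, so minimizing it with any gradient-based optimizer pulls the compact student toward expert-level action choices. Multi-task consolidation, as described in the abstract, amounts to applying the same loss with a different teacher per task while sharing one student.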

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. VISD: Enhancing Video Reasoning via Structured Self-Distillation

    cs.CV 2026-05 unverdicted novelty 7.0

    VISD improves VideoLLM reasoning performance and training efficiency by combining structured multi-dimensional self-distillation feedback with RL via direction-magnitude decoupling, curriculum scheduling, and EMA stab...

  2. SafeAdapt: Provably Safe Policy Updates in Deep Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    SafeAdapt certifies a Rashomon set of safe policies from demonstration data and projects updates from arbitrary RL algorithms onto it to guarantee preservation of safety on source tasks.

  3. Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    OPD for LLMs suffers length inflation and repetition collapse; StableOPD uses reference divergence and rollout mixing to prevent it and improve math reasoning performance by 7.2% on average.

  4. Policy-DRIFT: Dynamic Reward-Informed Flow Trajectory Steering

    physics.flu-dyn 2026-05 unverdicted novelty 6.0

    Policy-DRIFT combines conditional flow matching with terminal reward guidance and decoupled DRL to achieve 49% drag reduction in Re_tau=180 channel flow, 16% above DRL benchmarks and with 37 times less actuation energy.

  5. Demystifying Deep Reinforcement Learning: A Neuro-Symbolic Framework for Interpretable Open RAN Automation

    cs.NI 2026-05 unverdicted novelty 6.0

    DeRAN converts black-box DRL policies into interpretable symbolic representations for O-RAN automation, retaining 78-87% of original performance while adding built-in transparency.

  6. Demystifying Deep Reinforcement Learning: A Neuro-Symbolic Framework for Interpretable Open RAN Automation

    cs.NI 2026-05 unverdicted novelty 6.0

    DeRAN converts opaque DRL policies for O-RAN tasks into interpretable symbolic policies via concept abstraction, deep symbolic regression, and neurally guided logic, retaining 78-87% of DRL performance on a live 5G testbed.

  7. Precise Aggressive Aerial Maneuvers with Sensorimotor Policies

    cs.RO 2026-04 unverdicted novelty 6.0

    Reinforcement learning sensorimotor policies enable quadrotors to traverse narrow gaps at extreme tilts with 5 cm clearance using only vision and proprioception, including reactive traversal of moving gaps.

  8. MiniLLM: On-Policy Distillation of Large Language Models

    cs.CL 2023-06 conditional novelty 6.0

    MiniLLM distills large language models into smaller ones via reverse KL divergence and on-policy optimization, yielding higher-quality responses with lower exposure bias than standard KD baselines.

  9. Progressive Neural Networks

    cs.LG 2016-06 unverdicted novelty 6.0

    Progressive neural networks learn sequences of RL tasks without catastrophic forgetting by freezing prior columns and adding lateral connections for knowledge transfer.

  10. Robot Squid Game: Quadrupedal Locomotion for Traversing Narrow Tunnels

    cs.RO 2026-05 unverdicted novelty 5.0

    A teacher-student RL policy distillation approach combined with procedural tunnel generation enables quadruped robots to traverse narrow tunnels consistently in both simulation and real-world tests.

  11. VISD: Enhancing Video Reasoning via Structured Self-Distillation

    cs.CV 2026-05 unverdicted novelty 5.0

    VISD improves VideoLLM reasoning by adding multi-dimensional diagnostic self-distillation and RL decoupling, yielding higher accuracy, better grounding, and nearly 2x faster training convergence.

  12. VISD: Enhancing Video Reasoning via Structured Self-Distillation

    cs.CV 2026-05 unverdicted novelty 5.0

    VISD adds structured privileged feedback from a judge model and a direction-magnitude decoupling trick to let VideoLLMs learn token-level credit assignment while keeping RL stable, yielding higher accuracy and roughly...

  13. LANTERN: LLM-Augmented Neurosymbolic Transfer with Experience-Gated Reasoning Networks

    cs.AI 2026-05 unverdicted novelty 5.0

    LANTERN improves RL sample efficiency by 40-60% via LLM-generated task automata, semantic multi-source policy aggregation, and experience-gated adaptive transfer.

  14. Combining Trained Models in Reinforcement Learning

    cs.LG 2026-05 accept novelty 5.0

    A review of 15 studies finds positive transfer in DRL mainly when source and target tasks share structure or include alignment mechanisms, but compute-matched comparisons against from-scratch baselines remain rare.

  15. Detecting is Easy, Adapting is Hard: Local Expert Growth for Visual Model-Based Reinforcement Learning under Distribution Shift

    cs.LG 2026-04 unverdicted novelty 5.0

    JEPA-Indexed Local Expert Growth adds local action corrections for detected shift clusters and yields statistically significant OOD gains on four shift conditions while keeping in-distribution performance intact.

  16. ReFineVLA: Multimodal Reasoning-Aware Generalist Robotic Policies via Teacher-Guided Fine-Tuning

    cs.RO 2026-04 unverdicted novelty 5.0

    ReFineVLA adds teacher-generated reasoning steps to VLA training and reports state-of-the-art success rates on SimplerEnv WidowX and Google Robot benchmarks.

  17. Digital Guardians: The Past and The Future of Cyber-Physical Resilience

    cs.CR 2026-04 unverdicted novelty 3.0

    A survey frames CPS resilience through five themes and illustrates them in connected transportation and medical systems to provide a roadmap for real-world resilience.