
Policy distillation

Andrei A Rusu, Sergio Gomez Colmenarejo, Caglar Gulcehre, Guillaume Desjardins, James Kirkpatrick, Razvan Pascanu, Volodymyr Mnih, Koray Kavukcuoglu, Raia Hadsell · 2015 · cs.LG · arXiv 1511.06295

Canonical reference. 80% of citing Pith papers cite this work as background.

18 Pith papers citing it


abstract

Policies for complex visual tasks have been successfully learned with deep reinforcement learning, using an approach called deep Q-networks (DQN), but relatively large (task-specific) networks and extensive training are needed to achieve good performance. In this work, we present a novel method called policy distillation that can be used to extract the policy of a reinforcement learning agent and train a new network that performs at the expert level while being dramatically smaller and more efficient. Furthermore, the same method can be used to consolidate multiple task-specific policies into a single policy. We demonstrate these claims using the Atari domain and show that the multi-task distilled agent outperforms the single-task teachers as well as a jointly-trained DQN agent.
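The abstract does not spell out the training objective. As a minimal sketch only, assuming a KL divergence between temperature-sharpened teacher Q-values and the student's action distribution (a common choice for distillation), the per-state loss might look like the following. The function names, the example values, and the temperature of 0.01 are illustrative assumptions, not details taken from this page.

```python
import math


def log_softmax(values, temperature=1.0):
    """Numerically stable log-softmax of values / temperature."""
    scaled = [v / temperature for v in values]
    peak = max(scaled)
    log_total = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_total for s in scaled]


def distillation_kl_loss(teacher_q, student_logits, temperature=0.01):
    """KL(teacher || student) for a single state.

    teacher_q: Q-values produced by the trained teacher (assumed given).
    student_logits: unnormalized action scores from the smaller student.
    temperature: hypothetical sharpening factor applied to the teacher only.
    """
    log_p = log_softmax(teacher_q, temperature)  # sharpened teacher target
    log_q = log_softmax(student_logits)          # student distribution
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


# Hypothetical Q-values for a three-action state; action 1 is the teacher's choice.
teacher_q = [1.0, 2.0, 0.5]
agreeing_student = [0.0, 5.0, 0.0]
disagreeing_student = [5.0, 0.0, 0.0]

loss_agree = distillation_kl_loss(teacher_q, agreeing_student)
loss_disagree = distillation_kl_loss(teacher_q, disagreeing_student)
```

In this sketch the loss is small when the student puts most of its probability on the teacher's preferred action and large when it prefers a different one. A multi-task variant would presumably sum such losses over states drawn from several teachers, but that is not shown here.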

citation-role summary

background: 4 · method: 1

citation-polarity summary

background: 4 · use method: 1

representative citing papers

What Does Deep Hedging Actually Learn? Delta Corrections, Regime Fragility, and Symbolic Distillation

q-fin.RM · 2026-05-20 · unverdicted · novelty 7.0

Deep hedging agents learn a systematic delta haircut explained by spot-implied-volatility co-movement; symbolic regression distills the policies into formulas that retain reward and downside-variance advantages over Black-Scholes but inherit regime fragility.

SafeAdapt: Provably Safe Policy Updates in Deep Reinforcement Learning

cs.LG · 2026-04-10 · unverdicted · novelty 7.0

SafeAdapt certifies a Rashomon set of safe policies from demonstration data and projects updates from arbitrary RL algorithms onto it to guarantee preservation of safety on source tasks.

Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models

cs.CL · 2026-04-09 · unverdicted · novelty 7.0

OPD for LLMs suffers length inflation and repetition collapse; StableOPD uses reference divergence and rollout mixing to prevent it and improve math reasoning performance by 7.2% on average.

Policy-DRIFT: Dynamic Reward-Informed Flow Trajectory Steering

physics.flu-dyn · 2026-05-13 · unverdicted · novelty 6.0

Policy-DRIFT combines conditional flow matching with terminal reward guidance and decoupled DRL to achieve 49% drag reduction in Re_tau=180 channel flow, 16% above DRL benchmarks and with 37 times less actuation energy.

Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning

cs.LG · 2024-10-10 · unverdicted · novelty 6.0

Process advantage verifiers trained to predict step-level progress under a distinct prover policy improve LLM reasoning accuracy by over 8% and sample efficiency by 5-6x over outcome reward models.

Continual Domain Randomization

cs.RO · 2024-03-18 · unverdicted · novelty 6.0

Continual Domain Randomization trains RL policies sequentially on randomization parameter subsets with continual learning to achieve robust sim-to-real transfer in robotic reaching and grasping.

Attentive Multi-Task Deep Reinforcement Learning

cs.LG · 2019-07-05 · unverdicted · novelty 6.0

Attention mechanism dynamically groups task knowledge at state granularity in multi-task DRL to enable positive transfer and avoid negative transfer, matching or exceeding prior methods with fewer parameters.

Demystifying Deep Reinforcement Learning: A Neuro-Symbolic Framework for Interpretable Open RAN Automation

cs.NI · 2026-05-11 · unverdicted · novelty 6.0 · 2 refs

DeRAN converts black-box DRL policies into interpretable symbolic representations for O-RAN automation, retaining 78-87% of original performance while adding built-in transparency.

Precise Aggressive Aerial Maneuvers with Sensorimotor Policies

cs.RO · 2026-04-07 · unverdicted · novelty 6.0

Reinforcement learning sensorimotor policies enable quadrotors to traverse narrow gaps at extreme tilts with 5 cm clearance using only vision and proprioception, including reactive traversal of moving gaps.

MiniLLM: On-Policy Distillation of Large Language Models

cs.CL · 2023-06-14 · conditional · novelty 6.0

MiniLLM distills large language models into smaller ones via reverse KL divergence and on-policy optimization, yielding higher-quality responses with lower exposure bias than standard KD baselines.

Progressive Neural Networks

cs.LG · 2016-06-15 · unverdicted · novelty 6.0

Progressive neural networks learn sequences of RL tasks without catastrophic forgetting by freezing prior columns and adding lateral connections for knowledge transfer.

Robot Squid Game: Quadrupedal Locomotion for Traversing Narrow Tunnels

cs.RO · 2026-05-13 · unverdicted · novelty 5.0

A teacher-student RL policy distillation approach combined with procedural tunnel generation enables quadruped robots to traverse narrow tunnels consistently in both simulation and real-world tests.

VISD: Enhancing Video Reasoning via Structured Self-Distillation

cs.CV · 2026-05-07 · unverdicted · novelty 5.0 · 4 refs

VISD proposes structured self-distillation with a multi-dimensional judge model and direction-magnitude decoupling to improve token-level credit assignment and convergence speed in VideoLLM reasoning training.

LANTERN: LLM-Augmented Neurosymbolic Transfer with Experience-Gated Reasoning Networks

cs.AI · 2026-05-06 · unverdicted · novelty 5.0

LANTERN improves RL sample efficiency by 40-60% via LLM-generated task automata, semantic multi-source policy aggregation, and experience-gated adaptive transfer.

Combining Trained Models in Reinforcement Learning

cs.LG · 2026-05-04 · accept · novelty 5.0

A review of 15 studies finds positive transfer in DRL mainly when source and target tasks share structure or include alignment mechanisms, but compute-matched comparisons against from-scratch baselines remain rare.

Detecting is Easy, Adapting is Hard: Local Expert Growth for Visual Model-Based Reinforcement Learning under Distribution Shift

cs.LG · 2026-04-30 · unverdicted · novelty 5.0

JEPA-Indexed Local Expert Growth adds local action corrections for detected shift clusters and yields statistically significant OOD gains on four shift conditions while keeping in-distribution performance intact.

ReFineVLA: Multimodal Reasoning-Aware Generalist Robotic Policies via Teacher-Guided Fine-Tuning

cs.RO · 2026-04-20 · unverdicted · novelty 5.0

ReFineVLA adds teacher-generated reasoning steps to VLA training and reports state-of-the-art success rates on SimplerEnv WidowX and Google Robot benchmarks.

Digital Guardians: The Past and The Future of Cyber-Physical Resilience

cs.CR · 2026-04-15 · unverdicted · novelty 3.0

A survey frames CPS resilience through five themes and illustrates them in connected transportation and medical systems to provide a roadmap for real-world resilience.

citing papers explorer

Showing 18 of 18 citing papers.

What Does Deep Hedging Actually Learn? Delta Corrections, Regime Fragility, and Symbolic Distillation q-fin.RM · 2026-05-20 · unverdicted · none · ref 9 · internal anchor
SafeAdapt: Provably Safe Policy Updates in Deep Reinforcement Learning cs.LG · 2026-04-10 · unverdicted · none · ref 30
Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models cs.CL · 2026-04-09 · unverdicted · none · ref 16
Policy-DRIFT: Dynamic Reward-Informed Flow Trajectory Steering physics.flu-dyn · 2026-05-13 · unverdicted · none · ref 7 · internal anchor
Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning cs.LG · 2024-10-10 · unverdicted · none · ref 17 · internal anchor
Continual Domain Randomization cs.RO · 2024-03-18 · unverdicted · none · ref 28 · internal anchor
Attentive Multi-Task Deep Reinforcement Learning cs.LG · 2019-07-05 · unverdicted · none · ref 25 · internal anchor
Demystifying Deep Reinforcement Learning: A Neuro-Symbolic Framework for Interpretable Open RAN Automation cs.NI · 2026-05-11 · unverdicted · none · ref 20 · 2 links
Precise Aggressive Aerial Maneuvers with Sensorimotor Policies cs.RO · 2026-04-07 · unverdicted · none · ref 28
MiniLLM: On-Policy Distillation of Large Language Models cs.CL · 2023-06-14 · conditional · none · ref 16
Progressive Neural Networks cs.LG · 2016-06-15 · unverdicted · none · ref 17
Robot Squid Game: Quadrupedal Locomotion for Traversing Narrow Tunnels cs.RO · 2026-05-13 · unverdicted · none · ref 14 · internal anchor
VISD: Enhancing Video Reasoning via Structured Self-Distillation cs.CV · 2026-05-07 · unverdicted · none · ref 33 · 4 links · internal anchor
LANTERN: LLM-Augmented Neurosymbolic Transfer with Experience-Gated Reasoning Networks cs.AI · 2026-05-06 · unverdicted · none · ref 18
Combining Trained Models in Reinforcement Learning cs.LG · 2026-05-04 · accept · none · ref 3
Detecting is Easy, Adapting is Hard: Local Expert Growth for Visual Model-Based Reinforcement Learning under Distribution Shift cs.LG · 2026-04-30 · unverdicted · none · ref 19
ReFineVLA: Multimodal Reasoning-Aware Generalist Robotic Policies via Teacher-Guided Fine-Tuning cs.RO · 2026-04-20 · unverdicted · none · ref 30
Digital Guardians: The Past and The Future of Cyber-Physical Resilience cs.CR · 2026-04-15 · unverdicted · none · ref 196
