pith. machine review for the scientific record.

arxiv: 1511.06295 · v2 · submitted 2015-11-19 · 💻 cs.LG

Recognition: unknown

Policy Distillation

Authors on Pith: no claims yet
classification 💻 cs.LG
keywords policy, agent, called, deep, distillation, learning, method, policies
read the original abstract

Policies for complex visual tasks have been successfully learned with deep reinforcement learning, using an approach called deep Q-networks (DQN), but relatively large (task-specific) networks and extensive training are needed to achieve good performance. In this work, we present a novel method called policy distillation that can be used to extract the policy of a reinforcement learning agent and train a new network that performs at the expert level while being dramatically smaller and more efficient. Furthermore, the same method can be used to consolidate multiple task-specific policies into a single policy. We demonstrate these claims using the Atari domain and show that the multi-task distilled agent outperforms the single-task teachers as well as a jointly-trained DQN agent.
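The core of the method the abstract describes is a supervised transfer step: the teacher DQN's Q-values are sharpened with a low softmax temperature into target action distributions, and a smaller student network is trained to match them under a KL-divergence loss. A minimal numpy sketch of that loss, with illustrative function names and a temperature value chosen only as an assumption for the example:

```python
import numpy as np

def softmax(x, tau=1.0):
    # Temperature-scaled softmax over the last axis, shifted for numerical stability.
    z = x / tau
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(teacher_q, student_logits, tau=0.01):
    """KL(teacher || student) distillation target.

    teacher_q:      (batch, actions) Q-values from the expert DQN.
    student_logits: (batch, actions) raw outputs of the smaller student net.
    tau:            low temperature that sharpens the teacher's Q-values
                    into near-deterministic action distributions.
    """
    p = softmax(teacher_q, tau)                 # soft teacher targets
    log_q = np.log(softmax(student_logits) + 1e-12)
    kl = (p * (np.log(p + 1e-12) - log_q)).sum(axis=-1)
    return float(kl.mean())
```

The loss is zero when the student's action distribution matches the sharpened teacher distribution, and positive otherwise, so minimizing it with any gradient-based optimizer pulls the compact student toward expert-level action choices. Multi-task consolidation, as described in the abstract, amounts to applying the same loss with a different teacher per task while sharing one student.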

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. VISD: Enhancing Video Reasoning via Structured Self-Distillation

    cs.CV 2026-05 unverdicted novelty 7.0

    VISD improves VideoLLM reasoning performance and training efficiency by combining structured multi-dimensional self-distillation feedback with RL via direction-magnitude decoupling, curriculum scheduling, and EMA stab...

  2. SafeAdapt: Provably Safe Policy Updates in Deep Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    SafeAdapt certifies a Rashomon set of safe policies from demonstration data and projects updates from arbitrary RL algorithms onto it to guarantee preservation of safety on source tasks.

  3. Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    OPD for LLMs suffers length inflation and repetition collapse; StableOPD uses reference divergence and rollout mixing to prevent it and improve math reasoning performance by 7.2% on average.

  4. Policy-DRIFT: Dynamic Reward-Informed Flow Trajectory Steering

    physics.flu-dyn 2026-05 unverdicted novelty 6.0

    Policy-DRIFT combines conditional flow matching with terminal reward guidance and decoupled DRL to achieve 49% drag reduction in Re_tau=180 channel flow, 16% above DRL benchmarks and with 37 times less actuation energy.

  5. Demystifying Deep Reinforcement Learning: A Neuro-Symbolic Framework for Interpretable Open RAN Automation

    cs.NI 2026-05 unverdicted novelty 6.0

    DeRAN converts black-box DRL policies into interpretable symbolic representations for O-RAN automation, retaining 78-87% of original performance while adding built-in transparency.

  6. Demystifying Deep Reinforcement Learning: A Neuro-Symbolic Framework for Interpretable Open RAN Automation

    cs.NI 2026-05 unverdicted novelty 6.0

    DeRAN converts opaque DRL policies for O-RAN tasks into interpretable symbolic policies via concept abstraction, deep symbolic regression, and neurally guided logic, retaining 78-87% of DRL performance on a live 5G testbed.

  7. Precise Aggressive Aerial Maneuvers with Sensorimotor Policies

    cs.RO 2026-04 unverdicted novelty 6.0

    Reinforcement learning sensorimotor policies enable quadrotors to traverse narrow gaps at extreme tilts with 5 cm clearance using only vision and proprioception, including reactive traversal of moving gaps.

  8. MiniLLM: On-Policy Distillation of Large Language Models

    cs.CL 2023-06 conditional novelty 6.0

    MiniLLM distills large language models into smaller ones via reverse KL divergence and on-policy optimization, yielding higher-quality responses with lower exposure bias than standard KD baselines.

  9. Progressive Neural Networks

    cs.LG 2016-06 unverdicted novelty 6.0

    Progressive neural networks learn sequences of RL tasks without catastrophic forgetting by freezing prior columns and adding lateral connections for knowledge transfer.

  10. Robot Squid Game: Quadrupedal Locomotion for Traversing Narrow Tunnels

    cs.RO 2026-05 unverdicted novelty 5.0

    A teacher-student RL policy distillation approach combined with procedural tunnel generation enables quadruped robots to traverse narrow tunnels consistently in both simulation and real-world tests.

  11. VISD: Enhancing Video Reasoning via Structured Self-Distillation

    cs.CV 2026-05 unverdicted novelty 5.0

    VISD improves VideoLLM reasoning by adding multi-dimensional diagnostic self-distillation and RL decoupling, yielding higher accuracy, better grounding, and nearly 2x faster training convergence.

  12. VISD: Enhancing Video Reasoning via Structured Self-Distillation

    cs.CV 2026-05 unverdicted novelty 5.0

    VISD adds structured privileged feedback from a judge model and a direction-magnitude decoupling trick to let VideoLLMs learn token-level credit assignment while keeping RL stable, yielding higher accuracy and roughly...

  13. LANTERN: LLM-Augmented Neurosymbolic Transfer with Experience-Gated Reasoning Networks

    cs.AI 2026-05 unverdicted novelty 5.0

    LANTERN improves RL sample efficiency by 40-60% via LLM-generated task automata, semantic multi-source policy aggregation, and experience-gated adaptive transfer.

  14. Combining Trained Models in Reinforcement Learning

    cs.LG 2026-05 accept novelty 5.0

    A review of 15 studies finds positive transfer in DRL mainly when source and target tasks share structure or include alignment mechanisms, but compute-matched comparisons against from-scratch baselines remain rare.

  15. Detecting is Easy, Adapting is Hard: Local Expert Growth for Visual Model-Based Reinforcement Learning under Distribution Shift

    cs.LG 2026-04 unverdicted novelty 5.0

    JEPA-Indexed Local Expert Growth adds local action corrections for detected shift clusters and yields statistically significant OOD gains on four shift conditions while keeping in-distribution performance intact.

  16. ReFineVLA: Multimodal Reasoning-Aware Generalist Robotic Policies via Teacher-Guided Fine-Tuning

    cs.RO 2026-04 unverdicted novelty 5.0

    ReFineVLA adds teacher-generated reasoning steps to VLA training and reports state-of-the-art success rates on SimplerEnv WidowX and Google Robot benchmarks.

  17. Digital Guardians: The Past and The Future of Cyber-Physical Resilience

    cs.CR 2026-04 unverdicted novelty 3.0

    A survey frames CPS resilience through five themes and illustrates them in connected transportation and medical systems to provide a roadmap for real-world resilience.