Self-Distilled Policy Gradient

Quanquan Gu; Shiyuan Zhang; Yifan Zhang; Yifeng Liu

arxiv: 2606.04036 · v1 · pith:CRXPDDUKnew · submitted 2026-06-02 · 💻 cs.LG

Yifeng Liu, Shiyuan Zhang, Yifan Zhang, Quanquan Gu

classification 💻 cs.LG

keywords sdpg · self-distillation · full-vocabulary · on-policy · self-distilled · actually · advantages · auxiliary


On-policy self-distillation, in which a language model conditions on privileged context to supervise its own generations, is a promising source of dense supervision for sparse-reward reinforcement learning. It can be instantiated as an auxiliary full-vocabulary reverse Kullback-Leibler (KL) divergence loss from student to teacher. We therefore propose SDPG, a self-distilled policy-gradient framework that combines three components: group-relative verifier advantages normalized by standard deviation, exact full-vocabulary on-policy self-distillation, and reference-policy KL regularization. Empirically, SDPG improves stability and performance over RLVR and self-distillation baselines. The code is available at https://github.com/lauyikfung/SDPG.
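To make the three ingredients named in the abstract concrete, here is a minimal NumPy sketch of how such an objective could be assembled. It is not taken from the paper or the linked repository. The function names, tensor shapes, the weights `alpha` and `beta`, the `eps` constant, and the choice to compute the reference-policy KL over the full vocabulary are all assumptions made for illustration. The sketch only evaluates the loss value; gradient handling (for example, treating teacher and reference logits as constants) is omitted.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def group_relative_advantages(rewards, eps=1e-6):
    # Center verifier rewards within one group of samples for the same prompt
    # and divide by the group standard deviation (eps is an assumed constant).
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def reverse_kl(student_logits, other_logits):
    # Exact KL(student || other) per token position, summed over the full vocabulary.
    log_p = log_softmax(student_logits)
    log_q = log_softmax(other_logits)
    return (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)

def sdpg_style_loss(student_logits, teacher_logits, ref_logits,
                    tokens, rewards, alpha=0.1, beta=0.01):
    # Hypothetical shapes: logits (G, T, V), tokens (G, T), rewards (G,).
    # alpha and beta are illustrative weights, not values from the paper.
    adv = group_relative_advantages(rewards)                  # (G,)
    log_p = log_softmax(student_logits)                       # (G, T, V)
    tok_logp = np.take_along_axis(log_p, tokens[..., None], axis=-1).squeeze(-1)
    pg_term = -(adv[:, None] * tok_logp).mean()               # policy-gradient surrogate
    distill_term = reverse_kl(student_logits, teacher_logits).mean()
    ref_term = reverse_kl(student_logits, ref_logits).mean()
    return pg_term + alpha * distill_term + beta * ref_term

# Toy usage with random logits: 4 sampled responses, 3 tokens each, vocabulary of 5.
rng = np.random.default_rng(0)
G, T, V = 4, 3, 5
student = rng.normal(size=(G, T, V))
teacher = rng.normal(size=(G, T, V))   # stands in for the privileged-context teacher
reference = rng.normal(size=(G, T, V)) # stands in for the frozen reference policy
sampled = rng.integers(0, V, size=(G, T))
demo_loss = sdpg_style_loss(student, teacher, reference, sampled, [1.0, 0.0, 1.0, 0.0])
```

In this sketch the teacher and student would be the same model, with the teacher's logits computed under the privileged context. Computing the divergence over the whole vocabulary, rather than only at the sampled token, is what "exact full-vocabulary" is taken to mean here.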
