
arxiv: 2606.01039 · v1 · pith:Z2UJLEPKnew · submitted 2026-05-31 · 💻 cs.LG · cs.AI

OPD+: Rethinking the Advantage Design for On-Policy Distillation

classification: cs.LG, cs.AI
keywords: student, design, advantage, distillation, divergence, f-divergence, general, gradient

On-policy distillation (OPD) is a widely used technique for transferring capabilities from capable teacher language models to base student models, and it can be formulated as a reinforcement-learning-style objective over student-generated rollouts. Yet although the divergence reward depends on the student model's likelihood, existing work usually applies a stop-gradient to it, primarily for stability, which makes the resulting advantage estimate questionable. In this work, we provide a generic optimization framework based on the f-divergence between the student and the teacher, and mathematically revisit whether this design choice is valid. We prove that, for general divergence functions, a stop-gradient operation leads to biased estimates of the reward objective and of the corresponding gradient. We propose OPD+, a corrected version of OPD that improves on the baseline KL approach and also supports a choice of various f-divergences. We validate our findings on mathematical reasoning and tool-use benchmarks.
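
The bias claim can be illustrated on a toy problem. The sketch below is not the paper's derivation and is not OPD+. It assumes a three-token categorical student `softmax(logits)`, a fixed teacher `q`, and a divergence written as an expectation over student samples of a per-sample cost `c(p/q)` (the reward is the negative of this cost). All names (`COSTS`, `divergence`, `gradients`) are hypothetical. It computes the exact gradient of the divergence and compares it with the gradient obtained when the per-sample cost is treated as a constant, which is what a stop-gradient does.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Per-sample cost c(t) and its derivative c'(t), with t = p(x) / q(x),
# chosen so that E_{x~p}[c(t)] equals the divergence (assumed convention).
COSTS = {
    "reverse_kl": (lambda t: math.log(t), lambda t: 1.0 / t),  # KL(p || q)
    "chi_square": (lambda t: t - 1.0, lambda t: 1.0),          # sum(p^2 / q) - 1
}

def divergence(logits, q, name):
    """Exact divergence E_{x~p}[c(p(x)/q(x))] for the toy categorical student."""
    cost, _ = COSTS[name]
    p = softmax(logits)
    return sum(pi * cost(pi / qi) for pi, qi in zip(p, q))

def gradients(logits, q, name):
    """Return (full, stopped) gradients of the divergence w.r.t. the logits.

    stopped: score-function term only, i.e. the cost is held constant.
    full:    score-function term plus the term from differentiating the cost.
    """
    cost, dcost = COSTS[name]
    p = softmax(logits)
    n = len(p)
    full, stopped = [], []
    for k in range(n):
        score_term, cost_term = 0.0, 0.0
        for x in range(n):
            # d p(x) / d logit_k for a softmax
            dp = p[x] * ((1.0 if x == k else 0.0) - p[k])
            t = p[x] / q[x]
            score_term += dp * cost(t)
            cost_term += p[x] * dcost(t) * dp / q[x]
        stopped.append(score_term)
        full.append(score_term + cost_term)
    return full, stopped

if __name__ == "__main__":
    teacher = [0.2, 0.5, 0.3]
    student_logits = [0.5, -0.2, 0.1]
    for name in COSTS:
        full, stopped = gradients(student_logits, teacher, name)
        gap = max(abs(a - b) for a, b in zip(full, stopped))
        print(name, "max |full - stopped| =", gap)
```

In this toy setting the extra term cancels in expectation for reverse KL, so the stop-gradient changes nothing there, while for the chi-square cost the two gradients differ. This is one simple way to see why a stop-gradient that is harmless for one divergence need not be harmless for a general one.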
