DiffusionOPD: A Unified Perspective of On-Policy Distillation in Diffusion Models

Junqiu Yu; Kaixun Jiang; Pandeng Li; Quanhao Li; Ruihang Chu; Shiwei Zhang; Yujie Wei; Yu Liu; Zhen Xing; Zuxuan Wu

arxiv: 2605.15055 · v1 · pith:6YDBO5B5new · submitted 2026-05-14 · 💻 cs.LG · cs.CV



keywords diffusionopd models optimization cascade diffusion distillation multi-task policy


Reinforcement learning has emerged as a powerful tool for improving diffusion-based text-to-image models, but existing methods are largely limited to single-task optimization. Extending RL to multiple tasks is challenging: joint optimization suffers from cross-task interference and imbalance, while cascade RL is cumbersome and prone to catastrophic forgetting. We propose DiffusionOPD, a new multi-task training paradigm for diffusion models based on Online Policy Distillation (OPD). DiffusionOPD first trains task-specific teachers independently, then distills their capabilities into a unified student along the student's own rollout trajectories. This decouples single-task exploration from multi-task integration and avoids the optimization burden of solving all tasks jointly from scratch. Theoretically, we lift the OPD framework from discrete tokens to continuous-state Markov processes, deriving a closed-form per-step KL objective that unifies both stochastic SDE and deterministic ODE refinement via mean-matching. We formally and empirically demonstrate that this analytic gradient provides lower variance and better generality than conventional PPO-style policy gradients. Extensive experiments show that DiffusionOPD consistently surpasses both multi-reward RL and cascade RL baselines in training efficiency and final performance, while achieving state-of-the-art results on all evaluated benchmarks.
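The abstract does not spell out the per-step objective, so the following is only a minimal sketch of why a closed-form KL can reduce to mean-matching. It assumes (this is an assumption, not something stated on this page) that the student and teacher transitions at a given step are isotropic Gaussians sharing the same standard deviation sigma, in which case the standard identity KL = ||mu_student - mu_teacher||^2 / (2 sigma^2) holds. The function names and the sample values are hypothetical and chosen for illustration; the Monte Carlo estimator is included only to check the identity numerically.

```python
import random


def closed_form_kl(mu_student, mu_teacher, sigma):
    """KL(N(mu_student, sigma^2 I) || N(mu_teacher, sigma^2 I)).

    With a shared, isotropic covariance the KL divergence collapses to a
    scaled squared distance between the two means (mean-matching).
    """
    sq_dist = sum((s - t) ** 2 for s, t in zip(mu_student, mu_teacher))
    return sq_dist / (2.0 * sigma ** 2)


def monte_carlo_kl(mu_student, mu_teacher, sigma, n_samples=50_000, seed=0):
    """Sample-based estimate of the same KL, using draws from the student.

    The Gaussian normalizing constants cancel because sigma is shared, so
    only the quadratic terms of the two log-densities are needed.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = [rng.gauss(s, sigma) for s in mu_student]
        log_p = -sum((xi - s) ** 2 for xi, s in zip(x, mu_student)) / (2.0 * sigma ** 2)
        log_q = -sum((xi - t) ** 2 for xi, t in zip(x, mu_teacher)) / (2.0 * sigma ** 2)
        total += log_p - log_q
    return total / n_samples


# Hypothetical per-step means for a 2-dimensional state.
exact = closed_form_kl([1.0, 2.0], [0.0, 0.0], sigma=0.5)      # 10.0
estimate = monte_carlo_kl([1.0, 2.0], [0.0, 0.0], sigma=0.5)   # noisy, close to 10.0
```

Under this assumption the closed form needs no sampling at all, whereas the sample-based estimate fluctuates around the same value, which gives some intuition for the lower-variance claim. How the paper treats the deterministic ODE case is not described here and is not covered by this sketch.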


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

DanceOPD: On-Policy Generative Field Distillation
cs.CV 2026-06 unverdicted novelty 5.0

DanceOPD routes samples across capability velocity fields in flow-matching models and trains via on-policy student-induced states to compose T2I, local editing, and global editing without mutual interference.

MaineCoon: Pursuing A Real-Time Audio-Visual Social World Model
cs.CV 2026-06 unverdicted novelty 5.0

MaineCoon is presented as the first 22B-parameter real-time streaming audio-visual autoregressive model optimized for social-interactive applications, using novel training techniques and an agentic inference framework.