Co-Evolving Policy Distillation

2026 · cs.LG · arXiv 2604.27083

2 Pith papers cite this work. Polarity classification is still indexing.

abstract

Reinforcement learning with verifiable rewards (RLVR) and on-policy distillation (OPD) have become standard paradigms for post-training. We provide a unified analysis of how the two paradigms consolidate multiple expert capabilities into a single model and find that each loses capability in a different way: mixed RLVR incurs an inter-capability divergence cost, while the pipeline of first training experts and then performing OPD avoids divergence but fails to fully absorb teacher capabilities because of large gaps in behavioral patterns between teacher and student. We propose Co-Evolving Policy Distillation (CoPD), which trains experts in parallel and applies OPD during each expert's ongoing RLVR training rather than after expert training is complete, with the experts serving as mutual teachers (making OPD bidirectional) so that they co-evolve. This keeps behavioral patterns more consistent across experts while preserving sufficient complementary knowledge throughout training. Experiments show that CoPD integrates text, image, and video reasoning capabilities into a single model, significantly outperforming strong baselines such as mixed RLVR and MOPD and even surpassing domain-specific experts. The parallel training pattern that CoPD offers may inspire a new paradigm for scaling training.
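
The abstract describes the training pattern only in words. As a rough illustration, and not the authors' implementation, the toy sketch below shows what "bidirectional OPD during ongoing RLVR" could look like in the simplest possible setting. Everything in it is an assumption made for the example: two experts are softmax policies over four discrete behaviors, each domain has a hypothetical verifier reward vector, expectations are computed exactly instead of from sampled rollouts, and the names `expert_update`, `train`, `beta`, and `lr` are invented here. Each expert ascends its own expected verifiable reward while being pulled, via a reverse KL term evaluated under its own distribution, toward a frozen snapshot of its peer.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(p, q):
    """KL(p || q): the expectation is taken under the student's own policy p."""
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))


def expert_update(logits, peer_probs, reward, lr, beta):
    """One exact-gradient step for one expert (toy, hypothetical).

    Ascends expected reward minus beta * KL(student || peer); the peer's
    probabilities are treated as a fixed teacher snapshot for this step.
    """
    p = softmax(logits)
    expected_reward = sum(pi * ri for pi, ri in zip(p, reward))
    kl = reverse_kl(p, peer_probs)
    new_logits = []
    for z, pi, qi, ri in zip(logits, p, peer_probs, reward):
        # d E[r] / d z_k = p_k * (r_k - E[r])
        grad_reward = pi * (ri - expected_reward)
        # d KL(p || q) / d z_k = p_k * (log p_k - log q_k - KL)
        grad_kl = pi * (math.log(pi) - math.log(qi) - kl)
        new_logits.append(z + lr * (grad_reward - beta * grad_kl))
    return new_logits


def train(steps, beta, lr=0.5):
    """Train two experts in parallel; beta=0 disables the mutual distillation."""
    # Assumed verifiers: domain A accepts behaviors 0 and 1, domain B accepts 1 and 2.
    reward_a = [1.0, 1.0, 0.0, 0.0]
    reward_b = [0.0, 1.0, 1.0, 0.0]
    logits_a = [0.0] * 4
    logits_b = [0.0] * 4
    for _ in range(steps):
        # Snapshot both policies first so each expert teaches the other symmetrically.
        p_a, p_b = softmax(logits_a), softmax(logits_b)
        logits_a, logits_b = (
            expert_update(logits_a, p_b, reward_a, lr, beta),
            expert_update(logits_b, p_a, reward_b, lr, beta),
        )
    return softmax(logits_a), softmax(logits_b)
```

Calling `train(300, beta=0.5)` and `train(300, beta=0.0)` and comparing `reverse_kl` between the two returned policies gives a feel for the abstract's claim about behavioral consistency: in this toy, the coupled experts end up closer to each other than the independently trained ones. It says nothing about the paper's actual losses, models, or results.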

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Learning to Solve, Forgetting to Retain: Correct-Set Turnover in RLVR

cs.LG · 2026-06-02 · unverdicted · novelty 6.0

RLVR exhibits correct-set turnover where solved problems regress during training, and a periodic review mechanism exploiting a repair-window principle improves retention and performance over baselines.

D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models

cs.CV · 2026-05-06 · unverdicted · novelty 6.0 · 3 refs

D-OPSD formulates supervised fine-tuning of step-distilled diffusion models as on-policy self-distillation by having the model act as both teacher (with multimodal context) and student (with text-only context) on its own roll-outs.
