On-policy self-distillation with teacher flip rate yields better safety-reasoning tradeoffs than off-policy or external-teacher baselines across model scales.
4 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary: background (1)
citation-polarity summary
years: 2026 (4)
roles: background (1)
polarities: background (1)
representative citing papers
COPSD improves mathematical reasoning in low-resource languages by having LLMs self-distill from their own high-resource English behavior via token-level divergence on rollouts with privileged crosslingual context.
citing papers explorer
- Reducing the Safety Tax in LLM Safety Alignment with On-Policy Self-Distillation
  On-policy self-distillation with teacher flip rate yields better safety-reasoning tradeoffs than off-policy or external-teacher baselines across model scales.
- Crosslingual On-Policy Self-Distillation for Multilingual Reasoning
  COPSD improves mathematical reasoning in low-resource languages by having LLMs self-distill from their own high-resource English behavior via token-level divergence on rollouts with privileged crosslingual context.
- Multi-Rollout On-Policy Distillation via Peer Successes and Failures
- OGLS-SD: On-Policy Self-Distillation with Outcome-Guided Logit Steering for LLM Reasoning