ZEDA injects zero-output experts and uses two-stage self-distillation to adapt post-trained MoE models into dynamic ones that skip over half the experts, yielding 1.2x inference speedup with small accuracy drops.
Sequence-level knowledge distillation
3 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.LG 3years
2026 3verdicts
UNVERDICTED 3representative citing papers
Emergent and subliminal misalignment in LLMs arise from data structure interactions and transfer via benign distillation data, with stronger effects under shared functional structure and on-policy settings.
On-policy distillation works when student and teacher models share thinking patterns and the teacher adds new capabilities, with success tied to alignment on a small set of high-probability tokens.
citing papers explorer
-
Post-Trained MoE Can Skip Half Experts via Self-Distillation
ZEDA injects zero-output experts and uses two-stage self-distillation to adapt post-trained MoE models into dynamic ones that skip over half the experts, yielding 1.2x inference speedup with small accuracy drops.
-
Emergent and Subliminal Misalignment Through the Lens of Data-Mediated Transfer
Emergent and subliminal misalignment in LLMs arise from data structure interactions and transfer via benign distillation data, with stronger effects under shared functional structure and on-policy settings.
-
Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe
On-policy distillation works when student and teacher models share thinking patterns and the teacher adds new capabilities, with success tied to alignment on a small set of high-probability tokens.