Shared MoE Cost Decomposition

The MoE FFN and router costs have the same form in both stages; the only difference is the number of tokens processed in the current forward pass.

· 2023

1 Pith paper cites this work. Polarity classification is still indexing.

representative citing papers

Post-Trained MoE Can Skip Half Experts via Self-Distillation

cs.LG · 2026-05-18 · unverdicted · novelty 6.0 · ref 44

ZEDA injects zero-output experts and uses two-stage self-distillation to adapt post-trained MoE models into dynamic ones that skip over half the experts, yielding 1.2x inference speedup with small accuracy drops.
