Router fine-tuning that biases MoE models toward short-term expert reuse improves cache locality, delivering 26% higher reuse and 1.77-1.99x decode speedup under memory constraints without inference-time overhead.
Title resolution pending
1 Pith paper cites this work. Polarity classification is still indexing.
fields: cs.LG (1)
years: 2026 (1)
verdicts: UNVERDICTED (1)
representative citing papers
- ReMoE: Boosting Expert Reuse through Router Fine-Tuning in Memory-Constrained MoE LLM Inference