Efficient Memory Management for Large Language Model Serving with PagedAttention
1 Pith paper cites this work. Polarity classification is still being indexed.
fields: cs.AI (1)
years: 2026 (1)
verdicts: UNVERDICTED (1)
representative citing papers
SpecMoE: A Fast and Efficient Mixture-of-Experts Inference via Self-Assisted Speculative Decoding
SpecMoE uses self-assisted speculative decoding on MoE models to deliver up to 4.3x higher inference throughput and lower memory and interconnect bandwidth use without retraining.