Utility-driven speculative decoding for mixture-of-experts

Anish Saxena, Po-An Tsai, Hritvik Taneja, Aamer Jaleel, Moinuddin Qureshi · 2025 · arXiv 2506.20675

3 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

dataset 1

citation-polarity summary

use dataset 1

representative citing papers

Making Every Verified Token Count: Adaptive Verification for MoE Speculative Decoding

cs.CL · 2026-05-01 · unverdicted · novelty 6.0

EVICT adaptively truncates draft trees in MoE speculative decoding by combining drafter signals with profiled costs to retain only cost-effective prefixes, delivering up to 2.35x speedup over autoregressive decoding.

Rethinking Network Topologies for Cost-Effective Mixture-of-Experts LLM Serving

cs.NI · 2026-04-30 · unverdicted · novelty 6.0

Switchless topologies such as 3D full-mesh are 20.6-56.2% more cost-effective than scale-up networks for MoE LLM serving, with current link bandwidths over-provisioned by up to 27%.

ELMoE-3D: Leveraging Intrinsic Elasticity of MoE for Hybrid-Bonding-Enabled Self-Speculative Decoding in On-Premises Serving

cs.LG · 2026-04-16 · unverdicted · novelty 6.0

ELMoE-3D achieves 6.6x average speedup and 4.4x energy efficiency gain for MoE serving on 3D hardware by scaling expert and bit elasticity for elastic self-speculative decoding.

citing papers explorer

Showing 3 of 3 citing papers.

Making Every Verified Token Count: Adaptive Verification for MoE Speculative Decoding · cs.CL · 2026-05-01 · unverdicted · none · ref 27
Rethinking Network Topologies for Cost-Effective Mixture-of-Experts LLM Serving · cs.NI · 2026-04-30 · unverdicted · none · ref 50
ELMoE-3D: Leveraging Intrinsic Elasticity of MoE for Hybrid-Bonding-Enabled Self-Speculative Decoding in On-Premises Serving · cs.LG · 2026-04-16 · unverdicted · none · ref 58
