Fisher-MoE prunes sparse intermediate dimensions in MoE FFNs ranked by Fisher importance, delivering 50% compression that preserves capability while cutting memory ~45% and raising throughput 21%.
hub
org/CorpusID:235755472
13 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
roles
background 1polarities
background 1representative citing papers
EvoESAP uses evolutionary search guided by a speculative-decoding-inspired ESAP metric to discover non-uniform layer-wise sparsity allocations for MoE expert pruning, improving generation accuracy up to 19.6% at 50% sparsity.
DuoServe-MoE decouples prefill and decode phases in MoE LLM inference with a two-stream CUDA pipeline for prefill and an offline-trained predictor for decode, reporting up to 5.34x TTFT and 7.55x end-to-end latency gains.
SHAPE applies coalition-aware Shapley values to prune experts in MoE LLMs, retaining competitive accuracy at 20-40% pruning rates on Qwen3-30B-A3B, GPT-OSS-20B, and DeepSeek-V2-Lite without retraining.
Task-aware expert grouping derived from family-specific co-activation traces cuts average communication cost 31.39% versus task-agnostic baselines in multi-task MoE inference while maintaining Jain fairness near 1.0.
dMoE aggregates token expert distributions to block level in dLLMs, cutting unique experts from 69.5 to 14.6, memory by 76-80%, and latency by 1.14-1.66x while retaining 99.11% performance.
A systematic MoE-to-dense conversion via expert scoring, grouping, and distillation yields +6.3 pp average accuracy over dense-to-dense pruning at matched parameter count on tested models.
REAM merges experts in MoE LLMs rather than pruning them, often matching uncompressed performance by tuning the mix of calibration data.
FluxMoE decouples MoE expert weights from persistent GPU residency via on-demand paging, achieving up to 3x throughput gains over vLLM in memory-constrained inference without accuracy loss.
MaskPro learns categorical distributions over groups of M weights to generate exact (N:M) sparsity via N-way sampling without replacement and stabilizes training with a moving average tracker of loss residuals.
Lynx exploits training-induced batch-level expert activation skews via AffinityBinning to reduce invoked experts per batch, delivering up to 1.30x throughput with under 1% accuracy loss across four model families.
CAEE reduces MoE inference latency 8-18% on 671B DeepSeek-R1 by cost-aware expert pruning and low-overhead compensation while keeping accuracy drop under 1%.
GRAPE is a global redundancy-aware pruning strategy for sparse MoEs that dynamically allocates pruning budgets across layers and improves average accuracy by 1.40% over the best local baseline across tested models and settings.
citing papers explorer
-
EvoESAP: Non-Uniform Expert Pruning for Sparse MoE
EvoESAP uses evolutionary search guided by a speculative-decoding-inspired ESAP metric to discover non-uniform layer-wise sparsity allocations for MoE expert pruning, improving generation accuracy up to 19.6% at 50% sparsity.