hub

org/CorpusID:235755472

Task-specific expert pruning for sparse mixture-of-experts , author= · 2022 · arXiv 2206.00277

13 Pith papers cite this work. Polarity classification is still indexing.

13 Pith papers citing it

read on arXiv browse 13 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Less is MoE: Trimming Experts in Domain-Specialist Language Models

cs.LG · 2026-06-04 · unverdicted · novelty 7.0

Fisher-MoE prunes sparse intermediate dimensions in MoE FFNs ranked by Fisher importance, delivering 50% compression that preserves capability while cutting memory ~45% and raising throughput 21%.

EvoESAP: Non-Uniform Expert Pruning for Sparse MoE

cs.LG · 2026-03-06 · conditional · novelty 7.0

EvoESAP uses evolutionary search guided by a speculative-decoding-inspired ESAP metric to discover non-uniform layer-wise sparsity allocations for MoE expert pruning, improving generation accuracy up to 19.6% at 50% sparsity.

DuoServe-MoE: Dual-Phase Expert Prefetch and Caching for LLM Inference QoS Assurance

cs.DC · 2025-09-09 · unverdicted · novelty 7.0

DuoServe-MoE decouples prefill and decode phases in MoE LLM inference with a two-stream CUDA pipeline for prefill and an offline-trained predictor for decode, reporting up to 5.34x TTFT and 7.55x end-to-end latency gains.

SHAPE: Coalition-Aware Expert Pruning for Sparse Mixture-of-Experts LLMs

cs.LG · 2026-06-03 · unverdicted · novelty 6.0

SHAPE applies coalition-aware Shapley values to prune experts in MoE LLMs, retaining competitive accuracy at 20-40% pruning rates on Qwen3-30B-A3B, GPT-OSS-20B, and DeepSeek-V2-Lite without retraining.

Beyond Task-Agnostic: Task-Aware Grouping for Communication-Efficient Multi-Task MoE Inference

cs.LG · 2026-05-31 · unverdicted · novelty 6.0

Task-aware expert grouping derived from family-specific co-activation traces cuts average communication cost 31.39% versus task-agnostic baselines in multi-task MoE inference while maintaining Jain fairness near 1.0.

dMoE: dLLMs with Learnable Block Experts

cs.CL · 2026-05-29 · unverdicted · novelty 6.0

dMoE aggregates token expert distributions to block level in dLLMs, cutting unique experts from 69.5 to 14.6, memory by 76-80%, and latency by 1.14-1.66x while retaining 99.11% performance.

Pruning and Distilling Mixture-of-Experts into Dense Language Models

cs.CL · 2026-05-27 · unverdicted · novelty 6.0

A systematic MoE-to-dense conversion via expert scoring, grouping, and distillation yields +6.3 pp average accuracy over dense-to-dense pruning at matched parameter count on tested models.

REAM: Merging Improves Pruning of Experts in LLMs

cs.AI · 2026-04-06 · unverdicted · novelty 6.0

REAM merges experts in MoE LLMs rather than pruning them, often matching uncompressed performance by tuning the mix of calibration data.

FluxMoE: Decoupling Expert Residency for High-Performance MoE Serving

cs.LG · 2026-04-03 · unverdicted · novelty 6.0

FluxMoE decouples MoE expert weights from persistent GPU residency via on-demand paging, achieving up to 3x throughput gains over vLLM in memory-constrained inference without accuracy loss.

MaskPro: Linear-Space Probabilistic Learning for Strict (N:M)-Sparsity on LLMs

cs.LG · 2025-06-15 · unverdicted · novelty 6.0

MaskPro learns categorical distributions over groups of M weights to generate exact (N:M) sparsity via N-way sampling without replacement and stabilizes training with a moving average tracker of loss residuals.

Lynx: Enabling Efficient MoE Inference through Dynamic Batch-Aware Expert Selection

cs.LG · 2024-11-13 · unverdicted · novelty 6.0

Lynx exploits training-induced batch-level expert activation skews via AffinityBinning to reduce invoked experts per batch, delivering up to 1.30x throughput with under 1% accuracy loss across four model families.

Beyond Uniform Experts: Cost-Aware Expert Execution for Efficient Multi-Device MoE Inference

cs.DC · 2026-06-29 · unverdicted · novelty 5.0

CAEE reduces MoE inference latency 8-18% on 671B DeepSeek-R1 by cost-aware expert pruning and low-overhead compensation while keeping accuracy drop under 1%.

Does a Global Perspective Help Prune Sparse MoEs Elegantly?

cs.CL · 2026-04-08 · unverdicted · novelty 5.0

GRAPE is a global redundancy-aware pruning strategy for sparse MoEs that dynamically allocates pruning budgets across layers and improves average accuracy by 1.40% over the best local baseline across tested models and settings.

citing papers explorer

Showing 1 of 1 citing paper after filters.

EvoESAP: Non-Uniform Expert Pruning for Sparse MoE cs.LG · 2026-03-06 · conditional · none · ref 5
EvoESAP uses evolutionary search guided by a speculative-decoding-inspired ESAP metric to discover non-uniform layer-wise sparsity allocations for MoE expert pruning, improving generation accuracy up to 19.6% at 50% sparsity.

org/CorpusID:235755472

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer