Not all experts are equal: Efficient expert pruning and skipping for mixture-of-experts large language models

Lu, X · 2024 · arXiv 2402.14800

Canonical reference. 100% of citing Pith papers cite this work as background.

15 Pith papers citing it

Background 100% of classified citations

citation-role summary

background 5

citation-polarity summary

background 5

representative citing papers

Preserving Long-Tailed Expert Information in Mixture-of-Experts Tuning

cs.LG · 2026-04-24 · unverdicted · novelty 7.0

A new SFT framework for MoE models combines bias-driven sparsification with gated condenser experts to retain long-tailed expert information, outperforming DenseMixer and ESFT by over 2.5% on math reasoning and commonsense QA benchmarks.

Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts

cs.LG · 2026-04-21 · unverdicted · novelty 7.0 · 2 refs

Expert upcycling duplicates experts in an existing MoE checkpoint and continues pre-training to match fixed-size baseline performance with 32% less compute.

EvoESAP: Non-Uniform Expert Pruning for Sparse MoE

cs.LG · 2026-03-06 · conditional · novelty 7.0

EvoESAP uses evolutionary search guided by a speculative-decoding-inspired ESAP metric to discover non-uniform layer-wise sparsity allocations for MoE expert pruning, improving generation accuracy up to 19.6% at 50% sparsity.

GEMQ: Global Expert-Level Mixed-Precision Quantization for MoE LLMs

cs.LG · 2026-05-21 · unverdicted · novelty 6.0

GEMQ applies global LP-based expert importance estimation and router fine-tuning within progressive quantization to cut memory and speed inference in MoE LLMs with little accuracy loss.

Temporally Extended Mixture-of-Experts Models

cs.LG · 2026-04-22 · unverdicted · novelty 6.0

Temporally extended MoE layers using the option-critic framework with deliberation costs cut switching rates below 5% while retaining most capability on MATH, MMLU, and MMMLU.

MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference

cs.LG · 2026-04-19 · unverdicted · novelty 6.0 · 2 refs

MACS improves inference speed in multimodal MoE models by entropy-weighted balancing of visual tokens and real-time modality-adaptive expert capacity allocation.

FluxMoE: Decoupling Expert Residency for High-Performance MoE Serving

cs.LG · 2026-04-03 · unverdicted · novelty 6.0

FluxMoE decouples MoE expert weights from persistent GPU residency via on-demand paging, achieving up to 3x throughput gains over vLLM in memory-constrained inference without accuracy loss.

Analytical FFN-to-MoE Restructuring via Activation Pattern Analysis

cs.LG · 2025-02-06 · unverdicted · novelty 6.0

An analytical post-training method restructures FFNs into MoE by partitioning neurons based on activation patterns and building a router from statistics, achieving 1.17x speedup with minimal resources.

Condense, Don't Just Prune: Enhancing Efficiency and Performance in MoE Layer Pruning

cs.LG · 2024-11-26 · unverdicted · novelty 6.0

CD-MoE condenses fine-grained MoE layers with shared experts into dense layers, retaining 90% accuracy with 27.5% memory cut and 1.26x speedup on DeepSeekMoE-16B, recovering 98% via brief fine-tuning.

Lynx: Enabling Efficient MoE Inference through Dynamic Batch-Aware Expert Selection

cs.LG · 2024-11-13 · unverdicted · novelty 6.0

Lynx exploits training-induced batch-level expert activation skews via AffinityBinning to reduce invoked experts per batch, delivering up to 1.30x throughput with under 1% accuracy loss across four model families.

When Does Sparse MoE Help in Vision? The Role of Backbone Compute Leverage in Sparse Routing

cs.CV · 2026-05-15 · unverdicted · novelty 5.0

Sparse MoE vision models show positive accuracy gaps only when routing a substantial compute fraction ρ and using k≥2 experts at large scale; batch-axis dispatch is identified as a key failure mode.

SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training

cs.LG · 2026-05-09 · unverdicted · novelty 5.0 · 2 refs

Pruning pretrained MoE models outperforms training from scratch under fixed budget, different expert compression methods converge after continued training, and progressive pruning plus multi-token KD improves the final 23A2B model.

Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models

cs.CL · 2026-01-20 · unverdicted · novelty 5.0

The survey organizes mechanistic interpretability techniques into a Locate-Steer-Improve framework to enable actionable improvements in LLM alignment, capability, and efficiency.

LayerScope: Predictive Cross-Layer Scheduling for Efficient Multi-Batch MoE Inference on Legacy Servers

cs.LG · 2025-09-28 · unverdicted · novelty 4.0

PreScope combines a layer-aware activation predictor, cross-layer prefetch scheduling, and asynchronous I/O to deliver 141% higher throughput and 74.6% lower latency for MoE inference on legacy hardware.

A Survey on Efficient Inference for Large Language Models

cs.CL · 2024-04-22 · accept · novelty 3.0

The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.

citing papers explorer

Showing 15 of 15 citing papers.

Preserving Long-Tailed Expert Information in Mixture-of-Experts Tuning cs.LG · 2026-04-24 · unverdicted · none · ref 19
Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts cs.LG · 2026-04-21 · unverdicted · none · ref 33 · 2 links
EvoESAP: Non-Uniform Expert Pruning for Sparse MoE cs.LG · 2026-03-06 · conditional · none · ref 42
GEMQ: Global Expert-Level Mixed-Precision Quantization for MoE LLMs cs.LG · 2026-05-21 · unverdicted · none · ref 19
Temporally Extended Mixture-of-Experts Models cs.LG · 2026-04-22 · unverdicted · none · ref 27
MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference cs.LG · 2026-04-19 · unverdicted · none · ref 3 · 2 links
FluxMoE: Decoupling Expert Residency for High-Performance MoE Serving cs.LG · 2026-04-03 · unverdicted · none · ref 37
Analytical FFN-to-MoE Restructuring via Activation Pattern Analysis cs.LG · 2025-02-06 · unverdicted · none · ref 24
Condense, Don't Just Prune: Enhancing Efficiency and Performance in MoE Layer Pruning cs.LG · 2024-11-26 · unverdicted · none · ref 22
Lynx: Enabling Efficient MoE Inference through Dynamic Batch-Aware Expert Selection cs.LG · 2024-11-13 · unverdicted · none · ref 15
When Does Sparse MoE Help in Vision? The Role of Backbone Compute Leverage in Sparse Routing cs.CV · 2026-05-15 · unverdicted · none · ref 58
SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training cs.LG · 2026-05-09 · unverdicted · none · ref 46 · 2 links
Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models cs.CL · 2026-01-20 · unverdicted · none · ref 199
LayerScope: Predictive Cross-Layer Scheduling for Efficient Multi-Batch MoE Inference on Legacy Servers cs.LG · 2025-09-28 · unverdicted · none · ref 33
A Survey on Efficient Inference for Large Language Models cs.CL · 2024-04-22 · accept · none · ref 183
