
arxiv: 2410.06270 · v2 · pith:PYSQMLKLnew · submitted 2024-10-08 · 💻 cs.LG · cs.CL

Mixture Compressor for Mixture-of-Experts LLMs Gains More

keywords: tokens, activated, experts, moe-llms, dynamic, expert, only, performance

Mixture-of-Experts large language models (MoE-LLMs) mark a significant step forward for language models. However, they encounter two critical challenges in practice: 1) expert parameters lead to considerable memory consumption and loading latency; and 2) the currently activated experts are redundant, as many tokens may require only a single expert. Motivated by these issues, we investigate MoE-LLMs and make two key observations: a) different experts exhibit varying behaviors in activation reconstruction error, routing scores, and activation frequencies, highlighting their differing importance; and b) not all tokens are equally important: only a small subset is critical. Building on these insights, we propose MC, a training-free Mixture-Compressor for MoE-LLMs, which leverages the significance of both experts and tokens to achieve extreme compression. First, to mitigate storage and loading overheads, we introduce Pre-Loading Mixed-Precision Quantization, which formulates adaptive bit-width allocation as a Linear Programming problem whose objective function balances multiple factors reflecting the importance of each expert. Additionally, we develop Online Dynamic Pruning, which identifies important tokens to retain and dynamically selects the activated experts for the remaining tokens during inference, optimizing efficiency while maintaining performance. MC integrates static quantization and dynamic pruning to collaboratively achieve extreme compression for MoE-LLMs with less accuracy loss, ensuring an optimal trade-off between performance and efficiency. Extensive experiments confirm the effectiveness of our approach. For instance, at 2.54 bits, MC compresses 76.6% of the model with only a 3.8% average accuracy loss. During dynamic inference, we further reduce activated parameters by 15%, with a performance drop of less than 0.6%.
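The two ideas in the abstract, per-expert bit-width allocation under a budget and token-aware expert pruning, can be illustrated with a toy sketch. This is not the authors' implementation. The function names, the importance scores, the error proxy (error halving per extra bit), and the ratio threshold are all assumptions made for illustration, and the allocation is solved by brute-force enumeration rather than by a Linear Programming solver.

```python
from itertools import product


def allocate_bits(importance, bit_choices=(1, 2, 3), avg_bits=2.0):
    """Choose one bit-width per expert under an average-bit budget.

    Toy stand-in for the LP formulation: enumerates every assignment,
    so it is only usable for a handful of experts.
    """
    budget = avg_bits * len(importance)
    best, best_cost = None, float("inf")
    for bits in product(bit_choices, repeat=len(importance)):
        if sum(bits) > budget:
            continue  # violates the average bit-width constraint
        # Hypothetical error proxy: quantization error halves per extra bit,
        # weighted by the (made-up) importance of each expert.
        cost = sum(w * 2.0 ** (-b) for w, b in zip(importance, bits))
        if cost < best_cost:
            best, best_cost = bits, cost
    return best


def prune_experts(routing_weights, token_is_important, ratio_threshold=0.5):
    """Return indices of experts to keep for one token.

    routing_weights: top-k routing weights, sorted in descending order.
    Important tokens keep all their experts; other tokens drop any expert
    whose weight is small relative to the top-1 weight (assumed criterion).
    """
    if token_is_important:
        return list(range(len(routing_weights)))
    top = routing_weights[0]
    return [i for i, w in enumerate(routing_weights) if w >= ratio_threshold * top]


if __name__ == "__main__":
    # Four experts with illustrative importance scores, 2-bit average budget.
    print(allocate_bits([4.0, 1.0, 1.0, 0.5]))
    # An unimportant token whose second expert contributes little.
    print(prune_experts([0.7, 0.3], token_is_important=False))
```

In this toy setting the most important expert receives the widest bit-width and the least important the narrowest, while an unimportant token with a dominant top-1 routing weight is served by a single expert.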



Forward citations

Cited by 11 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    LongLive-2.0 delivers an NVFP4 parallel infrastructure that enables direct training of long multi-shot autoregressive diffusion video models and achieves up to 2.15x training and 1.84x inference speedups on Blackwell ...

  2. Geometric Asymmetry in MoE Specialization: Functional Decorrelation and Representational Overlap

    cs.LG 2026-05 unverdicted novelty 7.0

    MoE experts in pretrained Transformers exhibit functional decorrelation with near-zero Jacobian alignment yet occupy partially overlapping representation subspaces, with routing sparsity modulating the geometry.

  3. Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond

    cs.AI 2026-04 unverdicted novelty 7.0

    Proposes a levels x laws taxonomy for world models in AI agents, defining L1-L3 capabilities across physical, digital, social, and scientific regimes while reviewing over 400 works to outline a roadmap for advanced ag...

  4. EvoESAP: Non-Uniform Expert Pruning for Sparse MoE

    cs.LG 2026-03 conditional novelty 7.0

    EvoESAP uses evolutionary search guided by a speculative-decoding-inspired ESAP metric to discover non-uniform layer-wise sparsity allocations for MoE expert pruning, improving generation accuracy up to 19.6% at 50% sparsity.

  5. Attribution-Guided and Coverage-Maximized Pruning for Structural MoE Compression

    cs.LG 2026-06 unverdicted novelty 6.0

    A structural pruning framework for MoE models that solves channel-score coverage maximization via attribution approximation, preserving accuracy at 50% or 25% pruning plus 4-bit quantization on DeepSeek and Qwen models.

  6. GEMQ: Global Expert-Level Mixed-Precision Quantization for MoE LLMs

    cs.LG 2026-05 unverdicted novelty 6.0

    GEMQ applies global LP-based expert importance estimation and router fine-tuning within progressive quantization to cut memory and speed inference in MoE LLMs with little accuracy loss.

  7. MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference

    cs.LG 2026-04 unverdicted novelty 6.0

    MACS improves inference speed in multimodal MoE models by entropy-weighted balancing of visual tokens and real-time modality-adaptive expert capacity allocation.

  8. MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference

    cs.LG 2026-04 unverdicted novelty 6.0

    MACS improves MoE MLLM inference efficiency via entropy-weighted token loads and dynamic modality-adaptive expert capacity allocation.

  9. FluxMoE: Decoupling Expert Residency for High-Performance MoE Serving

    cs.LG 2026-04 unverdicted novelty 6.0

    FluxMoE decouples MoE expert weights from persistent GPU residency via on-demand paging, achieving up to 3x throughput gains over vLLM in memory-constrained inference without accuracy loss.

  10. Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond

    cs.AI 2026-04 conditional novelty 4.0

    A survey proposing a three-level capability taxonomy (L1 Predictor, L2 Simulator, L3 Evolver) for world models across physical, digital, social, and scientific domains.

  11. OmniPlan: An Adaptive Framework for Timely and Near-Optimal Network Planning Optimization

    cs.NI 2026-06 unverdicted novelty 3.0

    OmniPlan combines LLM intent parsing, mixture-of-experts selection among MIP/heuristic/DRL solvers, and DRL weight tuning to deliver timely near-optimal network planning, with reported latency reductions up to 97.8% o...