Analytical FFN-to-MoE Restructuring via Activation Pattern Analysis
read the original abstract
Scaling large language models (LLMs) improves performance but significantly increases inference costs, with feed-forward networks (FFNs) consuming the majority of computational resources. While Mixture-of-Experts (MoE) architectures can reduce this cost through sparse activation, restructuring existing dense models into MoEs typically requires extensive retraining on hundreds of billions of tokens. We propose an analytical post-training framework that rapidly restructures FFNs into sparse MoE architectures using only a small calibration dataset. The method analyzes neuron activation patterns to partition neurons into always-active shared experts and conditionally activated routed experts, then constructs a router analytically from representative neuron statistics, enabling immediate deployment or optional lightweight fine-tuning. This approach applies both to dense models and recursively to existing MoE models for hierarchical sparsity. Experiments demonstrate up to $1.17\times$ speedup in compute-bound scenarios with only minutes of processing and 2k-sample fine-tuning, outperforming methods requiring orders of magnitude more resources.
This paper has not been read by Pith yet.
Forward citations
Cited by 1 Pith paper
-
GEM: GPU-Variability-Aware Expert to GPU Mapping for MoE Systems
GEM is a GPU-variability-aware expert-to-GPU mapping framework for MoE inference that classifies experts as consistent or temporal and places them to equalize finish times across heterogeneous GPUs.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.