Analytical FFN-to-MoE Restructuring via Activation Pattern Analysis

Bei Yu; Hui-Ling Zhen; Lancheng Zou; Mingxuan Yuan; Sinno Jialin Pan; Wulong Liu; Xianzhi Yu; Zehua Pei

arXiv: 2502.04416 · v3 · submitted 2025-02-06 · cs.LG · cs.AI

Zehua Pei, Hui-Ling Zhen, Lancheng Zou, Xianzhi Yu, Wulong Liu, Sinno Jialin Pan, Mingxuan Yuan, Bei Yu

keywords: models, activation, analytical, architectures, dense, existing, experts, ffns

Scaling large language models (LLMs) improves performance but significantly increases inference costs, with feed-forward networks (FFNs) consuming the majority of computational resources. While Mixture-of-Experts (MoE) architectures can reduce this cost through sparse activation, restructuring existing dense models into MoEs typically requires extensive retraining on hundreds of billions of tokens. We propose an analytical post-training framework that rapidly restructures FFNs into sparse MoE architectures using only a small calibration dataset. The method analyzes neuron activation patterns to partition neurons into always-active shared experts and conditionally activated routed experts, then constructs a router analytically from representative neuron statistics, enabling immediate deployment or optional lightweight fine-tuning. This approach applies both to dense models and recursively to existing MoE models for hierarchical sparsity. Experiments demonstrate up to $1.17\times$ speedup in compute-bound scenarios with only minutes of processing and 2k-sample fine-tuning, outperforming methods requiring orders of magnitude more resources.
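
The partition-and-route idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation; it is a minimal illustration under several assumptions: the FFN uses a plain ReLU, "always-active" neurons are taken to be those with the highest firing rate on the calibration tokens, routed neurons are split into equal-size chunks as a stand-in for whatever grouping the paper uses, and the router simply reuses the input weights of one representative neuron per expert. The names `carve_ffn` and `moe_forward` are hypothetical.

```python
import numpy as np

def carve_ffn(W_in, X_calib, n_shared, n_experts):
    """Split FFN hidden neurons into one shared group and n_experts routed groups."""
    # Binary activation pattern of a ReLU FFN on the calibration tokens: (T, N).
    active = ((X_calib @ W_in) > 0).astype(float)
    rate = active.mean(axis=0)                    # per-neuron firing rate
    order = np.argsort(-rate, kind="stable")      # most frequently active first
    shared = order[:n_shared]                     # treated as always active
    # ASSUMPTION: equal-size chunks stand in for a proper clustering step.
    groups = np.array_split(order[n_shared:], n_experts)
    reps = []
    for g in groups:
        centroid = active[:, g].mean(axis=1, keepdims=True)
        dist = ((active[:, g] - centroid) ** 2).sum(axis=0)
        reps.append(g[np.argmin(dist)])           # neuron closest to the group mean
    # ASSUMPTION: router columns are the representatives' input weights.
    W_router = W_in[:, reps]                      # (d_model, n_experts)
    return shared, groups, W_router

def moe_forward(x, W_in, W_out, shared, groups, W_router, top_k):
    """Evaluate the shared neurons plus the top_k highest-scoring routed groups."""
    scores = x @ W_router                         # one score per routed expert
    chosen = np.argsort(-scores)[:top_k]
    idx = np.concatenate([shared] + [groups[e] for e in chosen])
    hidden = np.maximum(x @ W_in[:, idx], 0.0)    # only the selected neurons
    return hidden @ W_out[idx, :]

# Toy usage with random weights (illustrative only).
rng = np.random.default_rng(0)
d_model, n_hidden = 8, 32
W_in = rng.standard_normal((d_model, n_hidden))
W_out = rng.standard_normal((n_hidden, d_model))
X_calib = rng.standard_normal((64, d_model))
shared, groups, W_router = carve_ffn(W_in, X_calib, n_shared=8, n_experts=4)
y_sparse = moe_forward(X_calib[0], W_in, W_out, shared, groups, W_router, top_k=2)
```

With `top_k` equal to the number of experts, every neuron is evaluated and the sketch reproduces the dense FFN output; smaller values skip whole groups of neurons, which is where any compute saving would come from.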

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

GEM: GPU-Variability-Aware Expert to GPU Mapping for MoE Systems
cs.DC · 2026-05 · unverdicted · novelty 6.0

GEM is a GPU-variability-aware expert-to-GPU mapping framework for MoE inference that classifies experts as consistent or temporal and places them to equalize finish times across heterogeneous GPUs.