MoEfication: Transformer Feed-forward Layers are Mixtures of Experts

Zhengyan Zhang, Yankai Lin, Zhiyuan Liu, Peng Li, Maosong Sun, Jie Zhou · 2021 · arXiv 2110.01786

2 Pith papers cite this work. Polarity classification is still indexing.


representative citing papers

Mixture-of-Experts Can Surpass Dense LLMs Under Strictly Equal Resource

cs.CL · 2025-06-13 · conditional · novelty 6.0

MoE models with activation rates in an optimal region outperform dense LLMs of identical total parameter count, training compute, and data budget, with the optimal region consistent across scales.

Analytical FFN-to-MoE Restructuring via Activation Pattern Analysis

cs.LG · 2025-02-06 · unverdicted · novelty 6.0

An analytical post-training method restructures FFNs into MoE by partitioning neurons based on activation patterns and building a router from statistics, achieving 1.17x speedup with minimal resources.

citing papers explorer

Showing 2 of 2 citing papers.

Mixture-of-Experts Can Surpass Dense LLMs Under Strictly Equal Resource · cs.CL · 2025-06-13 · conditional · none · ref 47
Analytical FFN-to-MoE Restructuring via Activation Pattern Analysis · cs.LG · 2025-02-06 · unverdicted · none · ref 10
