Experts Weights Averaging: A New General Training Scheme for Vision Transformers

Peng Ye; Sheng Li; Tao Chen; Tong He; Wanli Ouyang; Xiaoshui Huang; Yongqi Huang

arxiv: 2308.06093 · v2 · pith:4FSKBH7Fnew · submitted 2023-08-11 · 💻 cs.CV · cs.LG

Yongqi Huang, Peng Ye, Xiaoshui Huang, Sheng Li, Tao Chen, Tong He, Wanli Ouyang

keywords: training, scheme, experts, ViTs, inference, visual, averaging, cost

Structural re-parameterization is a general training scheme for Convolutional Neural Networks (CNNs) that improves performance without increasing inference cost. As Vision Transformers (ViTs) gradually surpass CNNs in various visual tasks, one may ask whether a training scheme specifically for ViTs exists that can also improve performance without increasing inference cost. Recently, Mixture-of-Experts (MoE) has attracted increasing attention, as it can efficiently scale up the capacity of Transformers at a fixed cost through sparsely activated experts. Considering that MoE can also be viewed as a multi-branch structure, can we utilize MoE to implement a ViT training scheme similar to structural re-parameterization? In this paper, we answer these questions affirmatively with a new general training strategy for ViTs. Specifically, we decouple the training and inference phases of ViTs. During training, we replace some Feed-Forward Networks (FFNs) of the ViT with specially designed, more efficient MoEs that assign tokens to experts by random uniform partition, and we perform Experts Weights Averaging (EWA) on these MoEs at the end of each iteration. After training, we convert each MoE into an FFN by averaging its experts, transforming the model back into the original ViT for inference. We further provide a theoretical analysis to show why and how the scheme works. Comprehensive experiments across various 2D and 3D visual tasks, ViT architectures, and datasets validate the effectiveness and generalizability of the proposed training scheme. In addition, our training scheme can be applied to improve performance when fine-tuning ViTs. Last but not least, the proposed EWA technique can significantly improve the effectiveness of naive MoE on various small 2D visual datasets and 3D visual tasks.
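The three mechanisms named in the abstract (random uniform token partition, per-iteration Experts Weights Averaging, and the final merge of each MoE into one FFN) can be sketched in NumPy as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: all function names and the `share_rate` parameter are hypothetical, the exact form of the per-iteration averaging update is assumed (the abstract does not specify it), ReLU stands in for the FFN activation, and biases are omitted.

```python
import numpy as np

def random_uniform_partition(num_tokens, num_experts, rng):
    """Shuffle token indices and split them into near-equal groups, one per expert."""
    perm = rng.permutation(num_tokens)
    return np.array_split(perm, num_experts)

def moe_forward(tokens, w1, w2, rng):
    """Route each token to exactly one expert FFN, chosen by random partition.

    tokens: (T, D); w1: (E, D, H); w2: (E, H, D). No learned router is used.
    """
    out = np.empty_like(tokens)
    groups = random_uniform_partition(len(tokens), w1.shape[0], rng)
    for e, idx in enumerate(groups):
        hidden = np.maximum(tokens[idx] @ w1[e], 0.0)  # ReLU is an assumption
        out[idx] = hidden @ w2[e]
    return out

def ewa_step(w, share_rate):
    """Assumed EWA update, applied after an optimizer step.

    Each expert moves a fraction `share_rate` toward the mean of the other
    experts. w has shape (E, ...); the mean over experts is left unchanged.
    """
    num_experts = w.shape[0]
    others_mean = (w.sum(axis=0, keepdims=True) - w) / (num_experts - 1)
    return (1.0 - share_rate) * w + share_rate * others_mean

def merge_experts(w):
    """Collapse an MoE into a single FFN weight by averaging its experts."""
    return w.mean(axis=0)
```

In a training loop, `ewa_step` would be called on each expert weight tensor at the end of every iteration, and `merge_experts` once after training, so the inference-time model has the same shape and cost as the original ViT.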

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

AGoQ: Activation and Gradient Quantization for Memory-Efficient Distributed Training of LLMs
cs.CL 2026-05 unverdicted novelty 6.0

AGoQ delivers up to 52% lower memory use and 1.34x faster training for 8B-32B LLaMA models by using near-4-bit adaptive activations and 8-bit gradients while preserving pretraining convergence and downstream accuracy.

AGoQ: Activation and Gradient Quantization for Memory-Efficient Distributed Training of LLMs
cs.CL 2026-05 unverdicted novelty 5.0

AGoQ cuts LLM training memory by up to 52% and speeds it up by 1.34x using tailored 4-bit activations and 8-bit gradients with special communication, matching baseline accuracy on LLaMA models.