DAG-MoE: From Simple Mixture to Structural Aggregation in Mixture-of-Experts

Benyu Zhang; Dongqi Fu; Hanqing Zeng; Jiarui Feng; Jiayi Liu; Karish Grover; Qiang Zhang; Qifan Wang; Ren Chen; Ruizhong Qiu

arxiv: 2606.01062 · v1 · pith:SBA5HQL6new · submitted 2026-05-31 · 💻 cs.AI

Jiarui Feng, Hanqing Zeng, Karish Grover, Ruizhong Qiu, Yinglong Xia, Qiang Zhang, Qifan Wang, Ren Chen, Dongqi Fu, Jiayi Liu, Zhoukai Zhao, Xiangjun Fan, Benyu Zhang, Yixin Chen

classification 💻 cs.AI

keywords aggregation, dag-moe, experts, expert, language, mixture-of-experts, models, performance

abstract

Mixture-of-Experts (MoE) models have become a leading approach for decoupling parameter count from computational cost in large language models, yet effectively scaling MoE performance remains a challenge. Prior work shows that fine-grained experts enlarge the space of expert combinations and improve flexibility, but they also impose substantial routing overhead, creating a new scalability bottleneck. In this paper, we explore a complementary axis for scaling -- how expert outputs are aggregated. We theoretically show that replacing the standard weighted-summation aggregation with structural aggregation expands the expert-combination space without altering the experts or router, and enables possible multi-step reasoning within a single MoE layer. To this end, we propose DAG-MoE, a sparse MoE framework that employs a lightweight module to automatically learn the optimal aggregation structure among the selected experts. Extensive experiments under standard language modeling settings show that DAG-MoE consistently improves performance in both pretraining and fine-tuning, surpassing traditional MoE baselines.
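The contrast the abstract draws between weighted-summation aggregation and structural aggregation can be illustrated with a toy sketch. The abstract does not describe DAG-MoE's aggregation module, so this is not the paper's algorithm. The function names, the hand-written `edges` dictionary, and the rule that a node's input is `x` plus the outputs of its parent nodes are all assumptions made for illustration. In the paper the structure is learned by a lightweight module, whereas here it is fixed by hand.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(router_logits, k):
    # Pick the k experts with the largest logits and renormalize their gates.
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    idx = order[:k]
    gates = softmax([router_logits[i] for i in idx])
    return idx, gates

def weighted_sum(expert_outputs, gates):
    # Standard MoE aggregation: y = sum_i g_i * E_i(x).
    dim = len(expert_outputs[0])
    return [sum(g * out[d] for g, out in zip(gates, expert_outputs))
            for d in range(dim)]

def dag_aggregate(x, experts, gates, edges):
    # HYPOTHETICAL structural aggregation (an assumption, not the paper's method).
    # edges[j] lists earlier nodes i < j whose outputs are added to the input
    # of node j, so selected experts can build on each other's results.
    outputs = []
    for j, expert in enumerate(experts):
        node_in = list(x)
        for i in edges.get(j, []):
            node_in = [a + b for a, b in zip(node_in, outputs[i])]
        outputs.append(expert(node_in))
    return weighted_sum(outputs, gates)

# Two toy "experts" standing in for the selected feed-forward blocks.
experts = [lambda v: [2 * t for t in v], lambda v: [t + 1 for t in v]]
x = [1.0, 2.0]
gates = [0.5, 0.5]

flat = weighted_sum([e(x) for e in experts], gates)   # independent experts
chained = dag_aggregate(x, experts, gates, {1: [0]})  # expert 1 also sees expert 0
```

With an empty `edges` dictionary the sketch reduces to the ordinary weighted sum, which mirrors the abstract's point that the experts and router stay unchanged and only the aggregation step differs.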
