Hierarchical Mixture-of-Experts with Two-Stage Optimization

2026 · cs.LG · arXiv 2605.08292

1 Pith paper cites this work. Polarity classification is still indexing.

abstract

Sparse Mixture-of-Experts (MoE) models scale capacity by routing each token to a small subset of experts. However, their routers exhibit a fundamental trade-off: strong load balancing can suppress expert specialization, while aggressive diversity often causes routing collapse. We propose Hi-MoE, a grouped MoE framework that decomposes routing control into two coupled levels: (i) inter-group balancing that enforces fair traffic across expert groups, and (ii) intra-group specialization that promotes complementary expert behaviors while preventing within-group collapse. Our analysis provides a principled explanation of how our hierarchical objectives reshape the router, thereby promoting stable specialization and mitigating collapse. We observe consistent improvements over recent sparse-routing and grouped-MoE baselines across NLP and vision benchmarks, and confirm robustness via scaling studies (model size, expert count) and targeted ablations. In large-scale pre-training on 58B tokens, Hi-MoE-7B achieves a 5.6% perplexity reduction and a 40% improvement in expert balance over OLMoE-7B across diverse evaluation domains.
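To make the two routing levels in the abstract concrete, here is a minimal sketch that measures balance across expert groups and then within each group. It is not the Hi-MoE objective, which the abstract does not spell out: it substitutes a Switch-Transformer-style balance term at both levels and does not model the specialization term at all. The names `switch_balance`, `two_level_balance`, and `group_size`, as well as top-1 routing and contiguous groups, are assumptions made for illustration.

```python
from typing import List, Optional, Sequence, Tuple


def switch_balance(probs: Sequence[Sequence[float]], choice: Sequence[int]) -> float:
    """Switch-style balance term: n * sum_i f_i * p_i.

    probs[t][i] is the router probability of slot i for token t, and
    choice[t] is the slot token t was routed to. The value is 1.0 for a
    perfectly uniform router and grows as traffic concentrates.
    """
    n, num_tokens = len(probs[0]), len(probs)
    frac = [sum(1 for c in choice if c == i) / num_tokens for i in range(n)]
    mean_p = [sum(row[i] for row in probs) / num_tokens for i in range(n)]
    return n * sum(f * p for f, p in zip(frac, mean_p))


def two_level_balance(
    probs: Sequence[Sequence[float]], group_size: int
) -> Tuple[float, List[Optional[float]]]:
    """Return (inter-group term, per-group intra-group terms).

    Assumption: experts are split into contiguous groups of `group_size`
    and each token is routed to its top-1 expert.
    """
    n_experts = len(probs[0])
    assert n_experts % group_size == 0, "experts must split evenly into groups"
    n_groups = n_experts // group_size
    top1 = [max(range(n_experts), key=row.__getitem__) for row in probs]

    # Inter-group level: pool expert probabilities into group probabilities.
    group_probs = [
        [sum(row[g * group_size:(g + 1) * group_size]) for g in range(n_groups)]
        for row in probs
    ]
    inter = switch_balance(group_probs, [e // group_size for e in top1])

    # Intra-group level: renormalize within each group over its own tokens.
    intra: List[Optional[float]] = []
    for g in range(n_groups):
        rows, picks = [], []
        for row, e in zip(probs, top1):
            if e // group_size != g:
                continue
            local = row[g * group_size:(g + 1) * group_size]
            total = sum(local)
            rows.append([p / total for p in local])
            picks.append(e % group_size)
        # None marks a group that received no tokens at all.
        intra.append(switch_balance(rows, picks) if rows else None)
    return inter, intra
```

With four experts in two groups and four tokens that each favor a different expert, both levels come out at approximately 1.0. If every token favors the same expert, both the inter-group term and that expert's intra-group term rise above 1.0, and the unused group is reported as `None`.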

representative citing papers

Scalable Knowledge Editing for Mixture-of-Experts LLMs via Tensor-Structured Updates

cs.LG · 2026-05-15 · unverdicted · novelty 6.0

A MEMIT-style knowledge editing framework for MoE LLMs that formulates per-expert updates via tensor structure and applies Woodbury identity for low-rank inversions, achieving up to 6x speedup with comparable editing quality.

citing papers explorer

Showing 1 of 1 citing paper.

Scalable Knowledge Editing for Mixture-of-Experts LLMs via Tensor-Structured Updates

cs.LG · 2026-05-15 · unverdicted · none · ref 11 · internal anchor

A MEMIT-style knowledge editing framework for MoE LLMs that formulates per-expert updates via tensor structure and applies Woodbury identity for low-rank inversions, achieving up to 6x speedup with comparable editing quality.
