MoE Routing Testbed: Studying Expert Specialization and Routing Behavior at Small Scale

Chandana Satya Prakash; Chankrisna Richy Meas; Gamaleldin F. Elsayed; Krishna Kompella; M Saiful Bari; Nicolas Anastassacos; Nitesh Sekhar; Samson Tan; Tobias Falke

arxiv: 2604.07030 · v1 · submitted 2026-04-08 · 💻 cs.LG

Pith reviewed 2026-05-10 18:55 UTC · model grok-4.3

classification 💻 cs.LG

keywords mixture of experts · expert routing · specialization · routing testbed · large language models · expert utilization · scaling behavior

The pith

A testbed for MoE routing shows that balancing assignment scope enables both expert specialization and high utilization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops the MoE Routing Testbed to examine how routing decisions affect expert specialization in small mixture-of-experts models trained on data with distinct domains. By introducing a reference router that assigns tokens ideally based on domain, the testbed creates a clear upper bound against which real routing methods can be measured for specialization and utilization. Experiments with multiple routing techniques show that the scope over which expert load is balanced determines whether experts can specialize effectively without leaving some experts underused. This balancing principle continues to hold when the same ideas are applied to models thirty-five times larger.

Core claim

The MoE Routing Testbed pairs a data mixture containing clearly distinguishable domains with a reference router that dictates perfect domain-based routing. This reference serves as an ideal benchmark for measuring how well actual routers achieve expert specialization. Testing various routing approaches reveals that balancing the scope of routing decisions is the key element allowing specialization while preserving high utilization across experts. The same balancing effect scales successfully to models that are 35 times larger.

What carries the argument

The MoE Routing Testbed, consisting of domain-distinguishable data and a domain-prescribing reference router that provides an upper bound for specialization metrics.
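As a rough illustration of what a domain-prescribing reference router could look like, the sketch below maps each domain-specific token to a fixed expert and leaves generic tokens unprescribed, in line with the Figure 1 caption. The one-expert-per-domain mapping, the domain names, and the name `reference_route` are assumptions of this sketch, not details taken from the paper.

```python
from typing import Dict, List, Optional

# Hypothetical mapping: one expert per domain (an assumption of this sketch).
DOMAIN_TO_EXPERT: Dict[str, int] = {"code": 0, "math": 1, "biomed": 2}

def reference_route(domains: List[Optional[str]]) -> List[Optional[int]]:
    """Return the prescribed expert per token, or None for generic tokens."""
    return [None if d is None else DOMAIN_TO_EXPERT[d] for d in domains]

# Token domains for a short sequence; None marks a generic token.
routes = reference_route(["code", None, "math", "code"])
```

Generic tokens receive no prescription here, so a learned router would presumably be compared against the reference only on domain-specific tokens.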

Load-bearing premise

The domain-based reference router must accurately reflect the best possible specialization that can be achieved during actual end-to-end training.

What would settle it

Training a 35x larger MoE model with a routing method that does not balance scope and finding that experts fail to specialize or that utilization remains low would contradict the generalization claim.

Figures

Figures reproduced from arXiv: 2604.07030 by Chandana Satya Prakash, Chankrisna Richy Meas, Gamaleldin F. Elsayed, Krishna Kompella, M Saiful Bari, Nicolas Anastassacos, Nitesh Sekhar, Samson Tan, Tobias Falke.

**Figure 1.** Testbed design: Domains (color-coded) in the data mix are used to define a reference routing for domain-specific tokens (but not generic tokens). For learned routers, the domain purity of tokens sent to a specific expert can then capture the degree of specialization, along with its utilization. [Plot axes: Expert Utilization, Expert Specialization; legend: Scope (1 sequence, 2 sequences, 8 se…)]

**Figure 3.** (Left) Validation loss under reference routing at varying token splits into …

**Figure 4.** Utilization-specialization trade-offs (left) and validation losses (right) captured by …

**Figure 5.** Distribution of validation loss over the utilization-specialization landscape under …

**Figure 6.** Normalized validation losses after 2T tokens for MoE with 0.8B active (9.6B total) …

**Figure 7.** We compare our small-scale models trained on the routing testbed data mix …

**Figure 8.** Validation losses on Dolma3 validation sets after 2T tokens for 0.8B active MoE.

**Figure 9.** Alternative testbed configurations with 32 experts trained on only 7 domains or 40 …

**Figure 10.** As expert granularity increases, we observe a shift towards lower utilization for …

**Figure 11.** Per-layer utilization-specialization trade-offs for 8-layer transformers with al…

**Figure 12.** Load balancing loss aggregation over local scope versus global scope. Global, in …
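The Figure 1 caption describes measuring specialization through the domain purity of the tokens an expert receives, together with expert utilization. The paper's exact formulas are not reproduced on this page, so the sketch below uses stand-in definitions that are assumptions: purity is the share of an expert's domain-specific tokens that come from its most common domain, and utilization is the normalized entropy of the expert load. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Assignment = Tuple[int, str]  # (expert_id, domain) for one domain-specific token

def domain_purity(assignments: List[Assignment]) -> Dict[int, float]:
    """Per expert: share of its tokens that belong to its most common domain."""
    per_expert: Dict[int, Counter] = defaultdict(Counter)
    for expert, domain in assignments:
        per_expert[expert][domain] += 1
    return {e: max(c.values()) / sum(c.values()) for e, c in per_expert.items()}

def utilization(assignments: List[Assignment], num_experts: int) -> float:
    """Normalized entropy of expert load (assumes num_experts > 1).

    1.0 means tokens are spread evenly; 0.0 means a single expert gets everything.
    """
    load = Counter(expert for expert, _ in assignments)
    total = sum(load.values())
    entropy = -sum((n / total) * math.log(n / total) for n in load.values())
    return entropy / math.log(num_experts)

# Toy routing result: expert 0 sees mostly code, expert 1 sees only math.
tokens = [(0, "code"), (0, "code"), (0, "math"), (1, "math")]
purity = domain_purity(tokens)
util = utilization(tokens, num_experts=2)
```

Under these stand-in definitions, a router that collapses onto one expert scores low on utilization, while a router that spreads every domain evenly across experts scores low on purity, which mirrors the trade-off the figures plot.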

Original abstract

Sparse Mixture-of-Experts (MoE) architectures are increasingly popular for frontier large language models (LLM) but they introduce training challenges due to routing complexity. Fully leveraging parameters of an MoE model requires all experts to be well-trained and to specialize in non-redundant ways. Assessing this, however, is complicated due to lack of established metrics and, importantly, many routing techniques exhibit similar performance at smaller sizes, which is often not reflective of their behavior at large scale. To address this challenge, we propose the MoE Routing Testbed, a setup that gives clearer visibility into routing dynamics at small scale while using realistic data. The testbed pairs a data mix with clearly distinguishable domains with a reference router that prescribes ideal routing based on these domains, providing a well-defined upper bound for comparison. This enables quantifiable measurement of expert specialization. To demonstrate the value of the testbed, we compare various MoE routing approaches and show that balancing scope is the crucial factor that allows specialization while maintaining high expert utilization. We confirm that this observation generalizes to models 35x larger.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The testbed gives a workable small-scale way to quantify MoE specialization via domain data and a reference router, but the claim that balancing scope generalizes to 35x larger models sits on unshown details and a shaky transfer assumption.

The main thing here is a new testbed that pairs domain-distinguishable data with a reference router to measure how close real routers get to ideal specialization. That setup lets them run controlled comparisons at small scale where most methods otherwise look alike. They report that balancing scope drives the tradeoff between specialization and expert utilization, and they checked the same pattern on models 35 times bigger.

Referee Report

2 major / 2 minor

Summary. The paper introduces the MoE Routing Testbed, a small-scale experimental setup that pairs a data mixture containing clearly separable domains with a domain-based reference router serving as an idealized upper bound on routing performance. Using this testbed, the authors compare multiple MoE routing algorithms and conclude that the scope of load balancing is the decisive factor enabling expert specialization while preserving high utilization rates. They further report that this balancing-scope observation generalizes when the same principles are applied to models 35 times larger.

Significance. If the testbed's empirical findings on balancing scope transfer reliably beyond the testbed setting, the work could supply a practical diagnostic tool for developing routing strategies in production-scale MoE language models. The explicit reference-router baseline is a methodological strength that allows quantifiable measurement of specialization, and the emphasis on making routing differences visible at small scale, before they diverge at large scale, is a sensible research direction.

major comments (2)

[Abstract and §5] Scaling experiments: the claim that the balancing-scope observation 'generalizes to models 35x larger' is presented without concrete metrics, error bars, exact model sizes, utilization rates, or specialization scores, so the central empirical result rests on unshown data.

[§3] Testbed definition: the domain-based reference router is positioned as a realistic upper bound, yet the manuscript supplies no evidence that this idealized router is achievable under realistic token distributions or that small-scale domain separability predicts frontier-scale routing dynamics; this assumption is load-bearing for the generalization claim.

minor comments (2)

[§2] The definition of 'balancing scope' would benefit from an explicit equation or pseudocode in §2 to make cross-method comparisons unambiguous.
[Figures] Figure captions and axis labels in the experimental plots could be expanded to include the precise balancing-scope values and reference-router performance for immediate readability.
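To make the notion of balancing scope concrete, the sketch below uses a widely used auxiliary load-balancing loss for top-1 routing, N · Σ_i f_i · P_i, where f_i is the fraction of tokens sent to expert i and P_i is the mean router probability for expert i. Whether the paper uses this exact form is an assumption; the point is only to show how the same loss changes when it is aggregated per sequence (local scope) versus over all tokens pooled together (global scope). All function names and the toy inputs are hypothetical.

```python
from typing import List

Probs = List[List[float]]  # router probabilities, one row per token

def balance_loss(probs: Probs, choices: List[int], num_experts: int) -> float:
    """N * sum_i f_i * P_i over one group of tokens (1.0 when perfectly balanced)."""
    n = len(choices)
    frac = [0.0] * num_experts       # f_i: share of tokens routed to expert i
    mean_prob = [0.0] * num_experts  # P_i: mean router probability for expert i
    for row, choice in zip(probs, choices):
        frac[choice] += 1.0 / n
        for i, p in enumerate(row):
            mean_prob[i] += p / n
    return num_experts * sum(f * p for f, p in zip(frac, mean_prob))

def local_scope_loss(seq_probs: List[Probs], seq_choices: List[List[int]], num_experts: int) -> float:
    """Compute the loss separately for each sequence, then average."""
    losses = [balance_loss(p, c, num_experts) for p, c in zip(seq_probs, seq_choices)]
    return sum(losses) / len(losses)

def global_scope_loss(seq_probs: List[Probs], seq_choices: List[List[int]], num_experts: int) -> float:
    """Pool the tokens of all sequences, then compute the loss once."""
    all_probs = [row for p in seq_probs for row in p]
    all_choices = [c for cs in seq_choices for c in cs]
    return balance_loss(all_probs, all_choices, num_experts)

# Two single-domain sequences, each routed entirely to "its" expert.
seq_probs = [[[0.9, 0.1], [0.9, 0.1]], [[0.1, 0.9], [0.1, 0.9]]]
seq_choices = [[0, 0], [1, 1]]
local = local_scope_loss(seq_probs, seq_choices, num_experts=2)
glob = global_scope_loss(seq_probs, seq_choices, num_experts=2)
```

In this toy case the local-scope loss penalizes each sequence for using a single expert, while the global-scope loss sits at its balanced minimum because the two experts share the pooled load evenly. That is one plausible reading of why a wider scope could leave room for domain specialization without sacrificing utilization.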

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential of the MoE Routing Testbed as a diagnostic tool. We address the two major comments point by point below, indicating the revisions we will make to strengthen the manuscript.

Referee: [Abstract and §5] Scaling experiments: the claim that the balancing-scope observation 'generalizes to models 35x larger' is presented without concrete metrics, error bars, exact model sizes, utilization rates, or specialization scores, so the central empirical result rests on unshown data.

Authors: We agree that the generalization claim requires more detailed empirical support. In the revised version we will expand §5 to report the exact base and scaled model sizes, per-expert utilization rates, specialization scores (with the same metrics used in the small-scale experiments), and error bars computed across multiple random seeds. These additions will make the scaling results fully reproducible and allow readers to assess the strength of the transfer.

Revision: yes

Referee: [§3] Testbed definition: the domain-based reference router is positioned as a realistic upper bound, yet the manuscript supplies no evidence that this idealized router is achievable under realistic token distributions or that small-scale domain separability predicts frontier-scale routing dynamics; this assumption is load-bearing for the generalization claim.

Authors: The reference router is explicitly constructed as an idealized upper bound that exploits the artificial domain separability built into the testbed; we do not claim it is achievable or optimal under arbitrary real-world token distributions. Its role is to provide a quantifiable benchmark for measuring how closely learned routers approach perfect specialization within this controlled environment. We acknowledge that the manuscript provides only preliminary evidence that the balancing-scope finding transfers beyond the synthetic setting. In revision we will (i) clarify the idealized nature of the bound in §3 and (ii) add an explicit limitations paragraph discussing the assumptions required for extrapolation to frontier-scale, non-synthetic data.

Revision: partial

Circularity Check

0 steps flagged

No circularity: empirical comparisons against externally defined domain-based reference router

The paper introduces an empirical testbed pairing a domain-distinguishable data mix with a reference router that assigns tokens based on those domains. All reported results consist of direct measurements of expert utilization, specialization metrics, and performance against this externally specified reference; there are no equations, derivations, or fitted parameters that are later relabeled as predictions. The claim that the balancing-scope finding generalizes to 35x larger models is presented as the outcome of additional scaling experiments, not as something that follows from the small-scale inputs by construction. No self-citation chains, uniqueness theorems, or ansatzes are invoked as load-bearing steps. The setup is therefore self-contained against its stated external benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on standard machine-learning assumptions about expert specialization being desirable and measurable. No free parameters or invented entities are introduced beyond the testbed itself, and the only axiom is the domain assumption listed below.

axioms (1)

Domain assumption: A router that assigns inputs strictly by domain knowledge represents an achievable upper bound on expert specialization.
Invoked when the reference router is used to define the target for measurable specialization.

pith-pipeline@v0.9.0 · 5531 in / 1157 out tokens · 46003 ms · 2026-05-10T18:55:18.497969+00:00 · methodology

Reference graph

Works this paper leans on

5 extracted references · 5 canonical work pages · 2 internal anchors

[1]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

URL https://arxiv.org/abs/2507.06261. Damai Dai, Chengqi Deng, Chenggang Zhao, R.x. Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y. Wu, Zhenda Xie, Y.k. Li, Panpan Huang, Fuli Luo, Chong Ruan, Zhifang Sui, and Wenfeng Liang. DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language models. In Lun-Wei Ku, Andre M...

[2]

URL https://aclanthology.org/2024.acl-long.70/. DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, ...

[3]

Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen

URL https://arxiv.org/abs/2402.07871. Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. GShard: Scaling giant models with conditional computation and automatic sharding. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=qrwe7X...

[4]

These "dead experts" waste parameters and degrade model quality

Training stability: Without balancing, learned routers are prone to collapse: a small subset of experts attracts most tokens, starving other experts of gradient signal. These "dead experts" waste parameters and degrade model quality. Auxiliary balancing losses prevent this, ensuring all experts see sufficient signal for stable training

[5]

From a throughput perspective, perfect balance at every micro-batch is strictly optimal

Throughput: Expert parallelism is most efficient when each expert processes equal tokens. From a throughput perspective, perfect balance at every micro-batch is strictly optimal. Both observations pushed the field toward defaulting towards aggressive balancing but mixing them has obscured an important fact: preventing dead experts requires only that expert...
