Slicing and Dicing: Configuring Optimal Mixtures of Experts
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-13 07:32 UTC · model grok-4.3
The pith
Increasing total MoE parameters improves performance at every active-parameter scale studied.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
At every active-parameter scale studied, performance consistently improves with total MoE parameters, even at extreme total-to-active expert parameter ratios such as 128. The optimal expert size is nearly invariant to total parameter count and depends only on the active parameter count. Other choices, such as shared experts, heterogeneous experts, and load-balancing settings, have small effects relative to expert count and granularity, although dropless routing yields a consistent gain.
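To make the quantities in this claim concrete, here is a minimal sketch of parameter accounting for a single top-k routed MoE feed-forward layer. The SwiGLU-style expert shape, the dimension names, and the granularity definition are assumptions for illustration, not the paper's exact setup.

```python
# Illustrative parameter accounting for one MoE feed-forward layer.
# Assumes a SwiGLU-style expert (gate, up, down projections) and defines
# granularity as expert width relative to a reference dense FFN width;
# the paper's exact parameterization and definitions may differ.

def expert_params(d_model: int, d_expert: int) -> int:
    """Parameters in one expert: gate + up + down projections."""
    return 3 * d_model * d_expert

def moe_layer_summary(d_model: int, d_expert: int, n_experts: int,
                      top_k: int, d_ff_dense: int) -> dict:
    per_expert = expert_params(d_model, d_expert)
    total = n_experts * per_expert   # all stored expert parameters
    active = top_k * per_expert      # expert parameters used per token
    return {
        "total_expert_params": total,
        "active_expert_params": active,
        "total_to_active_ratio": total // active,  # the "sparsity" axis, e.g. 128
        "granularity": d_expert / d_ff_dense,      # e.g. 1/64 ... 1 in the figures
    }

# Example: 256 narrow experts, 2 routed per token -> total/active ratio of 128.
print(moe_layer_summary(d_model=1024, d_expert=256, n_experts=256,
                        top_k=2, d_ff_dense=4096))
```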
What carries the argument
Exhaustive sweeps of total expert count, expert dimension, shared expert size, heterogeneous sizing, and load-balancing mechanisms across thousands of pretraining runs.
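As a rough picture of what such a sweep looks like, the fragment below enumerates a toy configuration grid. The axis names and values are invented for illustration and are far smaller than the paper's actual campaign of over 2,000 runs.

```python
from itertools import product

# Hypothetical sweep axes; the paper's actual grid and values differ.
sweep = {
    "n_experts":         [8, 16, 32, 64, 128, 256],
    "expert_dim":        [128, 256, 512, 1024],
    "shared_expert_dim": [0, 256, 1024],          # 0 = no shared expert
    "load_balancing":    ["aux_loss", "dropless"],
}

# Each configuration would correspond to one independent pretraining run,
# with loss recorded at a fixed active-parameter budget.
configs = [dict(zip(sweep, values)) for values in product(*sweep.values())]
print(len(configs), "configurations in this toy grid")
```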
Load-bearing premise
That the patterns observed in pretraining loss for models up to 6.6 billion total parameters will continue to hold at much larger scales and will predict gains on downstream tasks.
What would settle it
A pretraining run at a scale well above 6.6 billion parameters in which raising total MoE parameters at fixed active count stops improving loss, or in which downstream task accuracy diverges from the pretraining trends.
Original abstract
Mixture-of-Experts (MoE) architectures have become standard in large language models, yet many of their core design choices - expert count, granularity, shared experts, load balancing, token dropping - have only been studied one or two at a time over narrow configuration ranges. It remains an open question whether these choices can be optimized independently, without considering interactions. We present the first systematic study of over 2,000 pretraining runs spanning models up to 6.6B total parameters, in which we exhaustively vary total experts, expert dimension, heterogeneous expert sizing within a single layer, shared expert size and load-balancing mechanisms. We find that at every active-parameter scale that we study, performance consistently improves with total MoE parameters even at extreme active expert parameter ratios like 128. Further, the optimal expert size is nearly invariant to total parameter count and depends only on active parameter count. Third, we see that other choices like shared experts, heterogeneous experts and load-balancing settings have small effects relative to expert count and granularity, although dropless routing yields a consistent gain. Overall, our results suggest a simpler recipe: focus on expert count and granularity, other choices have minimal effect on final quality.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper reports results from over 2,000 pretraining runs on MoE models up to 6.6B total parameters. It claims that, at fixed active-parameter count, pretraining loss improves monotonically with increasing total MoE parameters even at total-to-active ratios as high as 128; that the optimal expert dimension is essentially invariant to total model size and depends only on the active parameter budget; and that auxiliary design choices (shared experts, heterogeneous sizing, load-balancing coefficients) produce only small effects relative to expert count and granularity, while dropless routing yields a consistent gain. The authors conclude that MoE configuration can be simplified to primarily tuning expert count and granularity.
Significance. If the reported trends hold, the work supplies a practical, data-driven recipe that reduces the configuration search space for MoE models. The scale of the experimental campaign (>2,000 independent runs) and the consistency of the ordering across multiple active-parameter regimes constitute a clear empirical contribution to the literature on sparse architectures.
minor comments (3)
- [§4.1] §4.1 and Figure 3: the definition of the total-to-active parameter ratio is introduced only in the caption; moving the explicit formula to the main text would improve readability.
- [Table 2] Table 2: the reported loss differences for the shared-expert and load-balancing ablations are on the order of 0.01–0.03; adding bootstrap confidence intervals or noting the number of seeds would help readers judge whether these differences are distinguishable from noise (a generic sketch follows this list).
- [§5.3] §5.3: the discussion of downstream-task transfer is limited to a single sentence; a brief quantitative statement (or explicit statement that downstream evaluation is left for future work) would clarify the scope of the claims.
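On the point about separating small ablation gaps from noise, a percentile bootstrap over per-seed loss differences is one standard way to attach an interval. The sketch below is generic, with invented numbers, and is not tied to the paper's evaluation code.

```python
import random

def bootstrap_ci(samples, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `samples`."""
    rng = random.Random(seed)
    n = len(samples)
    means = sorted(
        sum(rng.choice(samples) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples)]
    return lo, hi

# Hypothetical per-seed loss gaps (ablation minus baseline); values are invented.
gaps = [0.018, 0.025, 0.011, 0.030, 0.016]
print(bootstrap_ci(gaps))  # an interval excluding 0 suggests the gap is not noise
```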
Simulated Author's Rebuttal
We thank the referee for the positive review, accurate summary of our findings, and recommendation to accept. The scale and consistency of the experimental results are indeed central to the contribution.
Circularity Check
No significant circularity identified
full rationale
The paper is a purely empirical study reporting results from over 2,000 independent pretraining runs up to 6.6B parameters. Central claims (performance gains with total MoE parameters at fixed active count, invariance of optimal expert dimension to total size) are direct observations from measured losses across varied configurations; no equations, derivations, fitted parameters renamed as predictions, or self-citation chains are present that would reduce any result to its own inputs by construction. The work contains no load-bearing mathematical steps or ansatzes.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: pretraining loss reliably indicates relative model quality across configurations.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Unclear: relation between the paper passage and the cited Recognition theorem.
We present the first systematic study of over 2,000 pretraining runs spanning models up to 6.6B total parameters, in which we exhaustively vary total experts, expert dimension, heterogeneous expert sizing within a single layer, shared expert size and load-balancing mechanisms.
- IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear
Unclear: relation between the paper passage and the cited Recognition theorem.
performance consistently improves with total MoE parameters even at extreme active expert parameter ratios like 128. The optimal expert size is nearly invariant to total parameter count and depends only on active parameter count.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Appendix A.1 (training data), following OLMoE / Muennighoff et al. (2025): documents from DCLM-Baseline (Li et al., 2024), StarCoder (Li et al., 2023; Kocetkov et al., 2022), peS2o (Soldaini & Lo, 2023; Soldaini et al., 2024a), arXiv (Computer, 2023), OpenWebMath (Paster et al., 2023), and others. (work page, 2025)
- [2]-[5] Fragments of Table 3 (heterogeneous MoE configurations, §3.2) and of Appendix B's preliminary hyperparameter investigations, including a learning-rate sweep over {1e-4, 4e-4, 1e-3, 4e-3, 1e-2}. (work page, 2025)
- [6]-[23] Fragments of per-dataset granularity-versus-sparsity plots (LM Average, C4, Dolma Books, Dolma CC, Dolma peS2o, Dolma Reddit, Dolma Stack, Dolma Wiki, ICE, M2D2 S2ORC, Pile, WikiText-103, BoolQ, MMLU Humanities/Other/Social Sciences) at the 50M-active-parameter scale, comparing default and dropless routing; each caption notes that adding a generalist (shared) expert gives comparable or degraded performance.
discussion (0)