pith. machine review for the scientific record.

arxiv: 2605.11689 · v1 · submitted 2026-05-12 · 💻 cs.LG · cs.CL

Recognition: 2 theorem links · Lean Theorem

Slicing and Dicing: Configuring Optimal Mixtures of Experts

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 07:32 UTC · model grok-4.3

classification: 💻 cs.LG · cs.CL
keywords: mixture of experts · moe · expert granularity · model scaling · pretraining · load balancing · sparse activation · language model design

The pith

Increasing total MoE parameters improves performance at every active-parameter scale studied.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper runs over two thousand pretraining experiments on Mixture-of-Experts models up to 6.6 billion total parameters. It shows that quality keeps rising as the total number of expert parameters grows, even when the ratio of total to active parameters reaches extremes such as 128 to 1. The optimal size for each expert stays nearly constant once the active parameter count is fixed and does not shift with overall model size. Secondary choices, including shared experts, heterogeneous expert sizes, and most load-balancing settings, produce only small changes compared with expert count and granularity, though dropless routing gives a steady lift. The results point to a simpler design rule centered on those two main levers.
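To make the two levers concrete, here is a minimal sketch of the parameter accounting for one MoE FFN layer. The variable names and the granularity convention (expert width as a fraction g of a dense FFN's width) are illustrative assumptions rather than the paper's code; the numbers are chosen so the total-to-active ratio lands at 128:1.

```python
# Hypothetical parameter accounting for a single MoE FFN layer.
d_model = 1024        # transformer hidden size (assumed for illustration)
g = 1 / 8             # expert granularity: expert width = g * dense FFN width
n_total = 512         # total experts in the layer
k_active = 4          # experts activated per token

d_ff = 4 * d_model              # width of the equivalent dense FFN
expert_width = int(g * d_ff)    # width of one fine-grained expert

# A plain two-matrix FFN expert: d_model -> expert_width -> d_model.
params_per_expert = 2 * d_model * expert_width

total_params = n_total * params_per_expert    # all expert parameters
active_params = k_active * params_per_expert  # parameters touched per token

print(f"total:active = {total_params // active_params}:1")  # 128:1
```

Under these assumptions, the paper's claim is that pushing n_total up at fixed k_active keeps improving loss, while the best expert_width is set by the active budget alone.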

Core claim

At every active-parameter scale studied, performance consistently improves with total MoE parameters, even at extreme total-to-active parameter ratios such as 128:1. The optimal expert size is nearly invariant to total parameter count and depends only on active parameter count. Other choices, such as shared experts, heterogeneous experts, and load-balancing settings, have small effects relative to expert count and granularity, although dropless routing yields a consistent gain.

What carries the argument

Exhaustive sweeps of total expert count, expert dimension, shared expert size, heterogeneous sizing, and load-balancing mechanisms across thousands of pretraining runs.

Load-bearing premise

That patterns seen in pretraining loss for models up to 6.6 billion parameters will continue at much larger scales and will predict gains on downstream tasks.

What would settle it

A pretraining run at a scale well above 6.6 billion parameters in which raising total MoE parameters at fixed active count stops improving loss, or in which downstream task accuracy diverges from the pretraining trends.

Figures

Figures reproduced from arXiv: 2605.11689 by Danielle Rothermel, Luke Zettlemoyer, Margaret Li, Sneha Kudugunta.

Figure 1
Figure 1: A Mixture of Experts Layer. In an MoE Transformer layer, each token first passes through the same self-attention mechanism. Then, router(s) activate the highest-affinity experts. Finally, the outputs of any activated FFN modules are combined in a weighted sum. A standard homogeneous MoE layer includes n experts of identical granularity g, of which k are activated. We introduce more flexible configurations … view at source ↗
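The caption sketches the layer mechanics: route each token, activate its k highest-affinity experts, and combine the activated FFN outputs in a weighted sum. A minimal NumPy rendering of that forward pass follows; the shapes, the ReLU experts, and the softmax-over-selected-logits weighting are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_expert, n_experts, k, n_tokens = 64, 32, 8, 2, 16

# Per-expert FFN weights (two matrices each) and a linear router.
W_in = rng.normal(0, 0.02, (n_experts, d_model, d_expert))
W_out = rng.normal(0, 0.02, (n_experts, d_expert, d_model))
W_router = rng.normal(0, 0.02, (d_model, n_experts))

def moe_layer(x):
    """x: (n_tokens, d_model) -> (n_tokens, d_model)."""
    logits = x @ W_router                         # token-expert affinities
    top_k = np.argsort(logits, axis=-1)[:, -k:]   # k highest-affinity experts
    sel = np.take_along_axis(logits, top_k, axis=-1)
    w = np.exp(sel - sel.max(-1, keepdims=True))  # softmax over selected logits
    w /= w.sum(-1, keepdims=True)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):                   # weighted sum of expert FFNs
        for j, e in enumerate(top_k[t]):
            h = np.maximum(x[t] @ W_in[e], 0.0)   # ReLU expert FFN
            out[t] += w[t, j] * (h @ W_out[e])
    return out

y = moe_layer(rng.normal(size=(n_tokens, d_model)))
print(y.shape)  # (16, 64)
```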
Figure 2
Figure 2: Increasing inactive expert parameters via expert size (left) or total count (center) improves performance in MoEs (§3.1). This effect is seen both when holding total number of experts fixed (left) and when holding expert granularity fixed (center). In general, increasing total parameters results in improved performance. Additional model scales in Appendix §C.4. Optimal tradeoff between expert count and granularity varies in MoEs (right) (§3.1) … view at source ↗
Figure 4
Figure 4: Heterogeneity of expert size alone does not improve MoE performance (§3.2). To explore the potential benefits of their architectural flexibility, we compare heterogeneous MoEs (indicated by dotted lines) to active- and total-parameter-matched homogeneous MoEs. Heterogeneity alone does not result in performance gains, as, at each activation sparsity s, heterogeneous MoEs with n1, n2 = a, b lie between or … view at source ↗
Figure 5
Figure 5: The inclusion of a generalist consistently degrades performance in homogeneous MoEs (§3.2). We train MoE LMs which consist of some routed experts with granularity g, as well as a generalist with granularity g_gen ∈ {1/2, 1/4, 1/8}. We compare to settings with no generalist, only routed experts with granularity g. In all settings and configurations, the addition of any granularity generalist results in … view at source ↗
Figure 6
Figure 6: Dropless routing outperforms default routing (§3.3). We compare dropless routing to the default setting, which allows tokens to be dropped. Across all scales, we find that dropless routing outperforms or performs comparably to default routing. Additional model scales in Appendix §C.4. view at source ↗
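For contrast with dropless routing, here is a sketch of the capacity mechanism that the default setting implies: each expert accepts at most a fixed number of tokens per batch, and overflow tokens are dropped, passing through only the residual path. Top-1 routing, in-order overflow handling, and the capacity value are assumptions for illustration.

```python
import numpy as np

def route_with_capacity(top1, n_experts, capacity):
    """Default routing: expert e processes at most `capacity` tokens;
    later arrivals to a full expert are dropped. Dropless routing would
    instead keep every token regardless of per-expert load."""
    load = np.zeros(n_experts, dtype=int)
    kept = np.zeros(len(top1), dtype=bool)
    for t, e in enumerate(top1):          # tokens considered in order
        if load[e] < capacity:
            load[e] += 1
            kept[t] = True
    return kept

rng = np.random.default_rng(0)
assignments = rng.integers(0, 4, size=32)   # top-1 expert per token
kept = route_with_capacity(assignments, n_experts=4, capacity=8)
print(f"dropped {int(np.sum(~kept))} of {len(kept)} tokens")
```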
Figure 7
Figure 7: Load balancing mechanisms must be tuned correctly (§3.3). We consider load balancing loss weight αLB ∈ {1e−2, 1e−4} and loss-free load balancing with bias γ ∈ {0, 1e−3} (γ = 0 indicates no loss-free mechanism). Results show that poorly chosen hyperparameters, such as high bias γ = 1e−3 with total experts n ≥ 512, may impair performance. However, all settings other than (αLB = 1e−2, γ = 1e−3) perform comp… view at source ↗
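The two knobs this caption sweeps can be sketched as follows: a Switch-style auxiliary loss scaled by αLB, and a loss-free bias with step size γ that nudges expert selection toward underloaded experts. Both formulas are assumptions about the standard mechanisms (the paper defines its exact versions in its §2), not reproductions of them.

```python
import numpy as np

def aux_load_balance_loss(probs, top1, alpha_lb):
    """Switch-style auxiliary loss (assumed form): alpha * n * sum_e f_e * P_e,
    where f_e is the fraction of tokens routed to expert e and P_e the mean
    router probability assigned to e."""
    n = probs.shape[1]
    f = np.bincount(top1, minlength=n) / len(top1)
    P = probs.mean(axis=0)
    return alpha_lb * n * float(f @ P)

def update_loss_free_bias(bias, top1, gamma):
    """Loss-free balancing (assumed form): a selection-only bias is raised by
    gamma for underloaded experts and lowered by gamma for overloaded ones."""
    load = np.bincount(top1, minlength=len(bias))
    return bias - gamma * np.sign(load - load.mean())

probs = np.full((8, 4), 0.25)               # uniform router probabilities
top1 = np.array([0, 0, 0, 0, 1, 1, 2, 3])   # imbalanced assignments
print(aux_load_balance_loss(probs, top1, alpha_lb=1e-2))   # ≈ 0.01
print(update_loss_free_bias(np.zeros(4), top1, gamma=1e-3))
```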
Figure 8
Figure 8: Homogeneous MoE configurations represented in §3. view at source ↗
Figure 9
Figure 9: Train Cross-Entropy Loss (§3.1). We plot the train cross-entropy loss, averaged over the final 50 steps, at each of 5 active parameter model sizes. For each active parameter count (column), we show all four load balancing settings with αLB ∈ {1e−4, 1e−2}, γ ∈ {0, 1e−3}. The trends seen in cross-entropy loss on train data closely follow those seen in validation data cross-entropy loss. view at source ↗
Figure 10
Figure 10: Load Balancing Loss (§3.3). We plot the train load balancing loss as defined in §2, averaged over the final 50 steps, at each of 5 active parameter model sizes. For each active parameter count (column), we show all four load balancing settings with αLB ∈ {1e−4, 1e−2}, γ ∈ {0, 1e−3}. Across all model scales, (αLB = 1e−4, γ = 0) results in higher load balancing loss overall. view at source ↗
Figure 11
Figure 11: Load Imbalance (§3.3). We define load imbalance as the ratio between the maximum and mean expert loads in a batch. We plot the train load imbalance, averaged over the final 50 steps, at each of 5 active parameter model sizes. For each active parameter count (column), we show all four load balancing settings with αLB ∈ {1e−4, 1e−2}, γ ∈ {0, 1e−3}. Across all model scales, (αLB = 1e−4, γ = 0) results in h… view at source ↗
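The imbalance metric this caption defines is simple enough to state directly: the ratio of the maximum to the mean per-expert token load in a batch, so 1.0 means perfectly balanced routing. A tiny sketch (the function name is ours):

```python
import numpy as np

def load_imbalance(expert_loads):
    """Max-to-mean ratio of per-expert token counts in a batch;
    1.0 indicates perfectly balanced routing."""
    loads = np.asarray(expert_loads, dtype=float)
    return loads.max() / loads.mean()

print(load_imbalance([10, 10, 10, 10]))  # 1.0
print(load_imbalance([37, 1, 1, 1]))     # 3.7
```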
Figure 12
Figure 12: At 10-20M scale, MoEs underperform dense baselines at all expert (count, granularity) configurations tested. view at source ↗
Figure 13
Figure 13: At 10-20M scale, MoEs do not improve over dense baselines, regardless of expert heterogeneity. [Panels plot loss against sparsity (2, 8, 32) at expert granularities g = 1/16, 1/8, 1/4, …] view at source ↗
Figure 14
Figure 14: At 10-20M scale, MoEs do not improve over dense baselines regardless of generalist inclusion. view at source ↗
Figure 15
Figure 15: At 10-20M scale, MoEs do not improve over dense baselines regardless of load balancing mechanisms. view at source ↗
Figure 16
Figure 16: MoEs outperform dense models, even at 10M active (10M - 15M total) parameters, given sufficient compute. At 10M active parameter scale, MoE LMs underperform dense counterparts if pretrained with Chinchilla-optimal tokens (200M total tokens). However, when pretraining with a 20 times greater data budget (4B total tokens), MoEs outperform dense models, exhibiting trends more similar to those seen in our 5… view at source ↗
Figure 17
Figure 17: MoEs outperform dense models, even at 20M active (20M - 48M total) parameters, given sufficient compute. At 20M active parameter scale, MoE LMs underperform dense counterparts if pretrained with Chinchilla-optimal tokens (400M total tokens). However, when pretraining with a 5 times greater data budget (2B total tokens), MoEs outperform dense models, exhibiting trends more similar to those seen in our 50… view at source ↗
Figure 18
Figure 18: Increasing inactive expert parameters via expert size (left) or total count (center) improves performance in MoEs (§3.1). This effect is seen both when holding total number of experts fixed (left) and when holding expert granularity fixed (center). In general, increasing total parameters results in improved performance. Optimal tradeoff between expert count and granularity varies in MoEs (right) (§3.1) … view at source ↗
Figure 19
Figure 19: Heterogeneity of expert size alone does not improve MoE performance (§3.2). To explore the potential benefits of their architectural flexibility, we compare heterogeneous MoEs (indicated by dotted lines) to active- and total-parameter-matched homogeneous MoEs. Heterogeneity alone does not result in performance gains, as, at each activation sparsity s, heterogeneous MoEs with n1, n2 = a, b lie between or … view at source ↗
Figure 20
Figure 20: The inclusion of a generalist consistently degrades performance in homogeneous MoEs (§3.2). We train MoE LMs which consist of some routed experts with granularity g, as well as a generalist with granularity g_gen ∈ {1/2, 1/4, 1/8}. We compare to settings with no generalist, only routed experts with granularity g. In all settings and configurations, the addition of any granularity generalist results in … view at source ↗
Figure 21
Figure 21: The inclusion of a generalist consistently degrades performance in heterogeneous MoEs (§3.2). We train heterogeneous MoE LMs which consist of routed experts with granularity g1, g2, as well as a generalist with granularity g_gen = 1/2. We compare to settings with no generalist. In all settings and configurations, the addition of a generalist results in comparable or degraded performance. view at source ↗
Figure 22
Figure 22: Dropless routing outperforms default routing (§3.3). We compare dropless routing to the default setting, which allows tokens to be dropped. Across all scales, we find that dropless routing outperforms or performs comparably to default routing. view at source ↗
Figure 23
Figure 23: Dropless routing, with bias γ = 1e−3 (§3.3). As in … view at source ↗
Figure 24
Figure 24: Load balancing mechanisms must be tuned correctly (§3.3). We consider load balancing loss weight αLB ∈ {1e−2, 1e−4} and loss-free load balancing with bias γ ∈ {0, 1e−3} (γ = 0 indicates no loss-free mechanism). Results show that poorly chosen hyperparameters, such as high bias γ = 1e−3 with total experts n ≥ 512, may impair performance. However, all settings other than (αLB = 1e−2, γ = 1e−3) perform com… view at source ↗
Figure 25
Figure 25: At sufficiently large compute scales, MoE performance on HellaSwag accuracy mirrors cross-entropy loss (§3.3). view at source ↗
Figure 26
Figure 26. Figure 26: Increasing inactive expert parameters via expert size (left) or total count (center) improves performance in MoEs (§3.1). This effect is seen both when holding total number of experts fixed (left) and when holding expert granularity fixed (center). In general, increasing total parameters results in improved performance. Optimal tradeoff between expert count and granularity varies in MoEs (right). (§3.1) A… view at source ↗
Figure 27
Figure 27. Figure 27: Heterogeneity of expert size alone does not improve MoE performance (§3.2). To explore the potential benefits of their architectural flexibility, we compare heterogeneous MoEs (indicated by dotted lines) to active- and total-parameter-matched homogeneous MoEs. Heterogeneity alone does not result in performance gains, as, at each activation sparsity s, heterogeneous MoEs with n1, n2 = a, b lie between or n… view at source ↗
Figure 28
Figure 28. Figure 28: The inclusion of a generalist consistently degrades performance in homogeneous MoEs (§3.2). We train MoE LMs which consist of some routed experts with granularity g, as well as a generalist with granularity ggen ∈ { 1 2 , 1 4 , 1 8 }. We compare to settings with no generalist, only routed experts with granularity g. In all settings and configurations, the addition of any granularity generalist results in … view at source ↗
Figure 29
Figure 29. Figure 29: The inclusion of a generalist consistently degrades performance in heterogeneous MoEs (§3.2). We train heterogeneous MoE LMs which consist of routed experts with granularity g1, g2, as well as a generalist with granularity ggen = 1 2 . We compare to settings with no generalist. In all settings and configurations, the addition of a generalist results in comparable or degraded performance. 1 64 1 32 1 16 1 … view at source ↗
Figure 30
Figure 30. Figure 30: Dropless routing outperforms default routing (§3.3). We compare dropless routing to the default setting, which allow tokens to be dropped. Across all scales, we find that dropless routing outperforms or performs comparably to default routing. 43 [PITH_FULL_IMAGE:figures/full_fig_p043_30.png] view at source ↗
Figure 31
Figure 31. Figure 31: Dropless routing, with bias γ = 1e−3 (§3.3). As in [PITH_FULL_IMAGE:figures/full_fig_p044_31.png] view at source ↗
Figure 32
Figure 32. Figure 32: Load balancing mechanisms must be tuned correctly (§3.3). We consider load balancing loss weight αLB ∈ {1e−2, 1e−4} and loss-free load balancing with bias γ ∈ {0, 1e−3} (γ = 0 indicates no loss-free mechanism). Results show that poorly chosen hyperparameters, such as high bias γ = 1e − 3 with total experts n ≥ 512, may impair performance. However, all settings other than (αLB = 1e−2, γ = 1e−3) perform com… view at source ↗
Figure 33
Figure 33. Figure 33: Increasing inactive expert parameters via expert size (left) or total count (center) improves performance in MoEs (§3.1). This effect is seen both when holding total number of experts fixed (left) and when holding expert granularity fixed (center). In general, increasing total parameters results in improved performance. Optimal tradeoff between expert count and granularity varies in MoEs (right). (§3.1) A… view at source ↗
Figure 34
Figure 34. Figure 34: Heterogeneity of expert size alone does not improve MoE performance (§3.2). To explore the potential benefits of their architectural flexibility, we compare heterogeneous MoEs (indicated by dotted lines) to active- and total-parameter-matched homogeneous MoEs. Heterogeneity alone does not result in performance gains, as, at each activation sparsity s, heterogeneous MoEs with n1, n2 = a, b lie between or n… view at source ↗
Figure 35
Figure 35. Figure 35: The inclusion of a generalist consistently degrades performance in homogeneous MoEs (§3.2). We train MoE LMs which consist of some routed experts with granularity g, as well as a generalist with granularity ggen ∈ { 1 2 , 1 4 , 1 8 }. We compare to settings with no generalist, only routed experts with granularity g. In all settings and configurations, the addition of any granularity generalist results in … view at source ↗
Figure 36
Figure 36. Figure 36: The inclusion of a generalist consistently degrades performance in heterogeneous MoEs (§3.2). We train heterogeneous MoE LMs which consist of routed experts with granularity g1, g2, as well as a generalist with granularity ggen = 1 2 . We compare to settings with no generalist. In all settings and configurations, the addition of a generalist results in comparable or degraded performance. 1 64 1 32 1 16 1 … view at source ↗
Figure 37
Figure 37. Figure 37: Dropless routing outperforms default routing (§3.3). We compare dropless routing to the default setting, which allow tokens to be dropped. Across all scales, we find that dropless routing outperforms or performs comparably to default routing. 50 [PITH_FULL_IMAGE:figures/full_fig_p050_37.png] view at source ↗
Figure 38
Figure 38. Figure 38: Dropless routing, with bias γ = 1e−3 (§3.3). As in [PITH_FULL_IMAGE:figures/full_fig_p051_38.png] view at source ↗
Figure 39
Figure 39. Figure 39: Load balancing mechanisms must be tuned correctly (§3.3). We consider load balancing loss weight αLB ∈ {1e−2, 1e−4} and loss-free load balancing with bias γ ∈ {0, 1e−3} (γ = 0 indicates no loss-free mechanism). Results show that poorly chosen hyperparameters, such as high bias γ = 1e − 3 with total experts n ≥ 512, may impair performance. However, all settings other than (αLB = 1e−2, γ = 1e−3) perform com… view at source ↗
Figure 40
Figure 40. Figure 40: Increasing inactive expert parameters via expert size (left) or total count (center) improves performance in MoEs (§3.1). This effect is seen both when holding total number of experts fixed (left) and when holding expert granularity fixed (center). In general, increasing total parameters results in improved performance. Optimal tradeoff between expert count and granularity varies in MoEs (right). (§3.1) A… view at source ↗
Figure 41
Figure 41. Figure 41: Heterogeneity of expert size alone does not improve MoE performance (§3.2). To explore the potential benefits of their architectural flexibility, we compare heterogeneous MoEs (indicated by dotted lines) to active- and total-parameter-matched homogeneous MoEs. Heterogeneity alone does not result in performance gains, as, at each activation sparsity s, heterogeneous MoEs with n1, n2 = a, b lie between or n… view at source ↗
Figure 42
Figure 42. Figure 42: The inclusion of a generalist consistently degrades performance in homogeneous MoEs (§3.2). We train MoE LMs which consist of some routed experts with granularity g, as well as a generalist with granularity ggen ∈ { 1 2 , 1 4 , 1 8 }. We compare to settings with no generalist, only routed experts with granularity g. In all settings and configurations, the addition of any granularity generalist results in … view at source ↗
Figure 43
Figure 43. Figure 43: The inclusion of a generalist consistently degrades performance in heterogeneous MoEs (§3.2). We train heterogeneous MoE LMs which consist of routed experts with granularity g1, g2, as well as a generalist with granularity ggen = 1 2 . We compare to settings with no generalist. In all settings and configurations, the addition of a generalist results in comparable or degraded performance. 1 64 1 32 1 16 1 … view at source ↗
Figure 44
Figure 44. Figure 44: Dropless routing outperforms default routing (§3.3). We compare dropless routing to the default setting, which allow tokens to be dropped. Across all scales, we find that dropless routing outperforms or performs comparably to default routing. 57 [PITH_FULL_IMAGE:figures/full_fig_p057_44.png] view at source ↗
Figure 45
Figure 45. Figure 45: Dropless routing, with bias γ = 1e−3 (§3.3). As in [PITH_FULL_IMAGE:figures/full_fig_p058_45.png] view at source ↗
Figure 46
Figure 46. Figure 46: Load balancing mechanisms must be tuned correctly (§3.3). We consider load balancing loss weight αLB ∈ {1e−2, 1e−4} and loss-free load balancing with bias γ ∈ {0, 1e−3} (γ = 0 indicates no loss-free mechanism). Results show that poorly chosen hyperparameters, such as high bias γ = 1e − 3 with total experts n ≥ 512, may impair performance. However, all settings other than (αLB = 1e−2, γ = 1e−3) perform com… view at source ↗
Figure 47
Figure 47. Figure 47: Increasing inactive expert parameters via expert size (left) or total count (center) improves performance in MoEs (§3.1). This effect is seen both when holding total number of experts fixed (left) and when holding expert granularity fixed (center). In general, increasing total parameters results in improved performance. Optimal tradeoff between expert count and granularity varies in MoEs (right). (§3.1) A… view at source ↗
Figure 48
Figure 48. Figure 48: Heterogeneity of expert size alone does not improve MoE performance (§3.2). To explore the potential benefits of their architectural flexibility, we compare heterogeneous MoEs (indicated by dotted lines) to active- and total-parameter-matched homogeneous MoEs. Heterogeneity alone does not result in performance gains, as, at each activation sparsity s, heterogeneous MoEs with n1, n2 = a, b lie between or n… view at source ↗
Figure 49
Figure 49. Figure 49: The inclusion of a generalist consistently degrades performance in homogeneous MoEs (§3.2). We train MoE LMs which consist of some routed experts with granularity g, as well as a generalist with granularity ggen ∈ { 1 2 , 1 4 , 1 8 }. We compare to settings with no generalist, only routed experts with granularity g. In all settings and configurations, the addition of any granularity generalist results in … view at source ↗
Figure 50
Figure 50. Figure 50: The inclusion of a generalist consistently degrades performance in heterogeneous MoEs (§3.2). We train heterogeneous MoE LMs which consist of routed experts with granularity g1, g2, as well as a generalist with granularity ggen = 1 2 . We compare to settings with no generalist. In all settings and configurations, the addition of a generalist results in comparable or degraded performance. 1 64 1 32 1 16 1 … view at source ↗
Figure 51
Figure 51. Figure 51: Dropless routing outperforms default routing (§3.3). We compare dropless routing to the default setting, which allow tokens to be dropped. Across all scales, we find that dropless routing outperforms or performs comparably to default routing. 64 [PITH_FULL_IMAGE:figures/full_fig_p064_51.png] view at source ↗
Figure 52
Figure 52. Figure 52: Dropless routing, with bias γ = 1e−3 (§3.3). As in [PITH_FULL_IMAGE:figures/full_fig_p065_52.png] view at source ↗
Figure 53
Figure 53. Figure 53: Load balancing mechanisms must be tuned correctly (§3.3). We consider load balancing loss weight αLB ∈ {1e−2, 1e−4} and loss-free load balancing with bias γ ∈ {0, 1e−3} (γ = 0 indicates no loss-free mechanism). Results show that poorly chosen hyperparameters, such as high bias γ = 1e − 3 with total experts n ≥ 512, may impair performance. However, all settings other than (αLB = 1e−2, γ = 1e−3) perform com… view at source ↗
Figure 54
Figure 54. Figure 54: Increasing inactive expert parameters via expert size (left) or total count (center) improves performance in MoEs (§3.1). This effect is seen both when holding total number of experts fixed (left) and when holding expert granularity fixed (center). In general, increasing total parameters results in improved performance. Optimal tradeoff between expert count and granularity varies in MoEs (right). (§3.1) A… view at source ↗
Figure 55
Figure 55. Figure 55: Heterogeneity of expert size alone does not improve MoE performance (§3.2). To explore the potential benefits of their architectural flexibility, we compare heterogeneous MoEs (indicated by dotted lines) to active- and total-parameter-matched homogeneous MoEs. Heterogeneity alone does not result in performance gains, as, at each activation sparsity s, heterogeneous MoEs with n1, n2 = a, b lie between or n… view at source ↗
Figure 56
Figure 56. Figure 56: The inclusion of a generalist consistently degrades performance in homogeneous MoEs (§3.2). We train MoE LMs which consist of some routed experts with granularity g, as well as a generalist with granularity ggen ∈ { 1 2 , 1 4 , 1 8 }. We compare to settings with no generalist, only routed experts with granularity g. In all settings and configurations, the addition of any granularity generalist results in … view at source ↗
Figure 57
Figure 57. Figure 57: The inclusion of a generalist consistently degrades performance in heterogeneous MoEs (§3.2). We train heterogeneous MoE LMs which consist of routed experts with granularity g1, g2, as well as a generalist with granularity ggen = 1 2 . We compare to settings with no generalist. In all settings and configurations, the addition of a generalist results in comparable or degraded performance. 1 64 1 32 1 16 1 … view at source ↗
Figure 58
Figure 58. Figure 58: Dropless routing outperforms default routing (§3.3). We compare dropless routing to the default setting, which allow tokens to be dropped. Across all scales, we find that dropless routing outperforms or performs comparably to default routing. 71 [PITH_FULL_IMAGE:figures/full_fig_p071_58.png] view at source ↗
Figure 59
Figure 59. Figure 59: Dropless routing, with bias γ = 1e−3 (§3.3). As in [PITH_FULL_IMAGE:figures/full_fig_p072_59.png] view at source ↗
Figure 60
Figure 60. Figure 60: Load balancing mechanisms must be tuned correctly (§3.3). We consider load balancing loss weight αLB ∈ {1e−2, 1e−4} and loss-free load balancing with bias γ ∈ {0, 1e−3} (γ = 0 indicates no loss-free mechanism). Results show that poorly chosen hyperparameters, such as high bias γ = 1e − 3 with total experts n ≥ 512, may impair performance. However, all settings other than (αLB = 1e−2, γ = 1e−3) perform com… view at source ↗
Figure 61
Figure 61. Figure 61: Increasing inactive expert parameters via expert size (left) or total count (center) improves performance in MoEs (§3.1). This effect is seen both when holding total number of experts fixed (left) and when holding expert granularity fixed (center). In general, increasing total parameters results in improved performance. Optimal tradeoff between expert count and granularity varies in MoEs (right). (§3.1) A… view at source ↗
Figure 62
Figure 62. Figure 62: Heterogeneity of expert size alone does not improve MoE performance (§3.2). To explore the potential benefits of their architectural flexibility, we compare heterogeneous MoEs (indicated by dotted lines) to active- and total-parameter-matched homogeneous MoEs. Heterogeneity alone does not result in performance gains, as, at each activation sparsity s, heterogeneous MoEs with n1, n2 = a, b lie between or n… view at source ↗
Figure 63
Figure 63. Figure 63: The inclusion of a generalist consistently degrades performance in homogeneous MoEs (§3.2). We train MoE LMs which consist of some routed experts with granularity g, as well as a generalist with granularity ggen ∈ { 1 2 , 1 4 , 1 8 }. We compare to settings with no generalist, only routed experts with granularity g. In all settings and configurations, the addition of any granularity generalist results in … view at source ↗
Figure 64
Figure 64. Figure 64: The inclusion of a generalist consistently degrades performance in heterogeneous MoEs (§3.2). We train heterogeneous MoE LMs which consist of routed experts with granularity g1, g2, as well as a generalist with granularity ggen = 1 2 . We compare to settings with no generalist. In all settings and configurations, the addition of a generalist results in comparable or degraded performance. 1 64 1 32 1 16 1 … view at source ↗
Figure 65
Figure 65. Figure 65: Dropless routing outperforms default routing (§3.3). We compare dropless routing to the default setting, which allow tokens to be dropped. Across all scales, we find that dropless routing outperforms or performs comparably to default routing. 78 [PITH_FULL_IMAGE:figures/full_fig_p078_65.png] view at source ↗
Figure 66
Figure 66. Figure 66: Dropless routing, with bias γ = 1e−3 (§3.3). As in [PITH_FULL_IMAGE:figures/full_fig_p079_66.png] view at source ↗
Figure 67
Figure 67. Figure 67: Load balancing mechanisms must be tuned correctly (§3.3). We consider load balancing loss weight αLB ∈ {1e−2, 1e−4} and loss-free load balancing with bias γ ∈ {0, 1e−3} (γ = 0 indicates no loss-free mechanism). Results show that poorly chosen hyperparameters, such as high bias γ = 1e − 3 with total experts n ≥ 512, may impair performance. However, all settings other than (αLB = 1e−2, γ = 1e−3) perform com… view at source ↗
Figure 68
Figure 68. Figure 68: Increasing inactive expert parameters via expert size (left) or total count (center) improves performance in MoEs (§3.1). This effect is seen both when holding total number of experts fixed (left) and when holding expert granularity fixed (center). In general, increasing total parameters results in improved performance. Optimal tradeoff between expert count and granularity varies in MoEs (right). (§3.1) A… view at source ↗
Figure 69
Figure 69. Figure 69: Heterogeneity of expert size alone does not improve MoE performance (§3.2). To explore the potential benefits of their architectural flexibility, we compare heterogeneous MoEs (indicated by dotted lines) to active- and total-parameter-matched homogeneous MoEs. Heterogeneity alone does not result in performance gains, as, at each activation sparsity s, heterogeneous MoEs with n1, n2 = a, b lie between or n… view at source ↗
Figure 70
Figure 70. Figure 70: The inclusion of a generalist consistently degrades performance in homogeneous MoEs (§3.2). We train MoE LMs which consist of some routed experts with granularity g, as well as a generalist with granularity ggen ∈ { 1 2 , 1 4 , 1 8 }. We compare to settings with no generalist, only routed experts with granularity g. In all settings and configurations, the addition of any granularity generalist results in … view at source ↗
Figure 71
Figure 71. Figure 71: The inclusion of a generalist consistently degrades performance in heterogeneous MoEs (§3.2). We train heterogeneous MoE LMs which consist of routed experts with granularity g1, g2, as well as a generalist with granularity ggen = 1 2 . We compare to settings with no generalist. In all settings and configurations, the addition of a generalist results in comparable or degraded performance. 1 64 1 32 1 16 1 … view at source ↗
Figure 72
Figure 72. Figure 72: Dropless routing outperforms default routing (§3.3). We compare dropless routing to the default setting, which allow tokens to be dropped. Across all scales, we find that dropless routing outperforms or performs comparably to default routing. 85 [PITH_FULL_IMAGE:figures/full_fig_p085_72.png] view at source ↗
Figure 73
Figure 73. Figure 73: Dropless routing, with bias γ = 1e−3 (§3.3). As in [PITH_FULL_IMAGE:figures/full_fig_p086_73.png] view at source ↗
Figure 74
Figure 74. Figure 74: Load balancing mechanisms must be tuned correctly (§3.3). We consider load balancing loss weight αLB ∈ {1e−2, 1e−4} and loss-free load balancing with bias γ ∈ {0, 1e−3} (γ = 0 indicates no loss-free mechanism). Results show that poorly chosen hyperparameters, such as high bias γ = 1e − 3 with total experts n ≥ 512, may impair performance. However, all settings other than (αLB = 1e−2, γ = 1e−3) perform com… view at source ↗
Figure 75
Figure 75. Figure 75: Increasing inactive expert parameters via expert size (left) or total count (center) improves performance in MoEs (§3.1). This effect is seen both when holding total number of experts fixed (left) and when holding expert granularity fixed (center). In general, increasing total parameters results in improved performance. Optimal tradeoff between expert count and granularity varies in MoEs (right). (§3.1) A… view at source ↗
Figure 76
Figure 76. Figure 76: Heterogeneity of expert size alone does not improve MoE performance (§3.2). To explore the potential benefits of their architectural flexibility, we compare heterogeneous MoEs (indicated by dotted lines) to active- and total-parameter-matched homogeneous MoEs. Heterogeneity alone does not result in performance gains, as, at each activation sparsity s, heterogeneous MoEs with n1, n2 = a, b lie between or n… view at source ↗
Figure 77
Figure 77. Figure 77: The inclusion of a generalist consistently degrades performance in homogeneous MoEs (§3.2). We train MoE LMs which consist of some routed experts with granularity g, as well as a generalist with granularity ggen ∈ { 1 2 , 1 4 , 1 8 }. We compare to settings with no generalist, only routed experts with granularity g. In all settings and configurations, the addition of any granularity generalist results in … view at source ↗
Figure 78
Figure 78. Figure 78: The inclusion of a generalist consistently degrades performance in heterogeneous MoEs (§3.2). We train heterogeneous MoE LMs which consist of routed experts with granularity g1, g2, as well as a generalist with granularity ggen = 1 2 . We compare to settings with no generalist. In all settings and configurations, the addition of a generalist results in comparable or degraded performance. 1 64 1 32 1 16 1 … view at source ↗
Figure 79
Figure 79. Figure 79: Dropless routing outperforms default routing (§3.3). We compare dropless routing to the default setting, which allow tokens to be dropped. Across all scales, we find that dropless routing outperforms or performs comparably to default routing. 92 [PITH_FULL_IMAGE:figures/full_fig_p092_79.png] view at source ↗
Figure 80
Figure 80. Figure 80: Dropless routing, with bias γ = 1e−3 (§3.3). As in [PITH_FULL_IMAGE:figures/full_fig_p093_80.png] view at source ↗
Figure 81
Figure 81. Figure 81: Load balancing mechanisms must be tuned correctly (§3.3). We consider load balancing loss weight αLB ∈ {1e−2, 1e−4} and loss-free load balancing with bias γ ∈ {0, 1e−3} (γ = 0 indicates no loss-free mechanism). Results show that poorly chosen hyperparameters, such as high bias γ = 1e − 3 with total experts n ≥ 512, may impair performance. However, all settings other than (αLB = 1e−2, γ = 1e−3) perform com… view at source ↗
Figure 82
Figure 82. Figure 82: Increasing inactive expert parameters via expert size (left) or total count (center) improves performance in MoEs (§3.1). This effect is seen both when holding total number of experts fixed (left) and when holding expert granularity fixed (center). In general, increasing total parameters results in improved performance. Optimal tradeoff between expert count and granularity varies in MoEs (right). (§3.1) A… view at source ↗
Figure 83
Figure 83. Figure 83: Heterogeneity of expert size alone does not improve MoE performance (§3.2). To explore the potential benefits of their architectural flexibility, we compare heterogeneous MoEs (indicated by dotted lines) to active- and total-parameter-matched homogeneous MoEs. Heterogeneity alone does not result in performance gains, as, at each activation sparsity s, heterogeneous MoEs with n1, n2 = a, b lie between or n… view at source ↗
Figure 84
Figure 84. Figure 84: The inclusion of a generalist consistently degrades performance in homogeneous MoEs (§3.2). We train MoE LMs which consist of some routed experts with granularity g, as well as a generalist with granularity ggen ∈ { 1 2 , 1 4 , 1 8 }. We compare to settings with no generalist, only routed experts with granularity g. In all settings and configurations, the addition of any granularity generalist results in … view at source ↗
Figure 85
Figure 85. Figure 85: The inclusion of a generalist consistently degrades performance in heterogeneous MoEs (§3.2). We train heterogeneous MoE LMs which consist of routed experts with granularity g1, g2, as well as a generalist with granularity ggen = 1 2 . We compare to settings with no generalist. In all settings and configurations, the addition of a generalist results in comparable or degraded performance. 1 64 1 32 1 16 1 … view at source ↗
Figure 86
Figure 86. Figure 86: Dropless routing outperforms default routing (§3.3). We compare dropless routing to the default setting, which allow tokens to be dropped. Across all scales, we find that dropless routing outperforms or performs comparably to default routing. 99 [PITH_FULL_IMAGE:figures/full_fig_p099_86.png] view at source ↗
Figure 87
Figure 87. Figure 87: Dropless routing, with bias γ = 1e−3 (§3.3). As in [PITH_FULL_IMAGE:figures/full_fig_p100_87.png] view at source ↗
Figure 88
Figure 88. Figure 88: Load balancing mechanisms must be tuned correctly (§3.3). We consider load balancing loss weight αLB ∈ {1e−2, 1e−4} and loss-free load balancing with bias γ ∈ {0, 1e−3} (γ = 0 indicates no loss-free mechanism). Results show that poorly chosen hyperparameters, such as high bias γ = 1e − 3 with total experts n ≥ 512, may impair performance. However, all settings other than (αLB = 1e−2, γ = 1e−3) perform com… view at source ↗
Figure 89
Figure 89. Figure 89: Increasing inactive expert parameters via expert size (left) or total count (center) improves performance in MoEs (§3.1). This effect is seen both when holding total number of experts fixed (left) and when holding expert granularity fixed (center). In general, increasing total parameters results in improved performance. Optimal tradeoff between expert count and granularity varies in MoEs (right). (§3.1) A… view at source ↗
Figure 90
Figure 90. Figure 90: Heterogeneity of expert size alone does not improve MoE performance (§3.2). To explore the potential benefits of their architectural flexibility, we compare heterogeneous MoEs (indicated by dotted lines) to active- and total-parameter-matched homogeneous MoEs. Heterogeneity alone does not result in performance gains, as, at each activation sparsity s, heterogeneous MoEs with n1, n2 = a, b lie between or n… view at source ↗
Figure 91
Figure 91. Figure 91: The inclusion of a generalist consistently degrades performance in homogeneous MoEs (§3.2). We train MoE LMs which consist of some routed experts with granularity g, as well as a generalist with granularity ggen ∈ { 1 2 , 1 4 , 1 8 }. We compare to settings with no generalist, only routed experts with granularity g. In all settings and configurations, the addition of any granularity generalist results in … view at source ↗
Figure 92
Figure 92. Figure 92: The inclusion of a generalist consistently degrades performance in heterogeneous MoEs (§3.2). We train heterogeneous MoE LMs which consist of routed experts with granularity g1, g2, as well as a generalist with granularity ggen = 1 2 . We compare to settings with no generalist. In all settings and configurations, the addition of a generalist results in comparable or degraded performance. 1 64 1 32 1 16 1 … view at source ↗
Figure 93
Figure 93. Figure 93: Dropless routing outperforms default routing (§3.3). We compare dropless routing to the default setting, which allow tokens to be dropped. Across all scales, we find that dropless routing outperforms or performs comparably to default routing. 106 [PITH_FULL_IMAGE:figures/full_fig_p106_93.png] view at source ↗
Figure 94
Figure 94. Figure 94: Dropless routing, with bias γ = 1e−3 (§3.3). As in [PITH_FULL_IMAGE:figures/full_fig_p107_94.png] view at source ↗
Figure 95
Figure 95. Figure 95: Load balancing mechanisms must be tuned correctly (§3.3). We consider load balancing loss weight αLB ∈ {1e−2, 1e−4} and loss-free load balancing with bias γ ∈ {0, 1e−3} (γ = 0 indicates no loss-free mechanism). Results show that poorly chosen hyperparameters, such as high bias γ = 1e − 3 with total experts n ≥ 512, may impair performance. However, all settings other than (αLB = 1e−2, γ = 1e−3) perform com… view at source ↗
Figure 96
Figure 96. Figure 96: Increasing inactive expert parameters via expert size (left) or total count (center) improves performance in MoEs (§3.1). This effect is seen both when holding total number of experts fixed (left) and when holding expert granularity fixed (center). In general, increasing total parameters results in improved performance. Optimal tradeoff between expert count and granularity varies in MoEs (right). (§3.1) A… view at source ↗
Figure 97
Figure 97. Figure 97: Heterogeneity of expert size alone does not improve MoE performance (§3.2). To explore the potential benefits of their architectural flexibility, we compare heterogeneous MoEs (indicated by dotted lines) to active- and total-parameter-matched homogeneous MoEs. Heterogeneity alone does not result in performance gains, as, at each activation sparsity s, heterogeneous MoEs with n1, n2 = a, b lie between or n… view at source ↗
Figure 98
Figure 98. Figure 98: The inclusion of a generalist consistently degrades performance in homogeneous MoEs (§3.2). We train MoE LMs which consist of some routed experts with granularity g, as well as a generalist with granularity ggen ∈ { 1 2 , 1 4 , 1 8 }. We compare to settings with no generalist, only routed experts with granularity g. In all settings and configurations, the addition of any granularity generalist results in … view at source ↗
Figure 99
Figure 99. Figure 99: The inclusion of a generalist consistently degrades performance in heterogeneous MoEs (§3.2). We train heterogeneous MoE LMs which consist of routed experts with granularity g1, g2, as well as a generalist with granularity ggen = 1 2 . We compare to settings with no generalist. In all settings and configurations, the addition of a generalist results in comparable or degraded performance. 1 64 1 32 1 16 1 … view at source ↗
Figure 100
Figure 100. Figure 100: Dropless routing outperforms default routing (§3.3). We compare dropless routing to the default setting, which allow tokens to be dropped. Across all scales, we find that dropless routing outperforms or performs comparably to default routing. 113 [PITH_FULL_IMAGE:figures/full_fig_p113_100.png] view at source ↗
Figure 101
Figure 101. Figure 101: Dropless routing, with bias γ = 1e−3 (§3.3). As in [PITH_FULL_IMAGE:figures/full_fig_p114_101.png] view at source ↗
Figure 102
Figure 102. Figure 102: Load balancing mechanisms must be tuned correctly (§3.3). We consider load balancing loss weight αLB ∈ {1e−2, 1e−4} and loss-free load balancing with bias γ ∈ {0, 1e−3} (γ = 0 indicates no loss-free mechanism). Results show that poorly chosen hyperparameters, such as high bias γ = 1e − 3 with total experts n ≥ 512, may impair performance. However, all settings other than (αLB = 1e−2, γ = 1e−3) perform co… view at source ↗
Figure 103
Figure 103. Figure 103: Increasing inactive expert parameters via expert size (left) or total count (center) improves performance in MoEs (§3.1). This effect is seen both when holding total number of experts fixed (left) and when holding expert granularity fixed (center). In general, increasing total parameters results in improved performance. Optimal tradeoff between expert count and granularity varies in MoEs (right). (§3.1) … view at source ↗
Figure 104
Figure 104. Figure 104: Heterogeneity of expert size alone does not improve MoE performance (§3.2). To explore the potential benefits of their architectural flexibility, we compare heterogeneous MoEs (indicated by dotted lines) to active- and total-parameter-matched homogeneous MoEs. Heterogeneity alone does not result in performance gains, as, at each activation sparsity s, heterogeneous MoEs with n1, n2 = a, b lie between or … view at source ↗
Figure 105
Figure 105. Figure 105: The inclusion of a generalist consistently degrades performance in homogeneous MoEs (§3.2). We train MoE LMs which consist of some routed experts with granularity g, as well as a generalist with granularity ggen ∈ { 1 2 , 1 4 , 1 8 }. We compare to settings with no generalist, only routed experts with granularity g. In all settings and configurations, the addition of any granularity generalist results in… view at source ↗
Figure 106
Figure 106. Figure 106: The inclusion of a generalist consistently degrades performance in heterogeneous MoEs (§3.2). We train heterogeneous MoE LMs which consist of routed experts with granularity g1, g2, as well as a generalist with granularity ggen = 1 2 . We compare to settings with no generalist. In all settings and configurations, the addition of a generalist results in comparable or degraded performance. 1 64 1 32 1 16 1… view at source ↗
Figure 107
Figure 107. Figure 107: Dropless routing outperforms default routing (§3.3). We compare dropless routing to the default setting, which allow tokens to be dropped. Across all scales, we find that dropless routing outperforms or performs comparably to default routing. 120 [PITH_FULL_IMAGE:figures/full_fig_p120_107.png] view at source ↗
Figure 108
Figure 108. Figure 108: Dropless routing, with bias γ = 1e−3 (§3.3). As in [PITH_FULL_IMAGE:figures/full_fig_p121_108.png] view at source ↗
Figure 109
Figure 109. Figure 109: Load balancing mechanisms must be tuned correctly (§3.3). We consider load balancing loss weight αLB ∈ {1e−2, 1e−4} and loss-free load balancing with bias γ ∈ {0, 1e−3} (γ = 0 indicates no loss-free mechanism). Results show that poorly chosen hyperparameters, such as high bias γ = 1e − 3 with total experts n ≥ 512, may impair performance. However, all settings other than (αLB = 1e−2, γ = 1e−3) perform co… view at source ↗
Figure 110
Figure 110. Figure 110: Increasing inactive expert parameters via expert size (left) or total count (center) improves performance in MoEs (§3.1). This effect is seen both when holding total number of experts fixed (left) and when holding expert granularity fixed (center). In general, increasing total parameters results in improved performance. Optimal tradeoff between expert count and granularity varies in MoEs (right). (§3.1) … view at source ↗
Figure 111
Figure 111. Figure 111: Heterogeneity of expert size alone does not improve MoE performance (§3.2). To explore the potential benefits of their architectural flexibility, we compare heterogeneous MoEs (indicated by dotted lines) to active- and total-parameter-matched homogeneous MoEs. Heterogeneity alone does not result in performance gains, as, at each activation sparsity s, heterogeneous MoEs with n1, n2 = a, b lie between or … view at source ↗
Figure 112
Figure 112. Figure 112: The inclusion of a generalist consistently degrades performance in homogeneous MoEs (§3.2). We train MoE LMs which consist of some routed experts with granularity g, as well as a generalist with granularity ggen ∈ { 1 2 , 1 4 , 1 8 }. We compare to settings with no generalist, only routed experts with granularity g. In all settings and configurations, the addition of any granularity generalist results in… view at source ↗
Figure 113
Figure 113. Figure 113: The inclusion of a generalist consistently degrades performance in heterogeneous MoEs (§3.2). We train heterogeneous MoE LMs which consist of routed experts with granularity g1, g2, as well as a generalist with granularity ggen = 1 2 . We compare to settings with no generalist. In all settings and configurations, the addition of a generalist results in comparable or degraded performance. 1 64 1 32 1 16 1… view at source ↗
Figure 114
Figure 114. Figure 114: Dropless routing outperforms default routing (§3.3). We compare dropless routing to the default setting, which allow tokens to be dropped. Across all scales, we find that dropless routing outperforms or performs comparably to default routing. 127 [PITH_FULL_IMAGE:figures/full_fig_p127_114.png] view at source ↗
Figure 115
Figure 115. Figure 115: Dropless routing, with bias γ = 1e−3 (§3.3). As in [PITH_FULL_IMAGE:figures/full_fig_p128_115.png] view at source ↗
Figure 116
Figure 116. Figure 116: Load balancing mechanisms must be tuned correctly (§3.3). We consider load balancing loss weight αLB ∈ {1e−2, 1e−4} and loss-free load balancing with bias γ ∈ {0, 1e−3} (γ = 0 indicates no loss-free mechanism). Results show that poorly chosen hyperparameters, such as high bias γ = 1e − 3 with total experts n ≥ 512, may impair performance. However, all settings other than (αLB = 1e−2, γ = 1e−3) perform co… view at source ↗
Figure 117
Figure 117. Figure 117: Increasing inactive expert parameters via expert size (left) or total count (center) improves performance in MoEs (§3.1). This effect is seen both when holding total number of experts fixed (left) and when holding expert granularity fixed (center). In general, increasing total parameters results in improved performance. Optimal tradeoff between expert count and granularity varies in MoEs (right). (§3.1) … view at source ↗
Figure 118
Figure 118. Figure 118: Heterogeneity of expert size alone does not improve MoE performance (§3.2). To explore the potential benefits of their architectural flexibility, we compare heterogeneous MoEs (indicated by dotted lines) to active- and total-parameter-matched homogeneous MoEs. Heterogeneity alone does not result in performance gains, as, at each activation sparsity s, heterogeneous MoEs with n1, n2 = a, b lie between or … view at source ↗
Figure 119
Figure 119. Figure 119: The inclusion of a generalist consistently degrades performance in homogeneous MoEs (§3.2). We train MoE LMs which consist of some routed experts with granularity g, as well as a generalist with granularity ggen ∈ { 1 2 , 1 4 , 1 8 }. We compare to settings with no generalist, only routed experts with granularity g. In all settings and configurations, the addition of any granularity generalist results in… view at source ↗
Figure 120
Figure 120. Figure 120: The inclusion of a generalist consistently degrades performance in heterogeneous MoEs (§3.2). We train heterogeneous MoE LMs which consist of routed experts with granularity g1, g2, as well as a generalist with granularity ggen = 1 2 . We compare to settings with no generalist. In all settings and configurations, the addition of a generalist results in comparable or degraded performance. 1 64 1 32 1 16 1… view at source ↗
Figure 121
Figure 121. Figure 121: Dropless routing outperforms default routing (§3.3). We compare dropless routing to the default setting, which allow tokens to be dropped. Across all scales, we find that dropless routing outperforms or performs comparably to default routing. 134 [PITH_FULL_IMAGE:figures/full_fig_p134_121.png] view at source ↗
Figure 122
Figure 122. Figure 122: Dropless routing, with bias γ = 1e−3 (§3.3). As in [PITH_FULL_IMAGE:figures/full_fig_p135_122.png] view at source ↗
Figure 123
Figure 123. Figure 123: Load balancing mechanisms must be tuned correctly (§3.3). We consider load balancing loss weight αLB ∈ {1e−2, 1e−4} and loss-free load balancing with bias γ ∈ {0, 1e−3} (γ = 0 indicates no loss-free mechanism). Results show that poorly chosen hyperparameters, such as high bias γ = 1e − 3 with total experts n ≥ 512, may impair performance. However, all settings other than (αLB = 1e−2, γ = 1e−3) perform co… view at source ↗
Figure 124
Figure 124. Figure 124: Increasing inactive expert parameters via expert size (left) or total count (center) improves performance in MoEs (§3.1). This effect is seen both when holding total number of experts fixed (left) and when holding expert granularity fixed (center). In general, increasing total parameters results in improved performance. Optimal tradeoff between expert count and granularity varies in MoEs (right). (§3.1) … view at source ↗
Figure 125
Figure 125. Figure 125: Heterogeneity of expert size alone does not improve MoE performance (§3.2). To explore the potential benefits of their architectural flexibility, we compare heterogeneous MoEs (indicated by dotted lines) to active- and total-parameter-matched homogeneous MoEs. Heterogeneity alone does not result in performance gains, as, at each activation sparsity s, heterogeneous MoEs with n1, n2 = a, b lie between or … view at source ↗
Figure 126
Figure 126. Figure 126: The inclusion of a generalist consistently degrades performance in homogeneous MoEs (§3.2). We train MoE LMs which consist of some routed experts with granularity g, as well as a generalist with granularity ggen ∈ { 1 2 , 1 4 , 1 8 }. We compare to settings with no generalist, only routed experts with granularity g. In all settings and configurations, the addition of any granularity generalist results in… view at source ↗
Figure 127
Figure 127. Figure 127: The inclusion of a generalist consistently degrades performance in heterogeneous MoEs (§3.2). We train heterogeneous MoE LMs which consist of routed experts with granularity g1, g2, as well as a generalist with granularity ggen = 1 2 . We compare to settings with no generalist. In all settings and configurations, the addition of a generalist results in comparable or degraded performance. 1 64 1 32 1 16 1… view at source ↗
Figure 128
Figure 128. Figure 128: Dropless routing outperforms default routing (§3.3). We compare dropless routing to the default setting, which allow tokens to be dropped. Across all scales, we find that dropless routing outperforms or performs comparably to default routing. 141 [PITH_FULL_IMAGE:figures/full_fig_p141_128.png] view at source ↗
Figure 129
Figure 129. Figure 129: Dropless routing, with bias γ = 1e−3 (§3.3). As in [PITH_FULL_IMAGE:figures/full_fig_p142_129.png] view at source ↗
Figure 130
Figure 130. Figure 130: Load balancing mechanisms must be tuned correctly (§3.3). We consider load balancing loss weight αLB ∈ {1e−2, 1e−4} and loss-free load balancing with bias γ ∈ {0, 1e−3} (γ = 0 indicates no loss-free mechanism). Results show that poorly chosen hyperparameters, such as high bias γ = 1e − 3 with total experts n ≥ 512, may impair performance. However, all settings other than (αLB = 1e−2, γ = 1e−3) perform co… view at source ↗
Figure 131
Figure 131. Figure 131: Increasing inactive expert parameters via expert size (left) or total count (center) improves performance in MoEs (§3.1). This effect is seen both when holding total number of experts fixed (left) and when holding expert granularity fixed (center). In general, increasing total parameters results in improved performance. Optimal tradeoff between expert count and granularity varies in MoEs (right). (§3.1) … view at source ↗
Figure 132
Figure 132. Figure 132: Heterogeneity of expert size alone does not improve MoE performance (§3.2). To explore the potential benefits of their architectural flexibility, we compare heterogeneous MoEs (indicated by dotted lines) to active- and total-parameter-matched homogeneous MoEs. Heterogeneity alone does not result in performance gains, as, at each activation sparsity s, heterogeneous MoEs with n1, n2 = a, b lie between or … view at source ↗
Figure 133
Figure 133. Figure 133: The inclusion of a generalist consistently degrades performance in homogeneous MoEs (§3.2). We train MoE LMs which consist of some routed experts with granularity g, as well as a generalist with granularity ggen ∈ { 1 2 , 1 4 , 1 8 }. We compare to settings with no generalist, only routed experts with granularity g. In all settings and configurations, the addition of any granularity generalist results in… view at source ↗
Figure 134
Figure 134. Figure 134: The inclusion of a generalist consistently degrades performance in heterogeneous MoEs (§3.2). We train heterogeneous MoE LMs which consist of routed experts with granularity g1, g2, as well as a generalist with granularity ggen = 1 2 . We compare to settings with no generalist. In all settings and configurations, the addition of a generalist results in comparable or degraded performance. 1 64 1 32 1 16 1… view at source ↗
Figure 135
Figure 135. Figure 135: Dropless routing outperforms default routing (§3.3). We compare dropless routing to the default setting, which allow tokens to be dropped. Across all scales, we find that dropless routing outperforms or performs comparably to default routing. 148 [PITH_FULL_IMAGE:figures/full_fig_p148_135.png] view at source ↗
Figure 136
Figure 136. Figure 136: Dropless routing, with bias γ = 1e−3 (§3.3). As in [PITH_FULL_IMAGE:figures/full_fig_p149_136.png] view at source ↗
Figure 137
Figure 137. Figure 137: Load balancing mechanisms must be tuned correctly (§3.3). We consider load balancing loss weight αLB ∈ {1e−2, 1e−4} and loss-free load balancing with bias γ ∈ {0, 1e−3} (γ = 0 indicates no loss-free mechanism). Results show that poorly chosen hyperparameters, such as high bias γ = 1e − 3 with total experts n ≥ 512, may impair performance. However, all settings other than (αLB = 1e−2, γ = 1e−3) perform co… view at source ↗
Figure 138
Figure 138. Figure 138: Increasing inactive expert parameters via expert size (left) or total count (center) improves performance in MoEs (§3.1). This effect is seen both when holding total number of experts fixed (left) and when holding expert granularity fixed (center). In general, increasing total parameters results in improved performance. Optimal tradeoff between expert count and granularity varies in MoEs (right). (§3.1) … view at source ↗
Figure 139
Figure 139. Figure 139: Heterogeneity of expert size alone does not improve MoE performance (§3.2). To explore the potential benefits of their architectural flexibility, we compare heterogeneous MoEs (indicated by dotted lines) to active- and total-parameter-matched homogeneous MoEs. Heterogeneity alone does not result in performance gains, as, at each activation sparsity s, heterogeneous MoEs with n1, n2 = a, b lie between or … view at source ↗
Figure 140
Figure 140. Figure 140: The inclusion of a generalist consistently degrades performance in homogeneous MoEs (§3.2). We train MoE LMs which consist of some routed experts with granularity g, as well as a generalist with granularity ggen ∈ {1/2, 1/4, 1/8}. We compare to settings with no generalist, only routed experts with granularity g. In all settings and configurations, the addition of a generalist of any granularity results in… view at source ↗
Figure 141
Figure 141. Figure 141: The inclusion of a generalist consistently degrades performance in heterogeneous MoEs (§3.2). We train heterogeneous MoE LMs which consist of routed experts with granularity g1, g2, as well as a generalist with granularity ggen = 1/2. We compare to settings with no generalist. In all settings and configurations, the addition of a generalist results in comparable or degraded performance. … view at source ↗
Figure 142
Figure 142. Figure 142: Dropless routing outperforms default routing (§3.3). We compare dropless routing to the default setting, which allows tokens to be dropped. Across all scales, we find that dropless routing outperforms or performs comparably to default routing. view at source ↗
Figure 143
Figure 143. Figure 143: Dropless routing, with bias γ = 1e−3 (§3.3). As in … view at source ↗
Figure 144
Figure 144. Figure 144: Load balancing mechanisms must be tuned correctly (§3.3). We consider load balancing loss weight αLB ∈ {1e−2, 1e−4} and loss-free load balancing with bias γ ∈ {0, 1e−3} (γ = 0 indicates no loss-free mechanism). Results show that poorly chosen hyperparameters, such as high bias γ = 1e−3 with total experts n ≥ 512, may impair performance. However, all settings other than (αLB = 1e−2, γ = 1e−3) perform co… view at source ↗
read the original abstract

Mixture-of-Experts (MoE) architectures have become standard in large language models, yet many of their core design choices - expert count, granularity, shared experts, load balancing, token dropping - have only been studied one or two at a time over narrow configuration ranges. It remains an open question whether these choices can be optimized independently, without considering interactions. We present the first systematic study of over 2,000 pretraining runs spanning models up to 6.6B total parameters, in which we exhaustively vary total experts, expert dimension, heterogeneous expert sizing within a single layer, shared expert size and load-balancing mechanisms. We find that at every active-parameter scale that we study, performance consistently improves with total MoE parameters even at extreme active expert parameter ratios like 128. Further, the optimal expert size is nearly invariant to total parameter count and depends only on active parameter count. Third, we see that other choices like shared experts, heterogeneous experts and load-balancing settings have small effects relative to expert count and granularity, although dropless routing yields a consistent gain. Overall, our results suggest a simpler recipe: focus on expert count and granularity; other choices have minimal effect on final quality.
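
To make the abstract's ratio of 128 concrete: with each expert sized as a fraction (granularity) of a dense FFN, the total-to-active parameter ratio is fixed by the expert count and the number activated. A back-of-envelope sketch, assuming a simple two-matrix FFN per expert and ignoring router and attention parameters (conventions chosen here for illustration, not taken from the paper):

```python
def moe_param_budget(d_model, d_ff_dense, n_experts, k_active, granularity):
    """Rough FFN parameter count for one MoE layer."""
    d_expert = int(granularity * d_ff_dense)   # granularity g scales expert width
    per_expert = 2 * d_model * d_expert        # up- and down-projection matrices
    total = n_experts * per_expert
    active = k_active * per_expert
    return total, active, total / active       # ratio = n_experts / k_active

# E.g., 256 experts at granularity 1/2 with 2 active gives a ratio of 128.
total, active, ratio = moe_param_budget(
    d_model=1024, d_ff_dense=4096, n_experts=256, k_active=2, granularity=0.5)
print(f"total={total/1e6:.0f}M  active={active/1e6:.0f}M  ratio={ratio:.0f}")
```

With homogeneous experts the ratio reduces to n_experts / k_active, which is why the sweep can push it to extremes simply by growing expert count at fixed activation.
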

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper reports results from over 2,000 pretraining runs on MoE models up to 6.6B total parameters. It claims that, at fixed active-parameter count, pretraining loss improves monotonically with increasing total MoE parameters even at total-to-active parameter ratios as high as 128; that the optimal expert dimension is essentially invariant to total model size and depends only on the active parameter budget; and that auxiliary design choices (shared experts, heterogeneous sizing, load-balancing coefficients) produce only small effects relative to expert count and granularity, while dropless routing yields a consistent gain. The authors conclude that MoE configuration can be simplified to primarily tuning expert count and granularity.

Significance. If the reported trends hold, the work supplies a practical, data-driven recipe that reduces the configuration search space for MoE models. The scale of the experimental campaign (>2,000 independent runs) and the consistency of the ordering across multiple active-parameter regimes constitute a clear empirical contribution to the literature on sparse architectures.

minor comments (3)
  1. [§4.1] §4.1 and Figure 3: the definition of the active-to-total parameter ratio is introduced only in the caption; moving the explicit formula to the main text would improve readability.
  2. [Table 2] Table 2: the reported loss differences for shared-expert and load-balancing ablations are on the order of 0.01–0.03; adding bootstrap confidence intervals or noting the number of seeds would help readers judge whether these differences are distinguishable from noise (a sketch of such a check follows this list).
  3. [§5.3] §5.3: the discussion of downstream-task transfer is limited to a single sentence; a brief quantitative statement (or explicit statement that downstream evaluation is left for future work) would clarify the scope of the claims.
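
On minor comment 2, a percentile bootstrap over per-seed losses is one standard way to judge whether a 0.01–0.03 loss gap clears seed noise. A minimal sketch with made-up numbers (the paper's per-seed losses are not available here):

```python
import numpy as np

def bootstrap_ci_of_diff(losses_a, losses_b, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the mean loss difference between two configs."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(losses_a), np.asarray(losses_b)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        # Resample each config's seeds with replacement, compare the means.
        diffs[i] = (rng.choice(a, a.size, replace=True).mean()
                    - rng.choice(b, b.size, replace=True).mean())
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return lo, hi  # an interval excluding 0 suggests the gap exceeds seed noise

# Hypothetical per-seed losses for two ablations (illustrative only).
print(bootstrap_ci_of_diff([2.412, 2.405, 2.418], [2.431, 2.428, 2.440]))
```
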

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive review and the accurate summary of our findings. The scale and consistency of the experimental results are indeed central to the contribution; we will address the three minor presentation points in revision.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper is a purely empirical study reporting results from over 2,000 independent pretraining runs at up to 6.6B parameters. Its central claims (performance gains with total MoE parameters at fixed active count, invariance of optimal expert dimension to total size) are direct observations from measured losses across varied configurations; there are no equations, derivations, fitted parameters relabeled as predictions, or self-citation chains that would reduce any result to its own inputs by construction. The work contains no load-bearing mathematical steps or ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the standard assumption that pretraining loss is a sufficient proxy for model quality and that the tested configuration ranges are representative of practical MoE usage.

axioms (1)
  • domain assumption Pretraining loss reliably indicates relative model quality across configurations
    All comparisons use pretraining loss; no downstream metrics are reported for the full sweep.

pith-pipeline@v0.9.0 · 5521 in / 1216 out tokens · 29973 ms · 2026-05-13T07:32:07.850034+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel · unclear

    Relation between the paper passage and the cited Recognition theorem.

    We present the first systematic study of over 2,000 pretraining runs spanning models up to 6.6B total parameters, in which we exhaustively vary total experts, expert dimension, heterogeneous expert sizing within a single layer, shared expert size and load-balancing mechanisms.

  • IndisputableMonolith/Foundation/BranchSelection.lean branch_selection · unclear

    Relation between the paper passage and the cited Recognition theorem.

    performance consistently improves with total MoE parameters even at extreme active expert parameter ratios like 128. The optimal expert size is nearly invariant to total parameter count and depends only on active parameter count.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
