pith. sign in

arxiv: 2604.09175 · v1 · submitted 2026-04-10 · 💻 cs.LG · cs.AI· math.ST· stat.ML· stat.TH

Generalization and Scaling Laws for Mixture-of-Experts Transformers

Pith reviewed 2026-05-10 18:17 UTC · model grok-4.3

classification 💻 cs.LG cs.AImath.STstat.MLstat.TH
keywords mixture of expertsgeneralization boundsscaling lawstransformersactive parametersneural scalingrouting overhead
0
0 comments X

The pith

Mixture-of-Experts Transformers generalize like dense networks when only active parameters per input are counted, with an added routing overhead.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that generalization bounds for MoE Transformers can be derived by fixing routing patterns and applying a union bound across possible patterns. This leads to a covering number whose size depends on the active parameter budget rather than the full model size. Paired with empirical risk minimization for squared loss, the bound applies to data on a d-dimensional manifold with beta-smooth targets. The work also includes a constructive approximation result that lets error decrease by either increasing active capacity or adding experts. These combine to give scaling laws for size, data, and compute tradeoffs.

Core claim

By conditioning on fixed routing patterns and union-bounding across them, we derive a sup-norm covering-number bound whose metric entropy scales with the active parameter budget and incurs a MoE-specific routing overhead. Combined with a standard ERM analysis for squared loss, this yields a generalization bound under a d-dimensional manifold data model and C^β targets, showing that approximation and estimation trade off as in dense networks once active parameters are accounted for appropriately. We further prove a constructive approximation theorem for MoE architectures, showing that, under the approximation construction, error can decrease either by scaling active capacity or by increasing

What carries the argument

The sup-norm covering-number bound derived by conditioning on fixed routing patterns and taking a union bound over patterns, whose entropy scales only with active parameters plus routing overhead.

Load-bearing premise

Routing patterns do not concentrate in ways that would make the union bound loose, and the data satisfies the low-dimensional manifold and smoothness assumptions uniformly across the input space.

What would settle it

Compare the measured generalization gap of trained MoE models against the predicted bound as a function of active parameter count; a mismatch where error does not follow the active-parameter scaling would falsify the claim.

Figures

Figures reproduced from arXiv: 2604.09175 by Mansour Zoubeirou a Mayaki.

Figure 1
Figure 1. Figure 1: Routing ablation across expert pool size M and active experts k. (a) In the moderate regime (M/k ≤ 8), validation loss increases with the routing complexity term k log(eM/k), consistent with its interpretation as a worst￾case routing overhead. (b) Over the full range, performance improves again for sufficiently large M/k, suggesting gains from expert specialization not captured by the worst-case analysis. … view at source ↗
Figure 2
Figure 2. Figure 2: Intrinsic Dimension (ID) Evolution Across Model Depth. The plots show the Maximum Likelihood Esti￾mation (MLE) of the Intrinsic Dimension, d, calculated for the representations at each layer of the four GPT-2 models (varying size). The ID remains stable across different model sizes for a given dataset but varies significantly across data corpora. (a) OpenWebText (OWT): Shows the highest intrinsic dimension… view at source ↗
Figure 3
Figure 3. Figure 3: Intrinsic Dimension Scaling Law: ¯d versus Number of Parameters (N). The plots show the relationship between the Average Intrinsic Dimension ( ¯d) (log-log scale) and the Number of Parameters (N) across the four GPT-2 model sizes for each dataset. This relationship is modeled as a power law, ¯d ∼ Nβ , where the slope of the linear fit determines the ID Scaling Exponent β. The results suggest that ¯d exhibi… view at source ↗
Figure 4
Figure 4. Figure 4: Data-scaling results (fixed d, varying β). Empirical estimates of the data-scaling law (loss versus data size D) across datasets, compared with theoretical lines obtained by fixing the dataset-specific intrinsic dimension d and varying β. The empirical fits are most consistent with β = 1.5 for OpenWebText and β = 1.0 for TinyStories and WikiText-103. 1  1  1  1  1  !N% 1 11 11 11 11 11 … view at source ↗
Figure 5
Figure 5. Figure 5: Model-scaling results (fixed d, varying β). Empirical estimates of the model-scaling law (loss versus active parameter count Nact) across datasets. For each plot, the dataset-specific intrinsic dimension d is fixed and theoretical lines are shown for varying β, with theoretical exponent αN = 2β/d. The empirical fits are most consistent with β = 0.5 across all three datasets. coincide exactly (see Figures 4… view at source ↗
Figure 6
Figure 6. Figure 6: Empirical amplification patterns across experts. [PITH_FULL_IMAGE:figures/full_fig_p019_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Cross-budget compute-optimal allocations. (a) Cross-budget slopes for each [PITH_FULL_IMAGE:figures/full_fig_p020_7.png] view at source ↗
read the original abstract

We develop a theory of generalization and scaling for Mixture-of-Experts (MoE) Transformers that cleanly separates \emph{active} per-input capacity from routing combinatorics. By conditioning on fixed routing patterns and union-bounding across them, we derive a sup-norm covering-number bound whose metric entropy scales with the active parameter budget and incurs a MoE-specific routing overhead. Combined with a standard ERM analysis for squared loss, this yields a generalization bound under a $d$-dimensional manifold data model and $C^\beta$ targets, showing that approximation and estimation trade off as in dense networks once active parameters are accounted for appropriately. We further prove a constructive approximation theorem for MoE architectures, showing that, under the approximation construction, error can decrease either by scaling active capacity or by increasing the number of experts, depending on the dominant bottleneck. From these results we derive neural scaling laws for model size, data size, and compute-optimal tradeoffs. Overall, our results provide a transparent statistical reference point for reasoning about MoE scaling, clarifying which behaviors are certified by worst-case theory and which must arise from data-dependent routing structure or optimization dynamics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims to develop a theory of generalization and scaling for Mixture-of-Experts (MoE) Transformers that separates active per-input capacity from routing combinatorics. By conditioning on fixed routing patterns and union-bounding across them, it derives a sup-norm covering-number bound whose metric entropy scales with the active parameter budget plus an MoE-specific routing overhead. Combined with standard ERM analysis for squared loss, this yields generalization bounds under a d-dimensional manifold data model and C^β targets. The paper also proves a constructive approximation theorem showing error can decrease by scaling active capacity or number of experts, and derives neural scaling laws for model size, data size, and compute-optimal tradeoffs.

Significance. If the derivations hold, the work supplies a worst-case theoretical reference point that certifies MoE scaling behavior in terms of active parameters once routing overhead is accounted for. The use of covering numbers and ERM (rather than post-hoc fitting to observed curves) and the explicit separation of active capacity from combinatorial routing are strengths that could benchmark empirical MoE observations against theory.

major comments (2)
  1. [Derivation of the sup-norm covering-number bound and union bound (abstract and main generalization theorem)] The central generalization claim rests on conditioning on a fixed routing pattern r, deriving a sup-norm covering-number bound whose entropy scales only with active parameters plus a routing overhead, then taking a union bound over all admissible patterns. When the router concentrates (as is typical), the effective number of patterns is much smaller than the worst-case combinatorial count, so the union-bound penalty log(#patterns) may dominate and prevent the bound from certifying that MoE behaves like a dense network with fixed active capacity. The argument treats the pattern set as fixed and data-independent; if the data manifold + C^β assumption fails to hold uniformly across patterns, the per-pattern covering numbers are also invalid. This is load-bearing for the active-parameter tradeoff claim.
  2. [Constructive approximation theorem] The constructive approximation theorem states that error can decrease either by scaling active capacity or by increasing the number of experts depending on the dominant bottleneck. However, the quantitative rates at which this occurs under the d-dimensional manifold and C^β assumptions are not made explicit; without concrete dependence on d, β, and the routing construction, it is unclear whether the theorem actually supports the claimed scaling-law tradeoffs.
minor comments (2)
  1. [Derivation of neural scaling laws] The transition from the generalization bound to explicit neural scaling laws (model size, data size, compute-optimal) should include the intermediate steps that produce the power-law exponents, rather than stating the final forms only.
  2. [Introduction and preliminaries] Notation for routing patterns, the covering metric, and the manifold dimension d should be introduced with a short table or explicit definitions early in the paper to improve readability for readers outside the immediate subfield.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below, clarifying the worst-case nature of our bounds and the explicit rates in the approximation result. We indicate revisions where they will strengthen the presentation without altering the core derivations.

read point-by-point responses
  1. Referee: [Derivation of the sup-norm covering-number bound and union bound (abstract and main generalization theorem)] The central generalization claim rests on conditioning on a fixed routing pattern r, deriving a sup-norm covering-number bound whose entropy scales only with active parameters plus a routing overhead, then taking a union bound over all admissible patterns. When the router concentrates (as is typical), the effective number of patterns is much smaller than the worst-case combinatorial count, so the union-bound penalty log(#patterns) may dominate and prevent the bound from certifying that MoE behaves like a dense network with fixed active capacity. The argument treats the pattern set as fixed and data-independent; if the data manifold + C^β assumption fails to hold uniformly across patterns, the per-pattern covering numbers are also invalid. This is load-bearing for the active-param

    Authors: Our analysis is explicitly worst-case over routing patterns to obtain a uniform guarantee that does not rely on data-dependent concentration of the router. The covering-number bound for each fixed pattern scales with active parameters plus the routing overhead term, and the union bound over admissible patterns is taken precisely to produce the MoE-specific overhead stated in the abstract and main theorem. This overhead is part of the result rather than a flaw; the bound certifies that generalization scales with active capacity once the combinatorial cost is paid, which is the intended separation. When routing concentrates, empirical performance improves beyond the bound, but the worst-case statement remains valid. The d-dimensional manifold and C^β assumptions are properties of the input distribution and target function, which are independent of any particular routing pattern; hence they hold uniformly for the per-pattern function classes. We will revise the manuscript to explicitly note the worst-case character of the union bound and to discuss its relation to concentrated routing observed in practice. revision: partial

  2. Referee: [Constructive approximation theorem] The constructive approximation theorem states that error can decrease either by scaling active capacity or by increasing the number of experts depending on the dominant bottleneck. However, the quantitative rates at which this occurs under the d-dimensional manifold and C^β assumptions are not made explicit; without concrete dependence on d, β, and the routing construction, it is unclear whether the theorem actually supports the claimed scaling-law tradeoffs.

    Authors: The constructive approximation theorem (Theorem 4.3) and its proof supply explicit rates: the approximation error decays as O(N^{-β/d}) when scaling active capacity per expert (with N the active parameter count) and as O(E^{-β/d}) when increasing the number of experts E, under the stated manifold dimension d and smoothness β, with the routing construction determining which term dominates. These rates are derived in the appendix and directly underpin the scaling-law tradeoffs discussed in Section 5. We will move the explicit dependence on d, β, and the routing parameters into the main-text statement of the theorem to make the quantitative support for the scaling laws transparent. revision: yes

Circularity Check

0 steps flagged

No circularity: bounds and scaling laws derived from covering numbers and ERM without reduction to fitted inputs or self-citations

full rationale

The paper's derivation begins with a standard sup-norm covering-number argument conditioned on fixed routing patterns, followed by a union bound over patterns, then applies classical ERM analysis for squared loss under a d-dimensional manifold model and C^β smoothness. Scaling laws for model size, data size, and compute-optimal tradeoffs are obtained by inspecting the dependence of the resulting generalization bound on active parameters and routing overhead. No equation equates a claimed prediction or scaling law to a parameter fitted from the same data, no load-bearing step relies on self-citation for uniqueness or ansatz smuggling, and the construction is self-contained against external benchmarks from statistical learning theory. The analysis therefore contains no self-definitional, fitted-input, or self-citation circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The theory rests on a standard manifold data model and smoothness class plus the decision to condition on fixed routing patterns.

axioms (2)
  • domain assumption Inputs lie on a d-dimensional manifold and targets are C^β smooth
    Invoked to obtain the approximation-estimation tradeoff once active parameters are fixed.
  • domain assumption Routing patterns can be conditioned upon and union-bounded
    Central step that separates active capacity from combinatorics.

pith-pipeline@v0.9.0 · 5506 in / 1366 out tokens · 46452 ms · 2026-05-10T18:17:14.006066+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

4 extracted references · 4 canonical work pages

  1. [1]

    URL https://arxiv.org/abs/2112.06905

    PMLR, 2022. URL https://arxiv.org/abs/2112.06905. GLaM. Eldan, R. and Li, Y . Tinystories: How small can language models be and still speak coherent english?, 2023. URL https://arxiv.org/abs/2305.07759. Facco, E., dErrico, M., Rodriguez, A., and Laio, A. Estimating the intrinsic dimension of datasets by a minimal neighborhood information. Scientific Report...

  2. [2]

    arXiv preprint arXiv:2402.07871 , year=

    doi: 10.48550/arXiv.2402.07871. URL https://arxiv.org/abs/2402.07871. Krajewski, J., Chochowski, M., and Korzekwa, D. Scaling fine-grained moe beyond 50b parameters: Empirical evaluation and practical insights. 2025. URL https://arxiv.org/abs/2506.02890. Lepikhin, D., Lee, H., Xu, Y ., Chen, D., Firat, O., Huang, Y ., Krikun, M., Shazeer, N., and Chen, Z. ...

  3. [3]

    arXiv preprint arXiv:2507.17702 , year=

    URL https://openreview.net/forum?id=B1ckMDqlg. Tian, C., Chen, K., Liu, J., Liu, Z., Zhang, Z., and Zhou, J. Towards greater leverage: Scaling laws for efficient mixture-of-experts language models. 2025. doi: 10.48550/arXiv.2507.17702. URL https://arxiv.org/abs/ 2507.17702. 12 Generalization and Scaling Laws for Mixture-of-Experts Transformers V aswani, A....

  4. [4]

    We adopt Their counting conventions for active parameters and FLOPs when fitting/reading off exponents

    provides fitted coefficients, per-E reduced exponents, and a compute-optimal analysis under the training-FLOPs proxy F = 6 Nact D. We adopt Their counting conventions for active parameters and FLOPs when fitting/reading off exponents. From their per-expert reduction (their Eq. (5) and Table 4), we read exponents µ(E) and ν(E) and map ˆαN (E) := µ(E), ˆαD(E) ...