Generalization and Scaling Laws for Mixture-of-Experts Transformers
Pith reviewed 2026-05-10 18:17 UTC · model grok-4.3
The pith
Mixture-of-Experts Transformers generalize like dense networks when only active parameters per input are counted, with an added routing overhead.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By conditioning on fixed routing patterns and union-bounding across them, we derive a sup-norm covering-number bound whose metric entropy scales with the active parameter budget and incurs a MoE-specific routing overhead. Combined with a standard ERM analysis for squared loss, this yields a generalization bound under a d-dimensional manifold data model and C^β targets, showing that approximation and estimation trade off as in dense networks once active parameters are accounted for appropriately. We further prove a constructive approximation theorem for MoE architectures, showing that, under the approximation construction, error can decrease either by scaling active capacity or by increasing
What carries the argument
The sup-norm covering-number bound derived by conditioning on fixed routing patterns and taking a union bound over patterns, whose entropy scales only with active parameters plus routing overhead.
Load-bearing premise
Routing patterns do not concentrate in ways that would make the union bound loose, and the data satisfies the low-dimensional manifold and smoothness assumptions uniformly across the input space.
What would settle it
Compare the measured generalization gap of trained MoE models against the predicted bound as a function of active parameter count; a mismatch where error does not follow the active-parameter scaling would falsify the claim.
Figures
read the original abstract
We develop a theory of generalization and scaling for Mixture-of-Experts (MoE) Transformers that cleanly separates \emph{active} per-input capacity from routing combinatorics. By conditioning on fixed routing patterns and union-bounding across them, we derive a sup-norm covering-number bound whose metric entropy scales with the active parameter budget and incurs a MoE-specific routing overhead. Combined with a standard ERM analysis for squared loss, this yields a generalization bound under a $d$-dimensional manifold data model and $C^\beta$ targets, showing that approximation and estimation trade off as in dense networks once active parameters are accounted for appropriately. We further prove a constructive approximation theorem for MoE architectures, showing that, under the approximation construction, error can decrease either by scaling active capacity or by increasing the number of experts, depending on the dominant bottleneck. From these results we derive neural scaling laws for model size, data size, and compute-optimal tradeoffs. Overall, our results provide a transparent statistical reference point for reasoning about MoE scaling, clarifying which behaviors are certified by worst-case theory and which must arise from data-dependent routing structure or optimization dynamics.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to develop a theory of generalization and scaling for Mixture-of-Experts (MoE) Transformers that separates active per-input capacity from routing combinatorics. By conditioning on fixed routing patterns and union-bounding across them, it derives a sup-norm covering-number bound whose metric entropy scales with the active parameter budget plus an MoE-specific routing overhead. Combined with standard ERM analysis for squared loss, this yields generalization bounds under a d-dimensional manifold data model and C^β targets. The paper also proves a constructive approximation theorem showing error can decrease by scaling active capacity or number of experts, and derives neural scaling laws for model size, data size, and compute-optimal tradeoffs.
Significance. If the derivations hold, the work supplies a worst-case theoretical reference point that certifies MoE scaling behavior in terms of active parameters once routing overhead is accounted for. The use of covering numbers and ERM (rather than post-hoc fitting to observed curves) and the explicit separation of active capacity from combinatorial routing are strengths that could benchmark empirical MoE observations against theory.
major comments (2)
- [Derivation of the sup-norm covering-number bound and union bound (abstract and main generalization theorem)] The central generalization claim rests on conditioning on a fixed routing pattern r, deriving a sup-norm covering-number bound whose entropy scales only with active parameters plus a routing overhead, then taking a union bound over all admissible patterns. When the router concentrates (as is typical), the effective number of patterns is much smaller than the worst-case combinatorial count, so the union-bound penalty log(#patterns) may dominate and prevent the bound from certifying that MoE behaves like a dense network with fixed active capacity. The argument treats the pattern set as fixed and data-independent; if the data manifold + C^β assumption fails to hold uniformly across patterns, the per-pattern covering numbers are also invalid. This is load-bearing for the active-parameter tradeoff claim.
- [Constructive approximation theorem] The constructive approximation theorem states that error can decrease either by scaling active capacity or by increasing the number of experts depending on the dominant bottleneck. However, the quantitative rates at which this occurs under the d-dimensional manifold and C^β assumptions are not made explicit; without concrete dependence on d, β, and the routing construction, it is unclear whether the theorem actually supports the claimed scaling-law tradeoffs.
minor comments (2)
- [Derivation of neural scaling laws] The transition from the generalization bound to explicit neural scaling laws (model size, data size, compute-optimal) should include the intermediate steps that produce the power-law exponents, rather than stating the final forms only.
- [Introduction and preliminaries] Notation for routing patterns, the covering metric, and the manifold dimension d should be introduced with a short table or explicit definitions early in the paper to improve readability for readers outside the immediate subfield.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We address each major point below, clarifying the worst-case nature of our bounds and the explicit rates in the approximation result. We indicate revisions where they will strengthen the presentation without altering the core derivations.
read point-by-point responses
-
Referee: [Derivation of the sup-norm covering-number bound and union bound (abstract and main generalization theorem)] The central generalization claim rests on conditioning on a fixed routing pattern r, deriving a sup-norm covering-number bound whose entropy scales only with active parameters plus a routing overhead, then taking a union bound over all admissible patterns. When the router concentrates (as is typical), the effective number of patterns is much smaller than the worst-case combinatorial count, so the union-bound penalty log(#patterns) may dominate and prevent the bound from certifying that MoE behaves like a dense network with fixed active capacity. The argument treats the pattern set as fixed and data-independent; if the data manifold + C^β assumption fails to hold uniformly across patterns, the per-pattern covering numbers are also invalid. This is load-bearing for the active-param
Authors: Our analysis is explicitly worst-case over routing patterns to obtain a uniform guarantee that does not rely on data-dependent concentration of the router. The covering-number bound for each fixed pattern scales with active parameters plus the routing overhead term, and the union bound over admissible patterns is taken precisely to produce the MoE-specific overhead stated in the abstract and main theorem. This overhead is part of the result rather than a flaw; the bound certifies that generalization scales with active capacity once the combinatorial cost is paid, which is the intended separation. When routing concentrates, empirical performance improves beyond the bound, but the worst-case statement remains valid. The d-dimensional manifold and C^β assumptions are properties of the input distribution and target function, which are independent of any particular routing pattern; hence they hold uniformly for the per-pattern function classes. We will revise the manuscript to explicitly note the worst-case character of the union bound and to discuss its relation to concentrated routing observed in practice. revision: partial
-
Referee: [Constructive approximation theorem] The constructive approximation theorem states that error can decrease either by scaling active capacity or by increasing the number of experts depending on the dominant bottleneck. However, the quantitative rates at which this occurs under the d-dimensional manifold and C^β assumptions are not made explicit; without concrete dependence on d, β, and the routing construction, it is unclear whether the theorem actually supports the claimed scaling-law tradeoffs.
Authors: The constructive approximation theorem (Theorem 4.3) and its proof supply explicit rates: the approximation error decays as O(N^{-β/d}) when scaling active capacity per expert (with N the active parameter count) and as O(E^{-β/d}) when increasing the number of experts E, under the stated manifold dimension d and smoothness β, with the routing construction determining which term dominates. These rates are derived in the appendix and directly underpin the scaling-law tradeoffs discussed in Section 5. We will move the explicit dependence on d, β, and the routing parameters into the main-text statement of the theorem to make the quantitative support for the scaling laws transparent. revision: yes
Circularity Check
No circularity: bounds and scaling laws derived from covering numbers and ERM without reduction to fitted inputs or self-citations
full rationale
The paper's derivation begins with a standard sup-norm covering-number argument conditioned on fixed routing patterns, followed by a union bound over patterns, then applies classical ERM analysis for squared loss under a d-dimensional manifold model and C^β smoothness. Scaling laws for model size, data size, and compute-optimal tradeoffs are obtained by inspecting the dependence of the resulting generalization bound on active parameters and routing overhead. No equation equates a claimed prediction or scaling law to a parameter fitted from the same data, no load-bearing step relies on self-citation for uniqueness or ansatz smuggling, and the construction is self-contained against external benchmarks from statistical learning theory. The analysis therefore contains no self-definitional, fitted-input, or self-citation circularity.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Inputs lie on a d-dimensional manifold and targets are C^β smooth
- domain assumption Routing patterns can be conditioned upon and union-bounded
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
By conditioning on fixed routing patterns and union-bounding across them, we derive a sup-norm covering-number bound whose metric entropy scales with the active parameter budget and incurs a MoE-specific routing overhead.
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Lemma 3.7 (MoE covering number) ... log N(δ, TMoE, k·k∞) ≲ (Πattn + LT k Πexp) log(κ R M0 / δ) + LT ℓ k log(eM/k)
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
URL https://arxiv.org/abs/2112.06905
PMLR, 2022. URL https://arxiv.org/abs/2112.06905. GLaM. Eldan, R. and Li, Y . Tinystories: How small can language models be and still speak coherent english?, 2023. URL https://arxiv.org/abs/2305.07759. Facco, E., dErrico, M., Rodriguez, A., and Laio, A. Estimating the intrinsic dimension of datasets by a minimal neighborhood information. Scientific Report...
-
[2]
arXiv preprint arXiv:2402.07871 , year=
doi: 10.48550/arXiv.2402.07871. URL https://arxiv.org/abs/2402.07871. Krajewski, J., Chochowski, M., and Korzekwa, D. Scaling fine-grained moe beyond 50b parameters: Empirical evaluation and practical insights. 2025. URL https://arxiv.org/abs/2506.02890. Lepikhin, D., Lee, H., Xu, Y ., Chen, D., Firat, O., Huang, Y ., Krikun, M., Shazeer, N., and Chen, Z. ...
-
[3]
arXiv preprint arXiv:2507.17702 , year=
URL https://openreview.net/forum?id=B1ckMDqlg. Tian, C., Chen, K., Liu, J., Liu, Z., Zhang, Z., and Zhou, J. Towards greater leverage: Scaling laws for efficient mixture-of-experts language models. 2025. doi: 10.48550/arXiv.2507.17702. URL https://arxiv.org/abs/ 2507.17702. 12 Generalization and Scaling Laws for Mixture-of-Experts Transformers V aswani, A....
-
[4]
provides fitted coefficients, per-E reduced exponents, and a compute-optimal analysis under the training-FLOPs proxy F = 6 Nact D. We adopt Their counting conventions for active parameters and FLOPs when fitting/reading off exponents. From their per-expert reduction (their Eq. (5) and Table 4), we read exponents µ(E) and ν(E) and map ˆαN (E) := µ(E), ˆαD(E) ...
work page 2020
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.