Hyperparameter Transfer with Mixture-of-Expert Layers
Pith reviewed 2026-05-22 12:31 UTC · model grok-4.3
The pith
A new parameterization for mixture-of-experts transformers enables reliable hyperparameter transfer when scaling models from 51 million to over 2 billion parameters at fixed token budgets.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors claim that a parameterization for the hyperparameters of MoE-augmented transformers, obtained from dynamical mean-field theory, keeps optimal values stable when model width, depth, number of experts, and expert size are varied at fixed token budget. This stability produces reliable hyperparameter transfer in practice from 51M-parameter models up to models exceeding 2B total parameters, and it permits hyperparameters identified on small models with short token horizons to be used successfully for larger models trained on longer horizons.
What carries the argument
The proposed parameterization of hyperparameters for mixture-of-experts layers in transformers, derived from dynamical mean-field theory predictions of scaling behavior across widths, depths, expert counts, and expert sizes.
If this is right
- Hyperparameters tuned on small MoE models can be used without change on larger models trained at the same token budget.
- Optimal hyperparameters stay consistent when the number or hidden size of experts is increased.
- Short-horizon sweeps performed on small models yield hyperparameters that work for long-horizon training of large models.
- The computational expense of hyperparameter search for large-scale MoE training is reduced because full-scale retuning is no longer required.
Where Pith is reading between the lines
- The same style of parameterization might reduce retuning costs for other sparse activation schemes that decouple total and active parameters.
- Dynamical mean-field theory could supply scaling prescriptions for additional architectural choices such as attention head count or residual connections.
- Adopting the parameterization may make hyperparameter selection more standardized across different MoE library implementations.
Load-bearing premise
The dynamical mean-field theory analysis accurately predicts how optimal hyperparameters scale with changes in model width, depth, number of experts, and expert size.
What would settle it
Finding that the best learning rate or other hyperparameters must be retuned when the number of experts is increased from 8 to 64 while the total token count is held fixed would falsify the transfer claim.
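The falsification test above can be phrased as a small check on sweep results. The following is a minimal sketch under stated assumptions, not a procedure from the paper: each sweep is a mapping from learning rate to final loss on a geometric grid, and the names `optimal_lr`, `optimum_shift_in_grid_steps`, `transfer_holds`, and the one-grid-step tolerance are all hypothetical choices made for illustration.

```python
import math


def optimal_lr(sweep):
    """Return the learning rate with the lowest final loss in one sweep.

    `sweep` maps learning rate -> final loss for a single model configuration.
    """
    if not sweep:
        raise ValueError("sweep is empty")
    return min(sweep, key=sweep.get)


def optimum_shift_in_grid_steps(sweep_a, sweep_b, grid_ratio=2.0):
    """Distance between the two sweeps' optimal learning rates, in grid steps.

    Assumes both sweeps use a geometric grid with spacing `grid_ratio`.
    """
    lr_a, lr_b = optimal_lr(sweep_a), optimal_lr(sweep_b)
    return abs(math.log(lr_b / lr_a)) / math.log(grid_ratio)


def transfer_holds(sweep_a, sweep_b, grid_ratio=2.0, tolerance_steps=1.0):
    """True if the optimum moves by at most `tolerance_steps` grid steps.

    The tolerance is a judgment call (hypothetical default), not a threshold
    taken from the paper.
    """
    shift = optimum_shift_in_grid_steps(sweep_a, sweep_b, grid_ratio)
    return shift <= tolerance_steps + 1e-9
```

Here `sweep_a` would hold the 8-expert sweep and `sweep_b` the 64-expert sweep at the same token budget. A stricter variant could also compare the loss penalty incurred by reusing the small-model optimum on the larger model, since a shifted optimum matters little if the loss curve is flat around it.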
Original abstract
Mixture-of-Experts (MoE) layers have emerged as an important tool in scaling up modern neural networks by decoupling total trainable parameters from activated parameters in the forward pass for each token. However, sparse MoEs add complexity to training due to (i) new trainable parameters (router weights) that, like all other parameter groups, require hyperparameter (HP) tuning; (ii) new architecture scale dimensions (number of and size of experts) that must be chosen and potentially taken large. To make HP selection cheap and reliable, we propose a new parameterization for transformer models with MoE layers when scaling model width, depth, number of experts, and expert (hidden) size. Our parameterization is justified by a novel dynamical mean-field theory (DMFT) analysis. When varying different model dimensions trained at a fixed token budget, we find empirically that our parameterization enables reliable HP transfer across models from 51M to over 2B total parameters. We further take HPs identified from sweeping small models on a short token horizon to train larger models on longer horizons and report performant model behaviors.
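The decoupling of total from activated parameters described in the abstract can be made concrete with a quick count. The following is a minimal sketch, not the paper's accounting: it assumes each expert is a two-matrix feed-forward block (up- and down-projection, no biases and no gating matrix), a linear router of shape n_embd x n_experts, and top-k routing. The function name and its arguments are hypothetical.

```python
def moe_layer_param_counts(n_embd, n_hidden, n_experts, top_k):
    """Return (total, active) parameter counts for one MoE feed-forward layer.

    Assumptions (illustrative, not taken from the paper): each expert has an
    up-projection (n_embd x n_hidden) and a down-projection (n_hidden x n_embd)
    with no biases; the router is a single n_embd x n_experts matrix; each
    token is processed by exactly top_k experts.
    """
    if not 1 <= top_k <= n_experts:
        raise ValueError("top_k must be between 1 and n_experts")
    per_expert = 2 * n_embd * n_hidden   # up- plus down-projection
    router = n_embd * n_experts          # router weights are always used
    total = n_experts * per_expert + router
    active = top_k * per_expert + router
    return total, active


if __name__ == "__main__":
    total, active = moe_layer_param_counts(
        n_embd=512, n_hidden=2048, n_experts=8, top_k=2
    )
    print(f"total={total:,} active={active:,}")
```

Under these assumptions, raising `n_experts` grows the total count while the active count changes only through the router term, which is why expert count and expert size become separate scaling dimensions that the parameterization has to handle.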
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a parameterization for hyperparameters in transformer models incorporating Mixture-of-Experts (MoE) layers. The parameterization is derived from a novel dynamical mean-field theory (DMFT) analysis and is claimed to enable reliable hyperparameter transfer when scaling model width, depth, number of experts, and expert size at fixed token budget. Empirical results show successful transfer across models ranging from 51M to over 2B total parameters, and further demonstrate that hyperparameters identified on small models with short token horizons can be transferred to larger models trained on longer horizons while maintaining performant behavior.
Significance. If the central claim holds, the work would meaningfully reduce the cost of hyperparameter tuning for large-scale MoE models by enabling transfer from smaller-scale experiments. The novel application of DMFT to derive scaling rules for MoE-specific dimensions (expert count and size) alongside empirical validation up to 2B parameters constitutes a practical contribution to sparse model scaling. The falsifiable predictions implicit in the DMFT-derived parameterization are a strength that could be tested more broadly.
major comments (1)
- [DMFT Analysis] The derivation of the parameterization relies on continuous or infinite-width limits that approximate away the discrete top-k routing, softmax selection, and per-expert load-balancing dynamics present for finite expert counts. Because these discrete effects are central to MoE behavior, their omission risks making the predicted hyperparameter scaling laws inaccurate outside the specific regimes tested, undermining the claim that the parameterization enables reliable transfer rather than merely fitting the observed data.
minor comments (1)
- [Abstract] The abstract and introduction could more explicitly state the precise functional form of the proposed parameterization (e.g., how learning rate or other HPs scale with expert count) to allow readers to reproduce the transfer rule without consulting the full DMFT derivation.
Simulated Author's Rebuttal
We thank the referee for the careful review and insightful comments on our DMFT-derived parameterization for MoE hyperparameter transfer. We address the major comment point by point below.
- Referee: DMFT Analysis section: the derivation of the parameterization relies on continuous or infinite-width limits that approximate away the discrete top-k routing, softmax selection, and per-expert load-balancing dynamics present for finite expert counts. Because these discrete effects are central to MoE behavior, their omission risks making the predicted hyperparameter scaling laws inaccurate outside the specific regimes tested, undermining the claim that the parameterization enables reliable transfer rather than merely fitting the observed data.
  Authors: We acknowledge that the DMFT derivation employs continuous approximations and infinite-width limits that smooth over discrete routing mechanics such as top-k selection, softmax normalization, and explicit load-balancing. These are standard simplifications in mean-field analyses to obtain closed-form scaling predictions. The resulting parameterization is intended to capture leading-order behavior in the large-model regime relevant to hyperparameter transfer. Our empirical results demonstrate reliable transfer from 51M to over 2B parameters across variations in width, depth, expert count, and expert size at fixed token budget, which provides evidence that the scaling rules remain effective despite the approximations. We will add a dedicated subsection in the revised manuscript discussing the validity range of the DMFT assumptions, including when discrete effects may cause deviations, and will include additional experiments with smaller expert counts to illustrate the boundaries.
  Revision: partial
Circularity Check
DMFT analysis supplies independent theoretical justification; no reduction to inputs by construction
The paper presents a novel DMFT analysis as the source of the proposed parameterization for HP scaling with width, depth, expert count, and expert size. This derivation is described as first-principles and is then validated empirically on transfer from 51M to >2B parameter models at fixed token budget. No equations or sections in the provided text reduce the claimed predictions to fitted parameters from the target regime, self-citations that bear the central load, or ansatzes smuggled via prior work by the same authors. The DMFT step therefore supplies independent content rather than tautological renaming or statistical forcing.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Dynamical mean-field theory provides an accurate description of hyperparameter scaling in MoE-augmented transformers.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Our parameterization is justified by a novel dynamical mean-field theory (DMFT) analysis... three-level mean-field hierarchy: residual stream representations are mean-field over expert outputs, which are themselves mean-field over individual expert neurons.
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, theorem absolute_floor_iff_bare_distinguishability (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Scaling Rule 3... $\sigma_{\mathrm{init}}(W^{(i)}_{\mathrm{down}}) = \alpha_{\mathrm{ffn}}^{-1}\, n_{\mathrm{embd}}^{-1/2}$, $\eta(W^{(i)}_{\mathrm{down}}) = \alpha_{\mathrm{ffn}}^{-1}\, n_{\mathrm{embd}}^{-1}$
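The Scaling Rule 3 excerpt quoted above can be read as a recipe for the initialization standard deviation and learning rate of each expert's down-projection. The following is a minimal sketch of that reading, not the authors' implementation: the excerpt does not define α_ffn or name the optimizer, so `alpha_ffn` is treated here as a given positive scalar, and the function name plus the `base_sigma` and `base_lr` prefactors are hypothetical additions.

```python
import math


def down_proj_hparams(alpha_ffn, n_embd, base_sigma=1.0, base_lr=1.0):
    """Init std and learning rate for an expert down-projection W_down^(i).

    Follows the quoted form
        sigma_init = alpha_ffn**-1 * n_embd**-0.5
        eta        = alpha_ffn**-1 * n_embd**-1
    `base_sigma` and `base_lr` are hypothetical tunable prefactors that are
    not part of the excerpt; with the defaults of 1.0 the function returns
    the bare scaling factors.
    """
    if alpha_ffn <= 0 or n_embd <= 0:
        raise ValueError("alpha_ffn and n_embd must be positive")
    sigma_init = base_sigma / (alpha_ffn * math.sqrt(n_embd))
    eta = base_lr / (alpha_ffn * n_embd)
    return sigma_init, eta
```

Read this way, doubling `n_embd` at fixed `alpha_ffn` shrinks the initialization scale by a factor of √2 and the learning rate by a factor of 2, while the prefactors tuned on a small model would be carried over unchanged.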
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 6 Pith papers
- How to Scale Mixture-of-Experts: From muP to the Maximally Scale-Stable Parameterization
  The authors derive a Maximally Scale-Stable Parameterization (MSSP) for MoE models that achieves robust learning-rate transfer and monotonic performance gains with scale across co-scaling regimes of width, experts, an...
- Hyperparameter Transfer for Dense Associative Memories
  Explicit scaling prescriptions for hyperparameters in DenseAMs are derived from model dynamics and shown to match empirical results across scales.
- Spectral Dynamics in Deep Networks: Feature Learning, Outlier Escape, and Learning Rate Transfer
  A two-level DMFT predicts width-consistent outlier escape and hyperparameter transfer under μP in deep networks, with bulk restructuring dominating for tasks with many outputs.
- Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate
  A framework quantifies hyperparameter transfer via scaling-law fit quality, extrapolation robustness, and loss penalty, with ablations showing that μP's advantage over standard parameterization stems from maximizing t...
- Spectral Dynamics in Deep Networks: Feature Learning, Outlier Escape, and Learning Rate Transfer
  A two-level DMFT tracks bulk and outlier spectral dynamics in wide networks, predicting width-consistent outlier growth and hyperparameter transfer under muP scaling for deep linear nets while noting bulk restructurin...
- There Will Be a Scientific Theory of Deep Learning
  A mechanics of the learning process is emerging in deep learning theory, characterized by dynamics, coarse statistics, and falsifiable predictions across idealized settings, limits, laws, hyperparameters, and universa...