Hyperparameter Transfer with Mixture-of-Expert Layers

Blake Bordelon; Boris Hanin; Cengiz Pehlevan; Tianze Jiang

arxiv: 2601.20205 · v3 · pith:DV3RCMSGnew · submitted 2026-01-28 · 💻 cs.LG

Tianze Jiang, Blake Bordelon, Cengiz Pehlevan, Boris Hanin

Pith reviewed 2026-05-22 12:31 UTC · model grok-4.3

classification 💻 cs.LG

keywords mixture of experts · hyperparameter transfer · model scaling · transformers · dynamical mean-field theory · sparse models · parameter efficiency

The pith

A new parameterization for mixture-of-experts transformers enables reliable hyperparameter transfer when scaling models from 51 million to over 2 billion parameters at fixed token budgets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a parameterization for the hyperparameters of transformer models that include mixture-of-experts layers. This choice is supported by a dynamical mean-field theory analysis that tracks how hyperparameters should behave when width, depth, expert count, or expert size changes while the total number of training tokens stays constant. Experiments show that the same hyperparameter values work across models ranging from 51 million to more than 2 billion total parameters. If the parameterization holds, hyperparameter tuning can be done on small models and then applied directly to much larger ones, lowering the cost of training sparse networks. The authors further demonstrate that hyperparameters found from short training runs on small models remain effective for longer runs on bigger models.

Core claim

The authors claim that a parameterization for the hyperparameters of MoE-augmented transformers, obtained from dynamical mean-field theory, keeps optimal values stable when model width, depth, number of experts, and expert size are varied at fixed token budget. This stability produces reliable hyperparameter transfer in practice from 51M-parameter models up to models exceeding 2B total parameters, and it permits hyperparameters identified on small models with short token horizons to be used successfully for larger models trained on longer horizons.

What carries the argument

The proposed parameterization of hyperparameters for mixture-of-experts layers in transformers, derived from dynamical mean-field theory predictions of scaling behavior across widths, depths, expert counts, and expert sizes.

If this is right

Hyperparameters tuned on small MoE models can be used without change on larger models trained at the same token budget.
Optimal hyperparameters stay consistent when the number or hidden size of experts is increased.
Short-horizon sweeps performed on small models yield hyperparameters that work for long-horizon training of large models.
The computational expense of hyperparameter search for large-scale MoE training is reduced because full-scale retuning is no longer required.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same style of parameterization might reduce retuning costs for other sparse activation schemes that decouple total and active parameters.
Dynamical mean-field theory could supply scaling prescriptions for additional architectural choices such as attention head count or residual connections.
Adopting the parameterization may make hyperparameter selection more standardized across different MoE library implementations.

Load-bearing premise

The dynamical mean-field theory analysis accurately predicts how optimal hyperparameters scale with changes in model width, depth, number of experts, and expert size.

What would settle it

Finding that the best learning rate or other hyperparameters must be retuned when the number of experts is increased from 8 to 64 while the total token count is held fixed would falsify the transfer claim.
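The check described above can be sketched as a comparison of the loss-minimizing learning rate across two sweeps. This is only an illustration: the function names, the one-octave tolerance, and the synthetic quadratic loss curves are all assumptions introduced here, not results or procedures from the paper.

```python
import math

def best_lr(sweep):
    """Return the learning rate with the lowest final loss in a {lr: loss} sweep."""
    return min(sweep, key=sweep.get)

def lr_shift_log2(sweep_a, sweep_b):
    """Shift of the optimal learning rate from sweep_a to sweep_b, in powers of two."""
    return math.log2(best_lr(sweep_b) / best_lr(sweep_a))

def transfer_holds(sweep_a, sweep_b, tol_log2=1.0):
    """True if the two optima agree within tol_log2 octaves (hypothetical tolerance)."""
    return abs(lr_shift_log2(sweep_a, sweep_b)) <= tol_log2

def synthetic_sweep(opt_lr, grid):
    """Placeholder loss curve, quadratic in log2(lr); NOT data from the paper."""
    return {lr: 3.0 + 0.1 * math.log2(lr / opt_lr) ** 2 for lr in grid}

grid = [2.0 ** k for k in range(-12, -5)]          # shared learning-rate grid
sweep_8 = synthetic_sweep(2.0 ** -9, grid)         # stands in for 8 experts
sweep_64_same = synthetic_sweep(2.0 ** -9, grid)   # optimum unchanged at 64 experts
sweep_64_shift = synthetic_sweep(2.0 ** -6, grid)  # optimum moved by three octaves

print(transfer_holds(sweep_8, sweep_64_same))   # optimum did not move
print(transfer_holds(sweep_8, sweep_64_shift))  # optimum moved beyond the tolerance
```

In a real test the two dictionaries would hold measured final losses from runs at 8 and 64 experts with the token count held fixed. A shift well beyond the grid spacing would count against the transfer claim.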

Original abstract

Mixture-of-Experts (MoE) layers have emerged as an important tool in scaling up modern neural networks by decoupling total trainable parameters from activated parameters in the forward pass for each token. However, sparse MoEs add complexity to training due to (i) new trainable parameters (router weights) that, like all other parameter groups, require hyperparameter (HP) tuning; (ii) new architecture scale dimensions (number of and size of experts) that must be chosen and potentially taken large. To make HP selection cheap and reliable, we propose a new parameterization for transformer models with MoE layers when scaling model width, depth, number of experts, and expert (hidden) size. Our parameterization is justified by a novel dynamical mean-field theory (DMFT) analysis. When varying different model dimensions trained at a fixed token budget, we find empirically that our parameterization enables reliable HP transfer across models from 51M to over 2B total parameters. We further take HPs identified from sweeping small models on a short token horizon to train larger models on longer horizons and report performant model behaviors.
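To illustrate the decoupling of total from activated parameters that the abstract mentions, here is a rough count for a single MoE feed-forward layer. The layer layout is an assumption (two bias-free weight matrices per expert plus a linear router), and the sizes in the example are hypothetical rather than configurations from the paper.

```python
def moe_ffn_params(n_embd, n_hidden, n_experts, top_k):
    """Count weights in one MoE feed-forward layer under the assumed layout.

    Returns (total, active): all trainable weights, and the weights a single
    token touches when it is routed to top_k experts.
    """
    per_expert = 2 * n_embd * n_hidden   # up-projection + down-projection
    router = n_embd * n_experts          # one routing logit per expert
    total = router + n_experts * per_expert
    active = router + top_k * per_expert
    return total, active

# Hypothetical sizes, chosen only for illustration.
total, active = moe_ffn_params(n_embd=512, n_hidden=2048, n_experts=8, top_k=2)
print(total, active)  # 16781312 4198400
```

Adding experts grows `total` while `active` changes only through the router term, which is why expert count and expert size act as scale dimensions separate from width and depth.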

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper gives a DMFT-derived parameterization that transfers hyperparameters reliably across MoE scaling dimensions from 51M to over 2B parameters at fixed token budget.

The main thing to know is that the authors have come up with a parameterization for hyperparameters in MoE-equipped transformers that transfers well when scaling several dimensions at once, backed by dynamical mean-field theory. They show that with this approach, you can take optimal settings from models around 50 million parameters and apply them to ones over 2 billion, keeping the token budget fixed. They also demonstrate moving from short training runs on small models to longer ones on larger models.

What is new here is tying the parameterization directly to a DMFT analysis that handles the joint scaling of width, depth, number of experts, and expert hidden size. This goes beyond standard scaling laws by providing a specific form for the hyperparameters.

The paper does a good job on the empirical side. The range from 51M to 2B is meaningful, and reporting successful transfer to longer horizons is useful for practitioners who can't afford full sweeps at scale.

On the soft spots, the DMFT derivation likely involves some approximations for the routing mechanism. The concern about discrete top-k selection and load balancing is reasonable, since standard DMFT often uses continuous limits. If the paper does not provide explicit checks for how these discrete effects influence the scaling predictions at finite expert numbers, the theoretical support could be weaker than the empirical results suggest. That said, the experiments cover a good range of scales, so the transfer claim looks practically solid even if the theory needs more validation.

This work is for researchers focused on scaling sparse models efficiently. Anyone dealing with hyperparameter selection for large MoEs will get value from the parameterization and the transfer results. It also has something for those interested in mean-field approaches to neural nets. The paper shows clear thinking in connecting theory to practice. It deserves a serious referee because it tackles a concrete problem with both analysis and large-scale tests. I recommend sending this out for peer review.

Referee Report

1 major / 1 minor

Summary. The paper proposes a parameterization for hyperparameters in transformer models incorporating Mixture-of-Experts (MoE) layers. The parameterization is derived from a novel dynamical mean-field theory (DMFT) analysis and is claimed to enable reliable hyperparameter transfer when scaling model width, depth, number of experts, and expert size at fixed token budget. Empirical results show successful transfer across models ranging from 51M to over 2B total parameters, and further demonstrate that hyperparameters identified on small models with short token horizons can be transferred to larger models trained on longer horizons while maintaining performant behavior.

Significance. If the central claim holds, the work would meaningfully reduce the cost of hyperparameter tuning for large-scale MoE models by enabling transfer from smaller-scale experiments. The novel application of DMFT to derive scaling rules for MoE-specific dimensions (expert count and size) alongside empirical validation up to 2B parameters constitutes a practical contribution to sparse model scaling. The falsifiable predictions implicit in the DMFT-derived parameterization are a strength that could be tested more broadly.

major comments (1)

[DMFT Analysis] DMFT Analysis section: the derivation of the parameterization relies on continuous or infinite-width limits that approximate away the discrete top-k routing, softmax selection, and per-expert load-balancing dynamics present for finite expert counts. Because these discrete effects are central to MoE behavior, their omission risks making the predicted hyperparameter scaling laws inaccurate outside the specific regimes tested, undermining the claim that the parameterization enables reliable transfer rather than merely fitting the observed data.
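The discreteness at issue can be seen in a generic top-k router, sketched below. This is a common textbook convention (a softmax renormalized over the selected logits) and is an assumption here; the paper's actual router and load-balancing scheme are not specified in this review. The function name and example logits are hypothetical.

```python
import math

def top_k_route(logits, k):
    """Return {expert_index: gate_weight} for the k largest router logits.

    Gate weights are a softmax over the selected logits only (assumed convention).
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)                 # subtract max for stability
    exps = {i: math.exp(logits[i] - m) for i in top}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}

# A small continuous change in the logits flips the discrete expert choice.
before = top_k_route([1.00, 0.99, -2.0, 0.5], k=1)
after = top_k_route([0.99, 1.00, -2.0, 0.5], k=1)
print(sorted(before), sorted(after))  # [0] [1]
```

A perturbation of 0.01 in two logits sends the token to a different expert. A continuous mean-field limit smooths over this kind of jump, which is the effect the referee asks the authors to check at finite expert counts.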

minor comments (1)

[Abstract] The abstract and introduction could more explicitly state the precise functional form of the proposed parameterization (e.g., how learning rate or other HPs scale with expert count) to allow readers to reproduce the transfer rule without consulting the full DMFT derivation.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful review and insightful comments on our DMFT-derived parameterization for MoE hyperparameter transfer. We address the major comment point by point below.

Referee: DMFT Analysis section: the derivation of the parameterization relies on continuous or infinite-width limits that approximate away the discrete top-k routing, softmax selection, and per-expert load-balancing dynamics present for finite expert counts. Because these discrete effects are central to MoE behavior, their omission risks making the predicted hyperparameter scaling laws inaccurate outside the specific regimes tested, undermining the claim that the parameterization enables reliable transfer rather than merely fitting the observed data.

Authors: We acknowledge that the DMFT derivation employs continuous approximations and infinite-width limits that smooth over discrete routing mechanics such as top-k selection, softmax normalization, and explicit load-balancing. These are standard simplifications in mean-field analyses to obtain closed-form scaling predictions. The resulting parameterization is intended to capture leading-order behavior in the large-model regime relevant to hyperparameter transfer. Our empirical results demonstrate reliable transfer from 51M to over 2B parameters across variations in width, depth, expert count, and expert size at fixed token budget, which provides evidence that the scaling rules remain effective despite the approximations. We will add a dedicated subsection in the revised manuscript discussing the validity range of the DMFT assumptions, including when discrete effects may cause deviations, and will include additional experiments with smaller expert counts to illustrate the boundaries.

revision: partial

Circularity Check

0 steps flagged

DMFT analysis supplies independent theoretical justification; no reduction to inputs by construction

The paper presents a novel DMFT analysis as the source of the proposed parameterization for HP scaling with width, depth, expert count, and expert size. This derivation is described as first-principles and is then validated empirically on transfer from 51M to >2B parameter models at fixed token budget. No equations or sections in the provided text reduce the claimed predictions to fitted parameters from the target regime, self-citations that bear the central load, or ansatzes smuggled via prior work by the same authors. The DMFT step therefore supplies independent content rather than tautological renaming or statistical forcing.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the applicability of dynamical mean-field theory to MoE training dynamics and on the assumption that hyperparameter optima transfer under the proposed rescaling when token budget is held fixed.

axioms (1)

Domain assumption: Dynamical mean-field theory provides an accurate description of hyperparameter scaling in MoE-augmented transformers.
Invoked to justify the new parameterization when scaling width, depth, number of experts, and expert size.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem.

Our parameterization is justified by a novel dynamical mean-field theory (DMFT) analysis... three-level mean-field hierarchy: residual stream representations are mean-field over expert outputs, which are themselves mean-field over individual expert neurons.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear

Relation between the paper passage and the cited Recognition theorem.

Scaling Rule 3... $\sigma_{\mathrm{init}}(W^{(i)}_{\mathrm{down}}) = \alpha_{\mathrm{ffn}}^{-1}\, n_{\mathrm{embd}}^{-1/2}$, $\eta(W^{(i)}_{\mathrm{down}}) = \alpha_{\mathrm{ffn}}^{-1}\, n_{\mathrm{embd}}^{-1}$
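As a worked reading of the quoted Scaling Rule 3, the sketch below evaluates the initialization scale and learning rate of an expert's down-projection, σ_init = α_ffn⁻¹ · n_embd^(−1/2) and η = α_ffn⁻¹ · n_embd⁻¹. The excerpt does not define α_ffn, so it is treated here as a given positive scalar. The `base_sigma` and `base_lr` multipliers are hypothetical additions standing in for whatever constants would be tuned on a small model.

```python
import math

def down_proj_hparams(alpha_ffn, n_embd, base_sigma=1.0, base_lr=1.0):
    """Init std and learning rate for an expert down-projection per the quoted rule.

    base_sigma and base_lr are hypothetical width-independent multipliers;
    alpha_ffn is taken as given because the excerpt does not define it.
    """
    sigma_init = base_sigma / (alpha_ffn * math.sqrt(n_embd))
    lr = base_lr / (alpha_ffn * n_embd)
    return sigma_init, lr

small = down_proj_hparams(alpha_ffn=4, n_embd=256)
large = down_proj_hparams(alpha_ffn=4, n_embd=1024)
print(small)  # (0.015625, 0.0009765625)
print(large)  # (0.0078125, 0.000244140625)
```

Quadrupling `n_embd` halves the initialization scale and quarters the learning rate. The multipliers stay fixed, which is the sense in which they would carry over from a small model to a large one under this rule.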

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

How to Scale Mixture-of-Experts: From muP to the Maximally Scale-Stable Parameterization
cs.LG · 2026-05 · unverdicted · novelty 7.0
The authors derive a Maximally Scale-Stable Parameterization (MSSP) for MoE models that achieves robust learning-rate transfer and monotonic performance gains with scale across co-scaling regimes of width, experts, an...

Hyperparameter Transfer for Dense Associative Memories
cs.LG · 2026-05 · unverdicted · novelty 7.0
Explicit scaling prescriptions for hyperparameters in DenseAMs are derived from model dynamics and shown to match empirical results across scales.

Spectral Dynamics in Deep Networks: Feature Learning, Outlier Escape, and Learning Rate Transfer
cond-mat.dis-nn · 2026-05 · unverdicted · novelty 7.0
A two-level DMFT predicts width-consistent outlier escape and hyperparameter transfer under μP in deep networks, with bulk restructuring dominating for tasks with many outputs.

Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate
cs.LG · 2026-05 · unverdicted · novelty 6.0
A framework quantifies hyperparameter transfer via scaling-law fit quality, extrapolation robustness, and loss penalty, with ablations showing that μP's advantage over standard parameterization stems from maximizing t...

Spectral Dynamics in Deep Networks: Feature Learning, Outlier Escape, and Learning Rate Transfer
cond-mat.dis-nn · 2026-05 · unverdicted · novelty 6.0
A two-level DMFT tracks bulk and outlier spectral dynamics in wide networks, predicting width-consistent outlier growth and hyperparameter transfer under muP scaling for deep linear nets while noting bulk restructurin...

There Will Be a Scientific Theory of Deep Learning
stat.ML · 2026-04 · unverdicted · novelty 2.0
A mechanics of the learning process is emerging in deep learning theory, characterized by dynamics, coarse statistics, and falsifiable predictions across idealized settings, limits, laws, hyperparameters, and universa...