pith. machine review for the scientific record.

arxiv: 2605.14200 · v1 · submitted 2026-05-13 · 💻 cs.LG · stat.ML

Recognition: 2 theorem links


How to Scale Mixture-of-Experts: From muP to the Maximally Scale-Stable Parameterization

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 04:42 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords mixture of experts · scaling laws · muP · parameterization · dynamical mean field theory · learning rate transfer · SGD · Adam

The pith

Mixture-of-Experts models require a Maximally Scale-Stable Parameterization to restore learning-rate transfer and monotonic gains at scale.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper analyzes three scaling regimes for MoE models: co-scaling width with expert width, co-scaling width with number of experts and sparsity, and full proportional scaling of all four quantities. It derives a Dynamical Mean Field Theory description of limiting dynamics and shows that standard muP fails to guarantee stable behavior because of scale-dependent terms in the expert aggregation step. From this, the authors extract the Maximally Scale-Stable Parameterization (MSSP) for both SGD and Adam, which produces qualitatively different limits and restores the desired scaling properties. Experiments confirm that MSSP yields reliable learning-rate transfer and steady improvement with size across regimes.
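
The paper's exact equations are not reproduced on this page, but the locus of the problem can be sketched on a generic top-K gated MoE block (illustrative notation only; the symbol $h^l_t$ matches the aggregated activation referenced in Figure 3 below):

$h^l_t(x) = \sum_{e \in \mathrm{TopK}(x)} g_e(x)\, E_e(x; W^l_e), \qquad g_e(x) = \mathrm{softmax}(r(x))_e.$

As the expert count $M$, sparsity $K$, and expert width $N_e$ grow, the magnitude of this sum and of its per-step updates depends on how the router, the expert weights, and the learning rate are scaled. A parameterization has to cancel those scale factors for $h^l_t$ and its updates to remain $\Theta(1)$ in every co-scaling regime; the paper's claim is that muP leaves scale-dependent terms in exactly this aggregation step, and MSSP is derived to remove them.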

Core claim

The muP parameterization does not reliably induce monotonic improvement with scale or robust learning-rate transfer in Mixture-of-Experts models because scale-dependent observables appear in the aggregation dynamics. The Maximally Scale-Stable Parameterization (MSSP), derived by imposing maximal scale stability desiderata instead, yields distinct limiting dynamics that support stable scaling in all three regimes of width, expert width, expert count, and sparsity.

What carries the argument

Dynamical Mean Field Theory (DMFT) descriptions of the limiting training dynamics of MoE models in each of the three scaling regimes, used to derive the unique parameterization satisfying all maximal scale stability conditions.
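
For reference, the three regimes invoked throughout this review are the ones defined in the abstract, restated here in its notation (network width $N$, expert width $N_e$, number of experts $M$, sparsity $K$):

  • Regime I: $N \asymp N_e$ (width co-scaled with expert width).
  • Regime II: $N \asymp M \asymp K$, with $N_e \in \Theta(1)$ (width co-scaled with expert count and sparsity).
  • Regime III: $N \asymp N_e \asymp M \asymp K$ (full proportional scaling of all four quantities).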

If this is right

  • Learning rates chosen at small scale transfer directly to much larger MoE models without retuning.
  • Performance improves steadily as width, expert width, number of experts, or depth is increased.
  • A single set of scaling rules now covers width, depth, expert width, and number of experts for both SGD and Adam.
  • The same MSSP prescription works uniformly across the three identified scaling regimes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same aggregation instability may appear in other sparsely activated or modular networks, suggesting MSSP could apply beyond standard MoE.
  • Combining MSSP with existing depth-scaling rules produces a practical end-to-end recipe that practitioners can use without per-scale retuning.
  • Future work could test whether MSSP remains stable when expert width and number of experts grow at rates not covered by the three regimes analyzed here.

Load-bearing premise

The Dynamical Mean Field Theory accurately captures the scale-dependent observables that appear in the aggregation dynamics of Mixture-of-Experts models across all three scaling regimes.

What would settle it

An experiment in which an MSSP-trained MoE shows either loss of learning-rate transfer or non-monotonic performance when the number of experts is increased while holding other dimensions fixed.
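
A hedged sketch of how that falsification test could be organized: hold $N$, $N_e$, and $K$ fixed, grow the number of experts $M$, and for each $M$ sweep the learning rate. MSSP predicts that the best learning rate stays put and the best achievable loss improves monotonically; either failure would settle the question against the paper. The callables build_moe and train below are hypothetical placeholders, not the paper's code.

    def settle_it(build_moe, train, expert_counts, lrs,
                  N=512, N_e=512, K=8, steps=2000):
        """Sweep the number of experts M at fixed N, N_e, K under a fixed
        parameterization; record the best learning rate and loss for each M."""
        best = {}
        for M in sorted(expert_counts):
            losses = {lr: train(build_moe("MSSP", N=N, N_e=N_e, M=M, K=K), lr, steps)
                      for lr in lrs}
            lr_star = min(losses, key=losses.get)
            best[M] = (lr_star, losses[lr_star])

        lr_stars = [best[M][0] for M in sorted(expert_counts)]
        loss_stars = [best[M][1] for M in sorted(expert_counts)]

        # The paper is challenged if the optimal LR drifts with M (no LR transfer)
        # or if the best loss stops improving as M grows (non-monotonic scaling).
        lr_drifts = len(set(lr_stars)) > 1
        monotone = all(b <= a for a, b in zip(loss_stars, loss_stars[1:]))
        return best, lr_drifts, monotone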

Figures

Figures reproduced from arXiv: 2605.14200 by Alessandro Breccia, Leena Chennuru Vankadara, Luke Hayward, Moritz Haas, Sebastian Bordt.

Figure 1. µP does not reliably improve performance with scale in MoEs. MSSP recovers monotonic improvement and delivers LR transfer across MoE co-scaling regimes. Left: Across optimizers and regimes, MSSP (solid lines) outperforms µP (dashed lines) at large scale for MLP MoEs on TinyImageNet. Right: LR transfer in validation loss for GPT MoEs trained with Adam in MSSP for 2.5B tokens when co-scaling width and number…
Figure 2. Transformer-MoE architecture and the MoE block under the three scaling regimes.
Figure 3. Delayed learning and scale-dependent dynamics of µP resolved by MSSP (SGD, Regime II). Training loss (left) is worse at large scale in µP (the darker, the wider), but monotonically improves in MSSP. Scaling exponents of sub-terms of the aggregated MoE activations h^l_t (right) are approximately 0 in all time steps in MSSP, signaling scale-independent training dynamics. In µP, initially vanishing sub-terms …
Figure 4. Consistent exponents in MSSP, but not µP (SGD, Regime II).
Figure 5. Robust LR transfer in MSSP, but not µP. Top-5 training accuracy of MLP MoEs trained with SGD on TinyImageNet. The optimal learning rate often grows in µP, saturating at the maximal stable learning rate with degrading performance at large scale. MSSP recovers learning rate transfer and monotonic improvements with increasing scale.
Figure 6. Learning rate transfer in Transformers. Validation loss for GPT MoEs trained with Adam in µP and MSSP for 2.5B tokens in Regime II (N, M, K → ∞, Ne ∈ Θ(1), left) and Regime III (N, Ne, M, K → ∞, right). Observe LR transfer and monotonic improvement with scale in MSSP.
read the original abstract

Recent frontier large language models predominantly rely on Mixture-of-Experts (MoE) architectures. Despite empirical progress, there is still no principled understanding of how hyperparameters should scale with network width $N$, expert width $N_e$, number of experts $M$, sparsity $K$, and depth $L$ to ensure both stability and optimal performance at scale. We take a principled step toward resolving this gap by analyzing three different scaling regimes: (I) co-scaling $N\asymp N_e$, (II) co-scaling $N\asymp M\asymp K$, and (III) full proportional scaling of $N, N_e, M$, and $K$. For each regime, we develop a novel Dynamical Mean Field Theory (DMFT) description of the limiting training dynamics of MoEs that provides a formal foundation for our analysis. Within this framework, we derive the unique parameterization for SGD and Adam satisfying all maximal-update ($\mu$) desiderata. We then show that the resulting $\mu$P prescription does not reliably induce monotonic improvement with scale or robust learning-rate transfer. We trace these pathologies to scale-dependent observables in the aggregation dynamics, which motivates a refined set of desiderata that we term maximal scale stability. Guided by this principle, we derive a Maximally Scale-Stable Parameterization (MSSP) for both SGD and Adam in all three scaling regimes, and characterize the corresponding limiting dynamics - qualitatively distinct from the $\mu$P limit - through a separate DMFT analysis. Experiments verify that MSSP robustly recovers learning rate transfer and monotonic improvement with scale across regimes. Combined with existing depth-scaling theory, these results provide a complete scaling prescription for MoE architectures as a function of width, depth, expert width, and number of experts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript develops Dynamical Mean Field Theory (DMFT) descriptions of the limiting training dynamics for Mixture-of-Experts (MoE) models in three scaling regimes: (I) N ≍ Ne, (II) N ≍ M ≍ K, and (III) full proportional scaling of N, Ne, M, and K. It first derives the unique μP parameterization for SGD and Adam that satisfies maximal-update desiderata, then identifies pathologies in learning-rate transfer and monotonic improvement with scale. These are traced to scale-dependent observables in the router/expert aggregation step, motivating a refined set of maximal scale-stability desiderata. The authors derive the corresponding Maximally Scale-Stable Parameterization (MSSP) for both optimizers, characterize its distinct limiting dynamics via a second DMFT analysis, and report experiments showing that MSSP recovers robust learning-rate transfer and monotonic scaling gains across regimes. Combined with existing depth-scaling results, this supplies a complete hyperparameter prescription in terms of width, depth, expert width, and number of experts.

Significance. If the DMFT limits accurately capture the relevant observables and the experimental trends hold at larger scales, the work supplies the first principled scaling rule for MoE hyperparameters that simultaneously guarantees stability and optimal performance. The explicit derivation of both μP and MSSP limits, together with the experimental demonstration of improved transfer, would constitute a substantive advance for training frontier-scale sparse models.

major comments (2)
  1. [§3] §3 (DMFT derivation for regimes I–III): the manuscript provides no quantitative error bounds on the mean-field approximation nor direct numerical comparisons of DMFT-predicted statistics (router logit variance, expert activation fractions, or aggregation moments) against finite-N trajectories at the widths used in the experiments. Because the central claim is that MSSP removes the scale-dependent pathologies identified by DMFT, this validation step is load-bearing; without it the derived parameterization could be addressing an artifact of the infinite-width limit rather than the observed finite-scale behavior. (A toy measurement sketch of these observables follows the minor comments below.)
  2. [Experiments] Experimental section (verification of LR transfer and monotonic improvement): the reported runs should include explicit scaling curves for each regime separately, with the effective N, Ne, M, K values stated and a demonstration that performance continues to improve as these parameters approach the DMFT limit. The current aggregate claim that MSSP “robustly recovers” the desired properties across regimes cannot be assessed without these controls.
minor comments (2)
  1. Notation for the three regimes and the hyperparameters (N, Ne, M, K, L, sparsity) should be introduced once in the main text and used consistently in all figures and equations.
  2. Figure captions should state the precise optimizer, learning-rate schedule, and initialization variance used in each panel so that the MSSP versus μP comparison can be reproduced without consulting the appendix.
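
To make the first major comment concrete, here is a minimal, hedged sketch of the kind of finite-width measurement it calls for, on a toy top-K router with Gaussian inputs at initialization; the observables (router-logit variance, per-expert activation fraction) are generic MoE statistics, and nothing below reproduces the paper's DMFT predictions or actual architecture.

    import numpy as np

    def router_stats(N, M, K, n_tokens=4096, seed=0):
        """Toy top-K router at initialization: Gaussian tokens and router weights
        with 1/sqrt(N) scaling. Returns the empirical router-logit variance and
        the fraction of tokens routed to each expert."""
        rng = np.random.default_rng(seed)
        x = rng.standard_normal((n_tokens, N))          # token activations
        W_r = rng.standard_normal((N, M)) / np.sqrt(N)  # router weights
        logits = x @ W_r                                # shape (n_tokens, M)

        topk = np.argpartition(-logits, K - 1, axis=1)[:, :K]   # K largest per token
        counts = np.bincount(topk.ravel(), minlength=M)

        return logits.var(), counts / n_tokens          # fractions sum to K

    # Repeating this across widths (and, with a real model, across training steps)
    # and overlaying the DMFT-predicted values is the comparison the report asks for.
    for N in (128, 512, 2048):
        var, frac = router_stats(N, M=64, K=8)
        print(f"N={N}: logit variance {var:.3f}, max expert load {frac.max():.3f}")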

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive review. The comments highlight important opportunities to strengthen the validation of our DMFT analysis and the presentation of experimental results. We address each point below and describe the revisions we will make.

read point-by-point responses
  1. Referee: [§3] §3 (DMFT derivation for regimes I–III): the manuscript provides no quantitative error bounds on the mean-field approximation nor direct numerical comparisons of DMFT-predicted statistics (router logit variance, expert activation fractions, or aggregation moments) against finite-N trajectories at the widths used in the experiments. Because the central claim is that MSSP removes the scale-dependent pathologies identified by DMFT, this validation step is load-bearing; without it the derived parameterization could be addressing an artifact of the infinite-width limit rather than the observed finite-scale behavior.

    Authors: We agree that direct numerical validation of the DMFT predictions is important for supporting our central claims. In the revised manuscript we will add explicit comparisons of DMFT-predicted statistics—including router logit variance, expert activation fractions, and aggregation moments—against finite-N trajectories at the widths used in our experiments. These comparisons will demonstrate that the scale-dependent pathologies identified by DMFT are present in finite-scale training and are mitigated by MSSP. While deriving rigorous quantitative error bounds on the mean-field approximation for this setting is a substantial open theoretical question that lies beyond the scope of the present work, we will explicitly note this limitation and rely on the empirical convergence evidence to confirm that the parameterization addresses observed finite-scale behavior rather than an infinite-width artifact. revision: partial

  2. Referee: [Experiments] Experimental section (verification of LR transfer and monotonic improvement): the reported runs should include explicit scaling curves for each regime separately, with the effective N, Ne, M, K values stated and a demonstration that performance continues to improve as these parameters approach the DMFT limit. The current aggregate claim that MSSP “robustly recovers” the desired properties across regimes cannot be assessed without these controls.

    Authors: We appreciate this recommendation, which will improve the transparency and interpretability of the experimental results. In the revised manuscript we will include separate scaling curves for each of the three regimes (I, II, and III). For every regime we will explicitly state the effective values of N, Ne, M, and K employed and plot performance metrics as these quantities increase toward the DMFT scaling limit. These per-regime plots will demonstrate both robust learning-rate transfer and continued monotonic improvement under MSSP, allowing readers to evaluate the claims independently for each scaling regime. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation chain is self-contained via DMFT analysis

full rationale

The paper develops novel DMFT descriptions for the three MoE scaling regimes (I-III) and derives both the μP parameterization and the refined MSSP from the limiting dynamics and the maximal scale stability desiderata. No load-bearing step reduces the final parameterization to a fitted quantity defined by the paper's own data, a self-citation chain, or a self-definitional loop. The pathologies in μP are identified analytically from the DMFT observables, and MSSP is constructed to satisfy the new desiderata within the same framework. Experimental verification is presented separately and does not feed back into the derivation. This is a standard theoretical derivation without circular elements.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the applicability of DMFT to MoE aggregation dynamics and the validity of the maximal scale stability desiderata as the correct refinement of muP; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption Dynamical Mean Field Theory provides an accurate description of the limiting training dynamics of MoE models in the three scaling regimes
    Invoked to derive both the muP pathologies and the MSSP rules.

pith-pipeline@v0.9.0 · 5647 in / 1302 out tokens · 39416 ms · 2026-05-15T04:42:00.533798+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

300 extracted references · 300 canonical work pages · 12 internal anchors
