pith. machine review for the scientific record.

arxiv: 2605.14200 · v1 · submitted 2026-05-13 · 💻 cs.LG · stat.ML

Recognition: 2 theorem links


How to Scale Mixture-of-Experts: From muP to the Maximally Scale-Stable Parameterization

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 04:42 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords mixture of experts · scaling laws · muP · parameterization · dynamical mean field theory · learning rate transfer · SGD · Adam

The pith

Mixture-of-Experts models require a Maximally Scale-Stable Parameterization to restore learning-rate transfer and monotonic gains at scale.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper analyzes three scaling regimes for MoE models: co-scaling width with expert width, co-scaling width with number of experts and sparsity, and full proportional scaling of all four quantities. It derives a Dynamical Mean Field Theory description of limiting dynamics and shows that standard muP fails to guarantee stable behavior because of scale-dependent terms in the expert aggregation step. From this, the authors extract the Maximally Scale-Stable Parameterization (MSSP) for both SGD and Adam, which produces qualitatively different limits and restores the desired scaling properties. Experiments confirm that MSSP yields reliable learning-rate transfer and steady improvement with size across regimes.
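
The paper's exact equations are not reproduced on this page, but the locus of the problem can be sketched on a generic top-K gated MoE block (illustrative notation only; the symbol $h^l_t$ matches the aggregated activation referenced in Figure 3 below):

$h^l_t(x) = \sum_{e \in \mathrm{TopK}(x)} g_e(x)\, E_e(x; W^l_e), \qquad g_e(x) = \mathrm{softmax}(r(x))_e.$

As the expert count $M$, sparsity $K$, and expert width $N_e$ grow, the magnitude of this sum and of its per-step updates depends on how the router, the expert weights, and the learning rate are scaled. A parameterization has to cancel those scale factors for $h^l_t$ and its updates to remain $\Theta(1)$ in every co-scaling regime; the paper's claim is that muP leaves scale-dependent terms in exactly this aggregation step, and MSSP is derived to remove them.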

Core claim

The muP parameterization does not reliably induce monotonic improvement with scale or robust learning-rate transfer in Mixture-of-Experts models because scale-dependent observables appear in the aggregation dynamics. The Maximally Scale-Stable Parameterization (MSSP), derived by imposing maximal scale stability desiderata instead, yields distinct limiting dynamics that support stable scaling in all three regimes of width, expert width, expert count, and sparsity.

What carries the argument

Dynamical Mean Field Theory (DMFT) descriptions of the limiting training dynamics of MoE models in each of the three scaling regimes, used to derive the unique parameterization satisfying all maximal scale stability conditions.
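
For reference, the three regimes invoked throughout this review are the ones defined in the abstract, restated here in its notation (network width $N$, expert width $N_e$, number of experts $M$, sparsity $K$):

  • Regime I: $N \asymp N_e$ (width co-scaled with expert width).
  • Regime II: $N \asymp M \asymp K$, with $N_e \in \Theta(1)$ (width co-scaled with expert count and sparsity).
  • Regime III: $N \asymp N_e \asymp M \asymp K$ (full proportional scaling of all four quantities).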

If this is right

  • Learning rates chosen at small scale transfer directly to much larger MoE models without retuning.
  • Performance improves steadily as width, expert width, number of experts, or depth is increased.
  • A single set of scaling rules now covers width, depth, expert width, and number of experts for both SGD and Adam.
  • The same MSSP prescription works uniformly across the three identified scaling regimes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same aggregation instability may appear in other sparsely activated or modular networks, suggesting MSSP could apply beyond standard MoE.
  • Combining MSSP with existing depth-scaling rules produces a practical end-to-end recipe that practitioners can use without per-scale retuning.
  • Future work could test whether MSSP remains stable when expert width and number of experts grow at rates not covered by the three regimes analyzed here.

Load-bearing premise

The Dynamical Mean Field Theory accurately captures the scale-dependent observables that appear in the aggregation dynamics of Mixture-of-Experts models across all three scaling regimes.

What would settle it

An experiment in which an MSSP-trained MoE shows either loss of learning-rate transfer or non-monotonic performance when the number of experts is increased while holding other dimensions fixed.
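
A hedged sketch of how that falsification test could be organized: hold $N$, $N_e$, and $K$ fixed, grow the number of experts $M$, and for each $M$ sweep the learning rate. MSSP predicts that the best learning rate stays put and the best achievable loss improves monotonically; either failure would settle the question against the paper. The callables build_moe and train below are hypothetical placeholders, not the paper's code.

    def settle_it(build_moe, train, expert_counts, lrs,
                  N=512, N_e=512, K=8, steps=2000):
        """Sweep the number of experts M at fixed N, N_e, K under a fixed
        parameterization; record the best learning rate and loss for each M."""
        best = {}
        for M in sorted(expert_counts):
            losses = {lr: train(build_moe("MSSP", N=N, N_e=N_e, M=M, K=K), lr, steps)
                      for lr in lrs}
            lr_star = min(losses, key=losses.get)
            best[M] = (lr_star, losses[lr_star])

        lr_stars = [best[M][0] for M in sorted(expert_counts)]
        loss_stars = [best[M][1] for M in sorted(expert_counts)]

        # The paper is challenged if the optimal LR drifts with M (no LR transfer)
        # or if the best loss stops improving as M grows (non-monotonic scaling).
        lr_drifts = len(set(lr_stars)) > 1
        monotone = all(b <= a for a, b in zip(loss_stars, loss_stars[1:]))
        return best, lr_drifts, monotone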

Figures

Figures reproduced from arXiv: 2605.14200 by Alessandro Breccia, Leena Chennuru Vankadara, Luke Hayward, Moritz Haas, Sebastian Bordt.

Figure 1. µP does not reliably improve performance with scale in MoEs. MSSP recovers monotonic improvement and delivers LR transfer across MoE co-scaling regimes. Left: Across optimizers and regimes, MSSP (solid lines) outperforms µP (dashed lines) at large scale for MLP MoEs on TinyImageNet. Right: LR transfer in validation loss for GPT MoEs trained with Adam in MSSP for 2.5B tokens when co-scaling width and number…
Figure 2. Transformer-MoE architecture and the MoE block under the three scaling regimes.
Figure 3. Delayed learning and scale-dependent dynamics of µP resolved by MSSP (SGD, Regime II). Training loss (left) is worse at large scale in µP (the darker, the wider), but monotonically improves in MSSP. Scaling exponents of sub-terms of the aggregated MoE activations h^l_t (right) are approximately 0 in all time steps in MSSP, signaling scale-independent training dynamics. In µP, initially vanishing sub-terms …
Figure 4. Consistent exponents in MSSP, but not µP (SGD, Regime II).
Figure 5. Robust LR transfer in MSSP, but not µP. Top-5 training accuracy of MLP MoEs trained with SGD on TinyImageNet. The optimal learning rate often grows in µP, saturating at the maximal stable learning rate with degrading performance at large scale. MSSP recovers learning rate transfer and monotonic improvements with increasing scale.
Figure 6. Learning rate transfer in Transformers. Validation loss for GPT MoEs trained with Adam in µP and MSSP for 2.5B tokens in Regime II (N, M, K → ∞, Ne ∈ Θ(1), left) and Regime III (N, Ne, M, K → ∞, right). Observe LR transfer and monotonic improvement with scale in MSSP.
read the original abstract

Recent frontier large language models predominantly rely on Mixture-of-Experts (MoE) architectures. Despite empirical progress, there is still no principled understanding of how hyperparameters should scale with network width $N$, expert width $N_e$, number of experts $M$, sparsity $K$, and depth $L$ to ensure both stability and optimal performance at scale. We take a principled step toward resolving this gap by analyzing three different scaling regimes: (I) co-scaling $N\asymp N_e$, (II) co-scaling $N\asymp M\asymp K$, and (III) full proportional scaling of $N, N_e, M$, and $K$. For each regime, we develop a novel Dynamical Mean Field Theory (DMFT) description of the limiting training dynamics of MoEs that provides a formal foundation for our analysis. Within this framework, we derive the unique parameterization for SGD and Adam satisfying all maximal-update ($\mu$) desiderata. We then show that the resulting $\mu$P prescription does not reliably induce monotonic improvement with scale or robust learning-rate transfer. We trace these pathologies to scale-dependent observables in the aggregation dynamics, which motivates a refined set of desiderata that we term maximal scale stability. Guided by this principle, we derive a Maximally Scale-Stable Parameterization (MSSP) for both SGD and Adam in all three scaling regimes, and characterize the corresponding limiting dynamics - qualitatively distinct from the $\mu$P limit - through a separate DMFT analysis. Experiments verify that MSSP robustly recovers learning rate transfer and monotonic improvement with scale across regimes. Combined with existing depth-scaling theory, these results provide a complete scaling prescription for MoE architectures as a function of width, depth, expert width, and number of experts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript develops Dynamical Mean Field Theory (DMFT) descriptions of the limiting training dynamics for Mixture-of-Experts (MoE) models in three scaling regimes: (I) N ≍ Ne, (II) N ≍ M ≍ K, and (III) full proportional scaling of N, Ne, M, and K. It first derives the unique μP parameterization for SGD and Adam that satisfies maximal-update desiderata, then identifies pathologies in learning-rate transfer and monotonic improvement with scale. These are traced to scale-dependent observables in the router/expert aggregation step, motivating a refined set of maximal scale-stability desiderata. The authors derive the corresponding Maximally Scale-Stable Parameterization (MSSP) for both optimizers, characterize its distinct limiting dynamics via a second DMFT analysis, and report experiments showing that MSSP recovers robust learning-rate transfer and monotonic scaling gains across regimes. Combined with existing depth-scaling results, this supplies a complete hyperparameter prescription in terms of width, depth, expert width, and number of experts.

Significance. If the DMFT limits accurately capture the relevant observables and the experimental trends hold at larger scales, the work supplies the first principled scaling rule for MoE hyperparameters that simultaneously guarantees stability and optimal performance. The explicit derivation of both μP and MSSP limits, together with the experimental demonstration of improved transfer, would constitute a substantive advance for training frontier-scale sparse models.

major comments (2)
  1. [§3] §3 (DMFT derivation for regimes I–III): the manuscript provides no quantitative error bounds on the mean-field approximation nor direct numerical comparisons of DMFT-predicted statistics (router logit variance, expert activation fractions, or aggregation moments) against finite-N trajectories at the widths used in the experiments. Because the central claim is that MSSP removes the scale-dependent pathologies identified by DMFT, this validation step is load-bearing; without it the derived parameterization could be addressing an artifact of the infinite-width limit rather than the observed finite-scale behavior. (A toy measurement sketch of these observables follows the minor comments below.)
  2. [Experiments] Experimental section (verification of LR transfer and monotonic improvement): the reported runs should include explicit scaling curves for each regime separately, with the effective N, Ne, M, K values stated and a demonstration that performance continues to improve as these parameters approach the DMFT limit. The current aggregate claim that MSSP “robustly recovers” the desired properties across regimes cannot be assessed without these controls.
minor comments (2)
  1. Notation for the three regimes and the hyperparameters (N, Ne, M, K, L, sparsity) should be introduced once in the main text and used consistently in all figures and equations.
  2. Figure captions should state the precise optimizer, learning-rate schedule, and initialization variance used in each panel so that the MSSP versus μP comparison can be reproduced without consulting the appendix.
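
To make the first major comment concrete, here is a minimal, hedged sketch of the kind of finite-width measurement it calls for, on a toy top-K router with Gaussian inputs at initialization; the observables (router-logit variance, per-expert activation fraction) are generic MoE statistics, and nothing below reproduces the paper's DMFT predictions or actual architecture.

    import numpy as np

    def router_stats(N, M, K, n_tokens=4096, seed=0):
        """Toy top-K router at initialization: Gaussian tokens and router weights
        with 1/sqrt(N) scaling. Returns the empirical router-logit variance and
        the fraction of tokens routed to each expert."""
        rng = np.random.default_rng(seed)
        x = rng.standard_normal((n_tokens, N))          # token activations
        W_r = rng.standard_normal((N, M)) / np.sqrt(N)  # router weights
        logits = x @ W_r                                # shape (n_tokens, M)

        topk = np.argpartition(-logits, K - 1, axis=1)[:, :K]   # K largest per token
        counts = np.bincount(topk.ravel(), minlength=M)

        return logits.var(), counts / n_tokens          # fractions sum to K

    # Repeating this across widths (and, with a real model, across training steps)
    # and overlaying the DMFT-predicted values is the comparison the report asks for.
    for N in (128, 512, 2048):
        var, frac = router_stats(N, M=64, K=8)
        print(f"N={N}: logit variance {var:.3f}, max expert load {frac.max():.3f}")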

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive review. The comments highlight important opportunities to strengthen the validation of our DMFT analysis and the presentation of experimental results. We address each point below and describe the revisions we will make.

read point-by-point responses
  1. Referee: [§3] §3 (DMFT derivation for regimes I–III): the manuscript provides no quantitative error bounds on the mean-field approximation nor direct numerical comparisons of DMFT-predicted statistics (router logit variance, expert activation fractions, or aggregation moments) against finite-N trajectories at the widths used in the experiments. Because the central claim is that MSSP removes the scale-dependent pathologies identified by DMFT, this validation step is load-bearing; without it the derived parameterization could be addressing an artifact of the infinite-width limit rather than the observed finite-scale behavior.

    Authors: We agree that direct numerical validation of the DMFT predictions is important for supporting our central claims. In the revised manuscript we will add explicit comparisons of DMFT-predicted statistics—including router logit variance, expert activation fractions, and aggregation moments—against finite-N trajectories at the widths used in our experiments. These comparisons will demonstrate that the scale-dependent pathologies identified by DMFT are present in finite-scale training and are mitigated by MSSP. While deriving rigorous quantitative error bounds on the mean-field approximation for this setting is a substantial open theoretical question that lies beyond the scope of the present work, we will explicitly note this limitation and rely on the empirical convergence evidence to confirm that the parameterization addresses observed finite-scale behavior rather than an infinite-width artifact. revision: partial

  2. Referee: [Experiments] Experimental section (verification of LR transfer and monotonic improvement): the reported runs should include explicit scaling curves for each regime separately, with the effective N, Ne, M, K values stated and a demonstration that performance continues to improve as these parameters approach the DMFT limit. The current aggregate claim that MSSP “robustly recovers” the desired properties across regimes cannot be assessed without these controls.

    Authors: We appreciate this recommendation, which will improve the transparency and interpretability of the experimental results. In the revised manuscript we will include separate scaling curves for each of the three regimes (I, II, and III). For every regime we will explicitly state the effective values of N, Ne, M, and K employed and plot performance metrics as these quantities increase toward the DMFT scaling limit. These per-regime plots will demonstrate both robust learning-rate transfer and continued monotonic improvement under MSSP, allowing readers to evaluate the claims independently for each scaling regime. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation chain is self-contained via DMFT analysis

full rationale

The paper develops novel DMFT descriptions for the three MoE scaling regimes (I-III) and derives both the μP parameterization and the refined MSSP from the limiting dynamics and the maximal scale stability desiderata. No load-bearing step reduces the final parameterization to a fitted quantity defined by the paper's own data, a self-citation chain, or a self-definitional loop. The pathologies in μP are identified analytically from the DMFT observables, and MSSP is constructed to satisfy the new desiderata within the same framework. Experimental verification is presented separately and does not feed back into the derivation. This is a standard theoretical derivation without circular elements.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the applicability of DMFT to MoE aggregation dynamics and the validity of the maximal scale stability desiderata as the correct refinement of muP; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption Dynamical Mean Field Theory provides an accurate description of the limiting training dynamics of MoE models in the three scaling regimes
    Invoked to derive both the muP pathologies and the MSSP rules.

pith-pipeline@v0.9.0 · 5647 in / 1302 out tokens · 39416 ms · 2026-05-15T04:42:00.533798+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

300 extracted references · 300 canonical work pages · 12 internal anchors
