How to Scale Mixture-of-Experts: From muP to the Maximally Scale-Stable Parameterization
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-15 04:42 UTC · model grok-4.3
The pith
Mixture-of-Experts models require a Maximally Scale-Stable Parameterization to restore learning-rate transfer and monotonic gains at scale.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The muP parameterization does not reliably induce monotonic improvement with scale or robust learning-rate transfer in Mixture-of-Experts models because scale-dependent observables appear in the aggregation dynamics. The Maximally Scale-Stable Parameterization (MSSP), derived by imposing maximal scale stability desiderata instead, yields distinct limiting dynamics that support stable scaling in all three regimes, spanning width, expert width, expert count, and sparsity.
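For orientation, the place such observables enter can be sketched with a generic sparse-MoE layer written in an abc-style parameterization. The notation below is illustrative, not the paper's: the moments of the sum over the K active experts depend on how the multiplier exponents (and the per-layer learning-rate exponents) are chosen relative to N, N_e, M, and K, and MSSP is the choice the paper argues keeps those aggregation observables scale-independent.

```latex
% Schematic sparse-MoE layer under a generic abc-style parameterization
% (illustrative notation, not the paper's): N = model width, N_e = expert width,
% M = number of experts, K = active experts per token, T_K(x) = top-K routed set.
y(x) \;=\; \sum_{i \in \mathcal{T}_K(x)} g_i(x)\, E_i(x),
\qquad
g(x) \;=\; \operatorname{softmax}\!\big(N^{-a_r}\, W_r\, x\big)\ \text{restricted to}\ \mathcal{T}_K(x),
\qquad
E_i(x) \;=\; N_e^{-a_o}\, W_i^{(2)}\, \phi\!\big(N^{-a_h}\, W_i^{(1)} x\big).
```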
What carries the argument
Dynamical Mean Field Theory (DMFT) descriptions of the limiting training dynamics of MoE models in each of the three scaling regimes, used to derive the unique parameterization satisfying all maximal scale stability conditions.
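As a rough guide to what such a description tracks, generic mean-field treatments of wide-network training reduce the per-neuron dynamics to a single-site stochastic process whose statistics close self-consistently on deterministic correlation and response functions; the convention below follows that literature, not the paper's own notation, and the MoE regimes would additionally carry expert-level observables such as routing fractions.

```latex
% Schematic DMFT order parameters (generic convention, not the paper's notation):
% feature correlation and response functions at layer \ell across training times t, s,
% which become deterministic in the scaling limit.
C_\ell(t,s) \;=\; \lim_{N\to\infty} \tfrac{1}{N}\, h_\ell(t)^{\top} h_\ell(s),
\qquad
R_\ell(t,s) \;=\; \lim_{N\to\infty} \tfrac{1}{N} \sum_{j=1}^{N}
\frac{\partial h_{\ell,j}(t)}{\partial \varepsilon_{\ell,j}(s)},
```

where \varepsilon is an infinitesimal perturbing field conjugate to the preactivations.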
If this is right
- Learning rates chosen at small scale transfer directly to much larger MoE models without retuning.
- Performance improves steadily as width, expert width, number of experts, or depth is increased.
- A single set of scaling rules now covers width, depth, expert width, and number of experts for both SGD and Adam (a sketch of how such rules are applied in practice follows this list).
- The same MSSP prescription works uniformly across the three identified scaling regimes.
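A minimal sketch of the "tune once at a small base shape, rescale everywhere" workflow that such rules enable. The exponents and shapes below are placeholders, not the paper's MSSP values; only the mechanics are the point.

```python
# Hypothetical hyperparameter-transfer helper for an MoE. The exponents are
# placeholders, NOT the paper's MSSP prescription: the point is only the
# mechanics of tuning once at a base shape and rescaling deterministically.
from dataclasses import dataclass


@dataclass
class MoEShape:
    width: int         # N
    expert_width: int  # N_e
    num_experts: int   # M
    top_k: int         # K


def scaled_lr(base: MoEShape, target: MoEShape, base_lr: float,
              lr_exp_width: float = -1.0, lr_exp_experts: float = 0.0) -> float:
    """Rescale a learning rate tuned at `base` to a larger `target` shape.

    lr = base_lr * (N_t/N_b)**lr_exp_width * (M_t/M_b)**lr_exp_experts
    The exponents are per-parameterization quantities (muP, MSSP, ...) and are
    left as free arguments here.
    """
    lr = base_lr
    lr *= (target.width / base.width) ** lr_exp_width
    lr *= (target.num_experts / base.num_experts) ** lr_exp_experts
    return lr


# Tune once at a proxy scale, then reuse the rule at every larger shape.
base = MoEShape(width=256, expert_width=256, num_experts=8, top_k=2)
for s in (1, 2, 4, 8):
    target = MoEShape(256 * s, 256 * s, 8 * s, 2)
    print(s, scaled_lr(base, target, base_lr=3e-3))
```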
Where Pith is reading between the lines
- The same aggregation instability may appear in other sparsely activated or modular networks, suggesting MSSP could apply beyond standard MoE.
- Combining MSSP with existing depth-scaling rules produces a practical end-to-end recipe that practitioners can use without per-scale retuning.
- Future work could test whether MSSP remains stable when expert width and number of experts grow at rates not covered by the three regimes analyzed here.
Load-bearing premise
The Dynamical Mean Field Theory accurately captures the scale-dependent observables that appear in the aggregation dynamics of Mixture-of-Experts models across all three scaling regimes.
What would settle it
An experiment in which an MSSP-trained MoE shows either loss of learning-rate transfer or non-monotonic performance when the number of experts is increased while holding other dimensions fixed.
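A minimal harness for that falsification test, under the assumption that one can train an MSSP-parameterized MoE at each expert count. The `train_and_eval` function is a toy stand-in kept only so the script executes; a real test would replace it with actual training runs.

```python
# Hypothetical falsification harness: hold N, N_e, K fixed, grow the number of
# experts M, sweep the learning rate, and check (a) the best LR stays put and
# (b) the best loss improves monotonically. The toy loss surface below is a
# stand-in so the script runs; replace it with real MSSP training runs.
import numpy as np


def train_and_eval(width, expert_width, num_experts, top_k, lr, seed=0):
    """Toy stand-in for an MSSP-parameterized MoE training run."""
    rng = np.random.default_rng(seed + num_experts)
    lr_star = 3e-3  # toy: a scale-independent optimal learning rate
    return (np.log10(lr / lr_star)) ** 2 + 1.0 / np.log2(num_experts) \
        + 0.01 * rng.standard_normal()


lrs = np.logspace(-4, -1, 13)
expert_counts = [8, 16, 32, 64, 128]
best = {}
for M in expert_counts:
    losses = [train_and_eval(1024, 1024, M, 2, lr) for lr in lrs]
    i = int(np.argmin(losses))
    best[M] = (lrs[i], losses[i])

lr_stars = [best[M][0] for M in expert_counts]
loss_stars = [best[M][1] for M in expert_counts]
lr_transfer = max(lr_stars) / min(lr_stars) < 2.0          # optimal LR barely moves
monotone = all(a >= b for a, b in zip(loss_stars, loss_stars[1:]))
print("LR transfer:", lr_transfer, "| monotonic improvement:", monotone)
```

Either check failing under MSSP, with the other dimensions held fixed, would be the kind of result that undercuts the claim.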
Figures
read the original abstract
Recent frontier large language models predominantly rely on Mixture-of-Experts (MoE) architectures. Despite empirical progress, there is still no principled understanding of how hyperparameters should scale with network width $N$, expert width $N_e$, number of experts $M$, sparsity $K$, and depth $L$ to ensure both stability and optimal performance at scale. We take a principled step toward resolving this gap by analyzing three different scaling regimes: (I) co-scaling $N\asymp N_e$, (II) co-scaling $N\asymp M\asymp K$, and (III) full proportional scaling of $N, N_e, M$, and $K$. For each regime, we develop a novel Dynamical Mean Field Theory (DMFT) description of the limiting training dynamics of MoEs that provides a formal foundation for our analysis. Within this framework, we derive the unique parameterization for SGD and Adam satisfying all maximal-update ($\mu$) desiderata. We then show that the resulting $\mu$P prescription does not reliably induce monotonic improvement with scale or robust learning-rate transfer. We trace these pathologies to scale-dependent observables in the aggregation dynamics, which motivates a refined set of desiderata that we term maximal scale stability. Guided by this principle, we derive a Maximally Scale-Stable Parameterization (MSSP) for both SGD and Adam in all three scaling regimes, and characterize the corresponding limiting dynamics - qualitatively distinct from the $\mu$P limit - through a separate DMFT analysis. Experiments verify that MSSP robustly recovers learning rate transfer and monotonic improvement with scale across regimes. Combined with existing depth-scaling theory, these results provide a complete scaling prescription for MoE architectures as a function of width, depth, expert width, and number of experts.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript develops Dynamical Mean Field Theory (DMFT) descriptions of the limiting training dynamics for Mixture-of-Experts (MoE) models in three scaling regimes: (I) N ≍ Ne, (II) N ≍ M ≍ K, and (III) full proportional scaling of N, Ne, M, and K. It first derives the unique μP parameterization for SGD and Adam that satisfies maximal-update desiderata, then identifies pathologies in learning-rate transfer and monotonic improvement with scale. These are traced to scale-dependent observables in the router/expert aggregation step, motivating a refined set of maximal scale-stability desiderata. The authors derive the corresponding Maximally Scale-Stable Parameterization (MSSP) for both optimizers, characterize its distinct limiting dynamics via a second DMFT analysis, and report experiments showing that MSSP recovers robust learning-rate transfer and monotonic scaling gains across regimes. Combined with existing depth-scaling results, this supplies a complete hyperparameter prescription in terms of width, depth, expert width, and number of experts.
Significance. If the DMFT limits accurately capture the relevant observables and the experimental trends hold at larger scales, the work supplies the first principled scaling rule for MoE hyperparameters that simultaneously guarantees stability and optimal performance. The explicit derivation of both μP and MSSP limits, together with the experimental demonstration of improved transfer, would constitute a substantive advance for training frontier-scale sparse models.
major comments (2)
- [§3] §3 (DMFT derivation for regimes I–III): the manuscript provides no quantitative error bounds on the mean-field approximation nor direct numerical comparisons of DMFT-predicted statistics (router logit variance, expert activation fractions, or aggregation moments) against finite-N trajectories at the widths used in the experiments. Because the central claim is that MSSP removes the scale-dependent pathologies identified by DMFT, this validation step is load-bearing; without it the derived parameterization could be addressing an artifact of the infinite-width limit rather than the observed finite-scale behavior.
- [Experiments] Experimental section (verification of LR transfer and monotonic improvement): the reported runs should include explicit scaling curves for each regime separately, with the effective N, Ne, M, K values stated and a demonstration that performance continues to improve as these parameters approach the DMFT limit. The current aggregate claim that MSSP “robustly recovers” the desired properties across regimes cannot be assessed without these controls.
minor comments (2)
- Notation for the three regimes and the hyperparameters (N, Ne, M, K, L, sparsity) should be introduced once in the main text and used consistently in all figures and equations.
- Figure captions should state the precise optimizer, learning-rate schedule, and initialization variance used in each panel so that the MSSP versus μP comparison can be reproduced without consulting the appendix.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive review. The comments highlight important opportunities to strengthen the validation of our DMFT analysis and the presentation of experimental results. We address each point below and describe the revisions we will make.
read point-by-point responses
- Referee: [§3] §3 (DMFT derivation for regimes I–III): the manuscript provides no quantitative error bounds on the mean-field approximation nor direct numerical comparisons of DMFT-predicted statistics (router logit variance, expert activation fractions, or aggregation moments) against finite-N trajectories at the widths used in the experiments. Because the central claim is that MSSP removes the scale-dependent pathologies identified by DMFT, this validation step is load-bearing; without it the derived parameterization could be addressing an artifact of the infinite-width limit rather than the observed finite-scale behavior.
Authors: We agree that direct numerical validation of the DMFT predictions is important for supporting our central claims. In the revised manuscript we will add explicit comparisons of DMFT-predicted statistics—including router logit variance, expert activation fractions, and aggregation moments—against finite-N trajectories at the widths used in our experiments. These comparisons will demonstrate that the scale-dependent pathologies identified by DMFT are present in finite-scale training and are mitigated by MSSP. While deriving rigorous quantitative error bounds on the mean-field approximation for this setting is a substantial open theoretical question that lies beyond the scope of the present work, we will explicitly note this limitation and rely on the empirical convergence evidence to confirm that the parameterization addresses observed finite-scale behavior rather than an infinite-width artifact. revision: partial
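A minimal sketch of the finite-N check being promised in this response: measure router statistics (logit variance, expert activation fractions) at increasing width and see whether they settle toward a scale-independent value. The toy below looks only at a random-feature router at initialization, with a fan-in 1/sqrt(N) scaling chosen for illustration; the paper's version would track the same observables along real training trajectories and against the DMFT predictions.

```python
# Toy finite-N measurement of router observables at initialization only.
# The 1/sqrt(N) router scaling is an illustrative assumption, not the paper's.
import numpy as np


def router_stats(N, M=64, K=8, n_tokens=2048, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_tokens, N))              # O(1)-scale token features
    W_r = rng.standard_normal((N, M)) / np.sqrt(N)      # fan-in-scaled router weights
    logits = x @ W_r                                    # router logits, variance ~ O(1)
    topk = np.argpartition(-logits, K, axis=1)[:, :K]   # indices of K largest logits
    counts = np.bincount(topk.ravel(), minlength=M)
    load = counts / counts.sum()                        # expert activation fractions
    return float(logits.var()), float(load.max()), float(load.min())


for N in (256, 1024, 4096, 8192):
    var, lmax, lmin = router_stats(N)
    print(f"N={N:5d}  logit var={var:.4f}  max/min expert load={lmax:.4f}/{lmin:.4f}")
```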
- Referee: [Experiments] Experimental section (verification of LR transfer and monotonic improvement): the reported runs should include explicit scaling curves for each regime separately, with the effective N, Ne, M, K values stated and a demonstration that performance continues to improve as these parameters approach the DMFT limit. The current aggregate claim that MSSP “robustly recovers” the desired properties across regimes cannot be assessed without these controls.
Authors: We appreciate this recommendation, which will improve the transparency and interpretability of the experimental results. In the revised manuscript we will include separate scaling curves for each of the three regimes (I, II, and III). For every regime we will explicitly state the effective values of N, Ne, M, and K employed and plot performance metrics as these quantities increase toward the DMFT scaling limit. These per-regime plots will demonstrate both robust learning-rate transfer and continued monotonic improvement under MSSP, allowing readers to evaluate the claims independently for each scaling regime. revision: yes
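A minimal sketch of how those per-regime grids could be laid out so the effective N, N_e, M, K are explicit for every point. The base shape and scale factors are placeholders, and regime I holds M and K at their base values here as a simplifying assumption.

```python
# Hypothetical per-regime scaling grid; base shape and factors are placeholders.
BASE = dict(N=512, N_e=512, M=16, K=2)


def regime_I(s):    # (I) co-scale N and N_e (M, K held at base values here)
    return dict(N=BASE["N"] * s, N_e=BASE["N_e"] * s, M=BASE["M"], K=BASE["K"])


def regime_II(s):   # (II) co-scale N, M, K (N_e held at base value here)
    return dict(N=BASE["N"] * s, N_e=BASE["N_e"], M=BASE["M"] * s, K=BASE["K"] * s)


def regime_III(s):  # (III) proportional scaling of N, N_e, M, and K
    return dict(N=BASE["N"] * s, N_e=BASE["N_e"] * s, M=BASE["M"] * s, K=BASE["K"] * s)


for name, regime in [("I", regime_I), ("II", regime_II), ("III", regime_III)]:
    for s in (1, 2, 4, 8):
        cfg = regime(s)
        print(f"regime {name}, scale x{s}: {cfg}")
        # train an MSSP-parameterized MoE at cfg over a learning-rate sweep here
```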
Circularity Check
No significant circularity; derivation chain is self-contained via DMFT analysis
full rationale
The paper develops novel DMFT descriptions for the three MoE scaling regimes (I-III) and derives both the μP parameterization and the refined MSSP from the limiting dynamics and the maximal scale stability desiderata. No load-bearing step reduces the final parameterization to a fitted quantity defined by the paper's own data, a self-citation chain, or a self-definitional loop. The pathologies in μP are identified analytically from the DMFT observables, and MSSP is constructed to satisfy the new desiderata within the same framework. Experimental verification is presented separately and does not feed back into the derivation. This is a standard theoretical derivation without circular elements.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Dynamical Mean Field Theory provides an accurate description of the limiting training dynamics of MoE models in the three scaling regimes.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (tag: unclear)
  The relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "We trace these pathologies to scale-dependent observables in the aggregation dynamics... derive a Maximally Scale-Stable Parameterization (MSSP)... DMFT description of the limiting training dynamics"
- IndisputableMonolith/Foundation/DimensionForcing.lean: dimension_forcing_from_8tick (tag: unclear)
  The relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "three different scaling regimes: (I) co-scaling N≍Ne, (II) co-scaling N≍M≍K, (III) full proportional scaling"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.