Muon: An optimizer for hidden layers in neural networks
1 Pith paper cites this work. Polarity classification is still indexing.
fields: cs.LG (1)
years: 2026 (1)
verdicts: UNVERDICTED (1)
representative citing papers
- How to Allocate Your Tokens? Scaling Laws with Training Steps and Batch Size
Proposes a three-term scaling law for model size, training steps and batch size that recovers optimal batch size scaling and can be fitted using fewer runs by incorporating suboptimal batch sizes.