Proposes a three-term scaling law for model size, training steps and batch size that recovers optimal batch size scaling and can be fitted using fewer runs by incorporating suboptimal batch sizes.
Title resolution pending
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.LG 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
How to Allocate Your Tokens? Scaling Laws with Training Steps and Batch Size
Proposes a three-term scaling law for model size, training steps and batch size that recovers optimal batch size scaling and can be fitted using fewer runs by incorporating suboptimal batch sizes.