arxiv: 2607.01487 · v1 · pith:WVPZ3RE4new · submitted 2026-07-01 · 💻 cs.LG · stat.ML

How to Allocate Your Tokens? Scaling Laws with Training Steps and Batch Size

Pith reviewed 2026-07-03 21:00 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords scaling laws · batch size · training steps · model size · token allocation · critical batch size · neural network training

The pith

A three-term scaling law separates training steps from batch size to recover optimal allocation scaling from suboptimal runs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a scaling law for neural network training loss that treats model size, number of training steps, and batch size as three separate inputs rather than combining steps and batch size into total tokens. The law is fitted to a collection of runs that deliberately includes many suboptimal batch sizes. Fitted this way, it recovers the observed scaling relationship between model size and the best batch size. Because it learns from non-optimal runs, it can be estimated reliably with far fewer total experiments than methods that rely only on optimal-batch data. The same fitted form also yields explicit scaling predictions for cases where batch size is held away from its optimum, and it matches earlier empirical patterns for the critical batch size.

Core claim

We propose a scaling law that takes into account model size and training data while explicitly splitting the latter into training steps and batch size (called three-term law). Fitting the proposed law on a large set of training runs, we find that it correctly recovers the scaling of the optimal batch size. Moreover, because it makes use of training runs with suboptimal batch size, our proposed law can be robustly fit with a significantly smaller amount of training runs. We further show that the three-term law can be used to derive scaling laws for suboptimal batch sizes, and that it matches previous empirical findings related to the critical batch size.

What carries the argument

The three-term scaling law, which expresses loss as a joint function of model size, training steps, and batch size treated as independent variables.
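
The paper's exact expression is not reproduced on this page, so the following is only an illustrative sketch. It assumes a purely additive power-law form with hypothetical constants E, A, B, C and exponents α, β, γ, and shows how treating steps K and batch size b as separate inputs yields an optimal batch size at a fixed token budget D = K·b. The law in the paper may differ in its terms and exponents.

```latex
% Assumed illustrative form (not taken from the paper):
L(N, K, b) = E + A N^{-\alpha} + B K^{-\beta} + C b^{-\gamma}

% Fix the token budget D = K b and substitute K = D / b:
L(N, D/b, b) = E + A N^{-\alpha} + B D^{-\beta} b^{\beta} + C b^{-\gamma}

% Setting the derivative with respect to b to zero:
\beta B D^{-\beta} b^{\beta-1} = \gamma C b^{-\gamma-1}
\;\Longrightarrow\;
b^\star = \left(\frac{\gamma C}{\beta B}\right)^{\frac{1}{\beta+\gamma}} D^{\frac{\beta}{\beta+\gamma}},
\qquad
K^\star = \frac{D}{b^\star} \propto D^{\frac{\gamma}{\beta+\gamma}}
```

Under this assumed form, the optimal batch size and the optimal step count both follow power laws in D, with exponents determined only by β and γ.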

If this is right

  • The law recovers the scaling of optimal batch size with model size and data volume.
  • Robust fitting is possible with significantly fewer training runs by including suboptimal batch sizes.
  • Scaling laws for any fixed suboptimal batch size can be derived from the three-term form.
  • The law reproduces prior empirical observations on critical batch size.
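
The link to critical batch size can be made concrete with a small sketch. It assumes an additive form L(N, K, b) = E + A·N^(-α) + B·K^(-β) + C·b^(-γ), which is an illustrative choice rather than the paper's stated law, and every constant in `PARAMS` is hypothetical. Solving that form for K gives the number of steps needed to reach a target loss at a given batch size.

```python
import math

# Hypothetical constants for an ASSUMED additive form (not the paper's fit):
#   L(N, K, b) = E + A*N**-alpha + B*K**-beta + C*b**-gamma
PARAMS = dict(E=1.8, A=400.0, alpha=0.34, B=30.0, beta=0.5, C=4.0, gamma=0.6)


def steps_to_target(target_loss, N, b, p=PARAMS):
    """Steps K needed to reach target_loss at batch size b (math.inf if unreachable)."""
    # Loss budget left for the step term once the constant, model-size
    # and batch-size terms have been subtracted from the target.
    slack = (target_loss - p["E"] - p["A"] * N ** -p["alpha"]
             - p["C"] * b ** -p["gamma"])
    if slack <= 0:
        return math.inf  # this batch size cannot reach the target at any K
    return (p["B"] / slack) ** (1.0 / p["beta"])


if __name__ == "__main__":
    N, target = 268e6, 2.6
    for b in (16, 64, 256, 1024, 4096):
        K = steps_to_target(target, N, b)
        print(f"b={b:5d}  steps={K:12.1f}  tokens={K * b:14.1f}")
```

In this toy setting, larger batches reduce the required steps with diminishing returns, while very small batches never reach the target, which is the qualitative trade-off that critical-batch-size analyses describe.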

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Token budgets could be allocated more efficiently by first fitting the law on cheap suboptimal runs and then predicting the best batch size.
  • The separation of steps and batch size may let the same functional form guide choices for other training hyperparameters such as learning-rate schedules.
  • Testing the law on architectures or data modalities outside the original runs would show whether the three-term structure is broadly applicable.
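
The first extension above, predicting the best batch size from a fitted law, might look like the following minimal sketch. It assumes an additive form L(N, K, b) = E + A·N^(-α) + B·K^(-β) + C·b^(-γ) with made-up constants; neither the form nor the numbers come from the paper. The closed-form optimum is cross-checked against a brute-force scan.

```python
import math

# ASSUMED illustrative form (the paper's actual law is not reproduced here):
#   L(N, K, b) = E + A*N**-alpha + B*K**-beta + C*b**-gamma
# All constants are hypothetical, chosen only to make the example run.
E, A, ALPHA = 1.8, 400.0, 0.34
B, BETA = 30.0, 0.5
C, GAMMA = 4.0, 0.6


def loss(N, K, b):
    return E + A * N ** -ALPHA + B * K ** -BETA + C * b ** -GAMMA


def optimal_batch(D):
    """Closed-form minimizer of loss(N, D / b, b) over b for a fixed token budget D."""
    scale = (GAMMA * C / (BETA * B)) ** (1 / (BETA + GAMMA))
    return scale * D ** (BETA / (BETA + GAMMA))


def grid_best_batch(N, D, n_points=2001):
    """Brute-force check: scan a log-spaced grid from 1 to 1e8 and keep the lowest loss."""
    candidates = [10 ** (8 * i / (n_points - 1)) for i in range(n_points)]
    return min(candidates, key=lambda b: loss(N, D / b, b))


if __name__ == "__main__":
    N = 268e6
    for D in (1e9, 1e10, 1e11):
        b_star = optimal_batch(D)
        print(f"D={D:.0e}  b*={b_star:10.1f}  steps={D / b_star:12.1f}  "
              f"grid b={grid_best_batch(N, D):10.1f}")
```

The grid scan is only a sanity check; with a real fitted law one would substitute the fitted constants and, if no closed form exists, keep the numerical search.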

Load-bearing premise

A single three-term functional form fitted to runs with varying batch sizes will correctly extrapolate to the optimal batch size regime without requiring separate data from optimal runs.

What would settle it

Dedicated experiments run at the batch sizes predicted to be optimal by the fitted law show loss values or scaling exponents that differ substantially from the law's forecasts.

Figures

Figures reproduced from arXiv: 2607.01487 by Fabian Schaipp.

Figure 1. (Left) Estimates for the M⋆-scaling coefficient γ/(β+γ) for 3TL and each 2TL. Shaded area depicts min and max over five cross-validation fits. (Right) Implied scaling of M⋆ according to (4). Shaded area depicts min and max over cross-validation. Dots show the empirically best batch size from the train (black) and validation split (blue).
Consistency across datasets. We run the same analysis on the OpenEuroLLM d…
Figure 2. MAD comparison of 2TL and 3TL on train (left) and validation (right) split.
4.2 Compute Savings Using the Three-term Law. Fitting a scaling law for M⋆ with the approach of Li et al. (2025a) imposes massive computational costs, as it requires obtaining the optimal batch size for a set of different token budgets D (and possibly also varying the model size N). Li et al. (2025a) report that producing their enti…
Figure 3. Fitting on a reduced dataset, with only 3 values of …
Figure 4. N = 268M. (Left) While the three-term law (3) accurately predicts optimal batch size, its predicted loss value for very large/small token budgets deviates from the empirical value. (Right) Empirical and predicted loss value across batch size b. Again, for very large/small token budgets the accuracy of the three-term law degrades. Dashed border marks datapoints not used for fitting 3TL.
Figure 5. N = 268M. (Left) Batch size range [bmin, bmax] with ε-suboptimal loss derived from law (8) (with ε such that less than 5% compute is wasted). Shaded area is obtained from fitting a power-law on the values of bmin and bmax in-sample (solid lines). (Right) Empirical and predicted loss value across batch size b. Here, the predicted values are from the law (8), fitted separately for each D. Black dotted lines ma…
Figure 6. Scaling of ε-suboptimal batch size across model sizes, for Li dataset (left) and OpenEuroLLM dataset (right). The scaling of suboptimal batch sizes [bmin, bmax] (grey area) is relatively consistent across the two datasets, after accounting for a factor of two due to the different sequence length.
Takeaway: Under the three-term law, the number of steps K_L̄(b) to reach a target loss L̄, as a function of the …
Figure 7. Under the three-term law, critical batch size changes with token budget
Figure 8. Compute-optimal model size.
… to other training setups or tasks. Further, although more sophisticated scaling law formulations can in principle collapse back to the Chinchilla form, the resulting scaling can be quite different (Section 4.5).
  • While the three-term law explicitly models the batch size, we still need the optimal learning rate for each single combination of (N, D, b); thus, despite our finding …
Figure 9. Same as …
Figure 10. Same as …
Figure 11. Same as …
Figure 12. Same as …
Figure 13. Overview of the Li dataset used for fitting scaling laws. Dots with dashed border are part of the validation set.
Figure 14. Illustration of the reduced dataset used in …
Figure 15. Overview of the full Li dataset (before learning-rate selection). Each heatmap represents the final loss over a grid of batch size b (y-axis) and learning rate η (x-axis) for a single combination of (N, D). Blue squares mark the optimal combination of (η, b), gray squares mark optimal learning rate for the given row of batch size. Note that most marked squares do not lie on the border, therefore indicatin…
Figure 16. Same as …
Figure 17. Same as …
Figure 18. Same as …
Figure 19. Same as …
Figure 20. Same as …
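
The captions of Figures 5 and 6 refer to a batch-size range [bmin, bmax] whose loss is ε-suboptimal. As a rough illustration of how such a range could be computed, the sketch below assumes a batch-dependent loss of the form B·K^(-β) + C·b^(-γ) with K = D/b and hypothetical constants. The paper's law (8) and its compute-based choice of ε are not reproduced here, so ε is simply an additive loss tolerance in this example.

```python
import math

# ASSUMED batch-dependent part of an additive loss (hypothetical constants):
#   B*K**-beta + C*b**-gamma  with  K = D / b
B, BETA = 30.0, 0.5
C, GAMMA = 4.0, 0.6


def batch_dependent_loss(D, b):
    return B * (D / b) ** -BETA + C * b ** -GAMMA


def optimal_batch(D):
    """Closed-form minimizer over b under the assumed form."""
    return (GAMMA * C / (BETA * B)) ** (1 / (BETA + GAMMA)) * D ** (BETA / (BETA + GAMMA))


def eps_suboptimal_range(D, eps, tol=1e-10):
    """Return (b_min, b_max) such that the loss stays within eps of its minimum over b."""
    b_star = optimal_batch(D)
    limit = batch_dependent_loss(D, b_star) + eps

    def bisect(inside, outside):
        # Invariant: loss(inside) <= limit < loss(outside); bisect in log space.
        lo, hi = math.log(inside), math.log(outside)
        while abs(hi - lo) > tol:
            mid = 0.5 * (lo + hi)
            if batch_dependent_loss(D, math.exp(mid)) <= limit:
                lo = mid
            else:
                hi = mid
        return math.exp(lo)

    # Walk outward from the optimum until the loss exceeds the limit on each side.
    small, large = b_star, b_star
    while batch_dependent_loss(D, small) <= limit:
        small /= 2
    while batch_dependent_loss(D, large) <= limit:
        large *= 2
    return bisect(b_star, small), bisect(b_star, large)


if __name__ == "__main__":
    for D in (1e9, 1e10, 1e11):
        b_min, b_max = eps_suboptimal_range(D, eps=0.01)
        print(f"D={D:.0e}  b_min={b_min:10.1f}  b*={optimal_batch(D):10.1f}  b_max={b_max:10.1f}")
```

The outward walk terminates because, under the assumed form, the loss grows without bound as b goes to zero or to infinity at fixed D.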

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a three-term scaling law that decomposes the training data term into separate contributions from training steps and batch size (in addition to model size). Fitting this functional form to a collection of runs that include suboptimal batch sizes is claimed to recover the scaling of the optimal batch size, to enable robust fitting with substantially fewer runs than would otherwise be required, to yield derived scaling laws for suboptimal regimes, and to match prior empirical results on critical batch size.

Significance. If the extrapolation from suboptimal to optimal regimes holds without regime-dependent misspecification, the approach would reduce the experimental cost of mapping optimal token allocation by allowing existing suboptimal runs to contribute to the fit, thereby extending the practical reach of scaling-law methodology.

major comments (2)
  1. [Abstract] Abstract: the central claim that the fitted law 'correctly recovers the scaling of the optimal batch size' is presented without any description of the fitting procedure, error bars, data-exclusion criteria, or confirmation that the functional form was not selected after inspection of the same data; these omissions make it impossible to evaluate whether the reported recovery is an independent prediction or a tautological consequence of the fit.
  2. [Experiments / fitting results] The manuscript provides no held-out validation or regime-specific stress test demonstrating that the three-term functional form remains an accurate local approximation when batch size approaches the critical regime; without such a test, the extrapolation from the mixture of suboptimal runs to the optimal-batch-size scaling cannot be taken as established.
minor comments (2)
  1. [Method] The exact algebraic expression for the three-term law (how the steps and batch-size terms are combined with the model-size term) should be stated explicitly, preferably as an equation in the main text.
  2. [Figures] Figure captions and axis labels should indicate whether plotted points are individual runs or aggregated statistics and whether error bars represent run-to-run variance or fit uncertainty.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The three-term scaling law is motivated by a decomposition of the data term that is independent of any particular dataset, and we address the concerns about presentation and validation below by clarifying the fitting details and committing to additional checks.

  1. Referee: [Abstract] Abstract: the central claim that the fitted law 'correctly recovers the scaling of the optimal batch size' is presented without any description of the fitting procedure, error bars, data-exclusion criteria, or confirmation that the functional form was not selected after inspection of the same data; these omissions make it impossible to evaluate whether the reported recovery is an independent prediction or a tautological consequence of the fit.

    Authors: We agree that the abstract is too terse on these points. The functional form follows directly from splitting the standard Chinchilla-style data term into separate step and batch-size contributions, a decomposition already implicit in prior critical-batch-size literature; it was not chosen by inspecting the current runs. In revision we will expand the abstract by one sentence to note that the fit uses ordinary least-squares on log-loss with the full set of runs (including suboptimal batch sizes), with error bars obtained via bootstrap, and will add a short methods paragraph summarizing the exact procedure, exclusion criteria (runs that failed to converge), and pre-specification of the form. The main text already contains the full fitting details and will be cross-referenced. revision: yes

  2. Referee: [Experiments / fitting results] The manuscript provides no held-out validation or regime-specific stress test demonstrating that the three-term functional form remains an accurate local approximation when batch size approaches the critical regime; without such a test, the extrapolation from the mixture of suboptimal runs to the optimal-batch-size scaling cannot be taken as established.

    Authors: This is a fair criticism. While the current experiments already span a wide range of batch sizes (including some near the critical regime), we did not perform an explicit held-out stress test restricted to near-optimal batch sizes. In the revised manuscript we will add such a validation: we will reserve a subset of runs whose batch sizes are within a factor of two of the fitted critical batch size, refit the three-term law on the remaining data, and report the extrapolation error on the held-out near-optimal points. We will also include a regime-specific plot of residuals versus distance to the critical batch size. If the added test reveals systematic misspecification we will qualify the claims accordingly. revision: yes
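
The simulated rebuttal mentions ordinary least squares on log-loss with bootstrap error bars. The sketch below shows that generic recipe on a single power law with synthetic data. It is not the authors' procedure: the data, the one-variable model, and the helper names `ols_slope_intercept` and `bootstrap_slope_interval` are all assumptions made for illustration.

```python
import math
import random


def ols_slope_intercept(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def bootstrap_slope_interval(xs, ys, n_boot=1000, level=0.9, seed=0):
    """Percentile bootstrap interval for the OLS slope, resampling (x, y) pairs."""
    rng = random.Random(seed)
    n = len(xs)
    slopes = []
    while len(slopes) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [xs[i] for i in idx]
        if len(set(bx)) < 2:  # degenerate resample: slope undefined, draw again
            continue
        slopes.append(ols_slope_intercept(bx, [ys[i] for i in idx])[0])
    slopes.sort()
    lo = slopes[int((1 - level) / 2 * n_boot)]
    hi = slopes[int((1 + level) / 2 * n_boot) - 1]
    return lo, hi


if __name__ == "__main__":
    # Synthetic stand-in data: loss = 5 * b**-0.3 with multiplicative noise.
    rng = random.Random(1)
    batch_sizes = [2 ** k for k in range(4, 14) for _ in range(3)]
    losses = [5 * b ** -0.3 * math.exp(rng.gauss(0, 0.02)) for b in batch_sizes]
    log_b = [math.log(b) for b in batch_sizes]
    log_l = [math.log(v) for v in losses]
    slope, intercept = ols_slope_intercept(log_b, log_l)
    print("fitted exponent:", slope)
    print("90% bootstrap interval:", bootstrap_slope_interval(log_b, log_l))
```

A log-log OLS fit is only linear for a pure power law; a law with an additive constant term would need a nonlinear fit, with the same resampling loop wrapped around it.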

Circularity Check

0 steps flagged

No significant circularity in the three-term scaling law derivation

The paper proposes an empirical three-term functional form incorporating model size, steps, and batch size, fits its parameters to a collection of training runs (explicitly including suboptimal batch sizes), and then uses the resulting fit to recover the scaling of optimal batch size, which is reported to match prior independent empirical results on critical batch size. No quoted equation or section reduces the optimal-batch-size prediction to a quantity already fixed by the fitted parameters themselves, nor does any step rely on a self-citation chain or imported uniqueness theorem that would make the central claim tautological. The extrapolation claim is therefore an empirical assertion subject to external validation rather than a self-contained re-expression of the input data.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are described. The law is presented as an empirical fit, implying coefficients are determined from data.

pith-pipeline@v0.9.1-grok · 5620 in / 1117 out tokens · 31339 ms · 2026-07-03T21:00:35.006294+00:00 · methodology

Reference graph

Works this paper leans on

32 extracted references · 14 canonical work pages · 3 internal anchors

  1. Amir Beck. First-order methods in optimization. 2017.

  2. Shane Bergsma, Nolan Dey, Gurpreet Gosal, Gavia Gray, Daria Soboleva, and Joel Hestness. Power Lines: Scaling laws for weight decay and batch size in … 2025.

  3. Chinchilla Scaling: A replication attempt. arXiv:2404.10102.

  4. Scaling Optimal LR Across Token Horizons. arXiv:2409.19913.

  5. A Foundation Model for the Earth System. arXiv:2405.13063.

  6. Damek Davis and Dmitriy Drusvyatskiy. Stochastic model-based minimization of weakly convex functions. SIAM Journal on Optimization, 2019.

  7. DeepSeek LLM: Scaling Open-Source Language Models with Longtermism. arXiv:2401.02954, 2024.

  8. Time Transfer: On Optimal Learning Rate and Batch Size In The Infinite Data Limit. arXiv:2410.05838.

  9. H… Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations. Advances in Neural Information Processing Systems, 2024.

  10. Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Thomas Hennigan, Eric Noland, Katherine Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, and … An empirical analysis of compute-optimal large language model training.

  11. On the Role of Batch Size in Stochastic Conditional Gradient Methods. arXiv:2603.21191.

  12. Scaling Laws for Neural Language Models. arXiv:2001.08361.

  13. Understanding Gradient Orthogonalization for Deep Learning via Non-Euclidean Trust-Region Optimization. arXiv:2503.12645.

  14. Predictable Scale: Part I -- Optimal Hyperparameter Scaling Law in Large Language Model Pretraining. arXiv:2503.04715.

  15. Margaret Li, Sneha Kudugunta, and Luke Zettlemoyer. International Conference on Learning Representations, 2025.

  16. Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Nikita Smetanin, Robert Verkuil, Ori Kabeli, Yaniv Shmueli, Allan dos Santos Costa, Maryam Fazel-Zarandi, Tom Sercu, Salvatore Candido, and Alexander Rives. Evolutionary-scale prediction of atomic-level protein structure with a language model.

  17. An Empirical Model of Large-Batch Training. arXiv:1812.06162.

  18. Tomer Porian, Mitchell Wortsman, Jenia Jitsev, Ludwig Schmidt, and Yair Carmon. Resolving Discrepancies in Compute-Optimal Scaling of Language Models. 2024.

  19. Dimitri von R… International Conference on Learning Representations, 2026.

  20. Fabian Schaipp. Topics in Stochastic Optimization: Learning with Implicit and Adaptive Steps.

  21. Fabian Schaipp and H… The Surprising Agreement Between Convex Optimization Theory and Learning-Rate Scheduling for Large Model Training. International Conference on Machine Learning, 2025.

  22. Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler. arXiv:2408.13359.

  23. Deriving Hyperparameter Scaling Laws via Modern Optimization Theory. arXiv:2603.15958.

  24. Xi Wang and Laurence Aitchison. How to set … 2025.

  25. Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling Vision Transformers.

  26. Hanlin Zhang, Depen Morwani, Nikhil Vyas, Jingfeng Wu, Difan Zou, Udaya Ghai, Dean P. Foster, and Sham M. Kakade. How Does Critical Batch Size Scale in Pre-training? International Conference on Learning Representations.

  27. Configuration-to-Performance Scaling Law with Neural Ansatz. arXiv:2602.10300.

  28. Guodong Zhang, Lala Li, Zachary Nado, James Martens, Sushant Sachdeva, George Dahl, Chris Shallue, and Roger B Grosse. Which Algorithmic Choices Matter at Which Batch Sizes? Insights From a Noisy Quadratic Model.

  29. Measuring the Effects of Data Parallelism on Neural Network Training. J. Mach. Learn. Res.

  30. Practical Efficiency of … May 2025. arXiv:2505.02222.

  31. Keller Jordan, Yuchen Jin, Vlado Boza, You Jiacheng, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks.

  32. Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization.