pith. machine review for the scientific record. sign in

arxiv: 2605.09154 · v1 · submitted 2026-05-09 · 💻 cs.LG

Recognition: 2 theorem links

· Lean Theorem

Predicting Large Model Test Losses with a Noisy Quadratic System

Authors on Pith no claims yet

Pith reviewed 2026-05-12 04:35 UTC · model grok-4.3

classification 💻 cs.LG
keywords loss predictionscaling lawsbatch sizelarge language modelscompute efficiencynoisy quadraticpre-training optimization
0
0 comments X

The pith

A noisy quadratic system predicts the test loss of large models from their size, batch size, and number of weight updates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a predictive model for the pre-training loss of large models using only the model size, batch size, and number of weight updates. This is the first such model that accounts for variations in batch size, and it projects losses more accurately than previous approaches when scaling to much larger compute budgets. Researchers can use it to select the best combination of model size, batch size, and steps that fit within given time, memory, or compute limits. Experiments show the chosen configurations perform close to the true optimum. The work argues that fitting loss curves directly is preferable to building ever-more-complex heuristic scaling laws.

Core claim

We show that the test loss follows a noisy quadratic system in the variables of model size N, batch size B, and number of updates K. Fitting this system to data from smaller-scale runs allows reliable prediction of loss at larger scales, even when batch size is not held constant, and outperforms the Chinchilla loss model for extrapolations up to 1000 times larger compute.

What carries the argument

Noisy quadratic system: a mathematical model that describes loss as a quadratic function of N, B, and K with added noise, used to fit and extrapolate training dynamics.

If this is right

  • The model can identify optimal N, B, K triples under constraints on time, memory, and compute.
  • Selected configurations achieve losses close to the ground-truth best performance.
  • It enables better planning for large training runs by forecasting losses without running them.
  • Loss prediction is positioned as a scalable alternative to complex heuristic laws.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the model holds, it could allow testing many batch size schedules in simulation before committing to full training.
  • Extensions might include incorporating other variables like learning rate schedules or data quality into the quadratic form.
  • Similar systems could be applied to predict not just loss but other metrics like downstream accuracy if correlations are strong.

Load-bearing premise

The fitted noisy quadratic system from small experiments remains valid when batch size is varied and when predicting at scales up to 1000 times larger in compute.

What would settle it

Training a model at a large extrapolated scale using the model's recommended N, B, K and measuring a test loss that differs substantially from the prediction.

Figures

Figures reproduced from arXiv: 2605.09154 by Chris J. Maddison, Chuning Li.

Figure 1
Figure 1. Figure 1: The Noisy Quadratic System (NQS) predicted LLM test losses on holdout data, outperforming Chinchilla’s loss model. Shown are “IsoFLOP” slices: data points arranged by their pre￾training compute budget (selectively labeled next to the curves, in PetaFLOPs). Top: NQS successfully predicted the loss of LLMs over variations of model size N, holding constant the total FLOPs budget, at compute budgets up to 1000… view at source ↗
Figure 2
Figure 2. Figure 2: Chinchilla Method 3 is not a great predictive model. (a) Chinchilla Method 3 fitted on the entire Hoffmann IsoFLOPs dataset describes the data well. (b) Once the dataset is divided into Train/Holdout, the performance on the holdout slices deteriorated. For this figure, we used the original Hoffmann series of LLM data points upon which the Chinchilla method was developed (Besiroglu et al., 2024). First, we … view at source ↗
Figure 3
Figure 3. Figure 3: Accounting for the effect of LayerNorm by setting the learning rate γ ∝ 1 ||w||2 helps NQS fit large models trained with small batch sizes. but adjusting for the effect of LayerNorm is necessary for NQS to perform well on smaller batch sizes. From this point on, to experiment with the NQS, it is suffi￾cient to rely on the simplified expressions of (5) and (6). 4.3. Inferring NQS parameters The procedure to… view at source ↗
Figure 4
Figure 4. Figure 4: NQS is robust in extrapolation. The x-axis represents gradually removing data points from the training set, starting from those training points with the highest compute budgets, so as to increase the compute gap between the training set and the holdout set. (a) In our experiments, Chinchilla can successfully extrapolate 20 times into higher compute budget. (b) NQS succeeded until the largest training run i… view at source ↗
Figure 5
Figure 5. Figure 5: The predictions of NQS can be used to select (N, B, K) under compound resource constraints. Each subplot: an IsoFLOP plane (C = 236 PF) with coordinates (x, y) representing N = x, B = y, K = 2.6 × 1014/(xy). The NQS model is trained on data with FLOP budget up to C = 147 PF. The red diamond is the default configuration, Chinchilla compute optimal model size trained at the Critical Batch Size (McCandlish et… view at source ↗
Figure 6
Figure 6. Figure 6: Chinchilla’s extrapolation performance on the original Hoffman dataset. The Hoffman dataset spans a smaller range than our Pythia + OWT2 dataset, and was less suitable for analysis of extrapolation performance. Nevertheless, the emerging trend is consistent with results on our LLM dataset. 13 [PITH_FULL_IMAGE:figures/full_fig_p013_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: NQS performance on Llama + LM1B. The figure is based on a series of Llama-like models of sizes up to 0.5B and pre-training compute budgets between 5 PetaFLOPs and 1,000 PetaFLOPs. 14 [PITH_FULL_IMAGE:figures/full_fig_p014_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: The NQS is robust to changes in the training set. Shown is a 90% confidence interval around the NQS predictions, constructed using 100 trials, each using a random subset (a 50% subsample) of the training runs to infer the NQS parameters [PITH_FULL_IMAGE:figures/full_fig_p015_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: The NQS is robust to changes in the training set: the 90% confidence intervals are tight relative to the range of the predictions. The confidence intervals are constructed using 100 trials, each using a random subset (a 50% subsample) of the training runs to infer the NQS parameters. 15 [PITH_FULL_IMAGE:figures/full_fig_p015_9.png] view at source ↗
read the original abstract

We introduce a predictive model that estimates the pre-training loss of large models from model size (N), batch size (B) and number of weight updates (K). This is the first loss prediction model that can handle changing batch size. The model outperforms Chinchilla's loss model, a model of the test loss using the batch size and number of tokens, in terms of projecting the loss at extrapolated compute budgets (up to 1000 folds). A natural use of the model is to find optimal N, B, K configurations under explicit and compound resource constraints like time, memory and compute. In our experiments, the model-selected configurations are close to ground-truth optimal. Our work advocates for loss prediction as a better alternative to heuristic-based laws, which are growing in complexity. The implementation is available on https://github.com/chuningxdy/Noisy-Quadratic-System.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper introduces a noisy quadratic model for predicting pre-training test loss as a function of model size N, batch size B, and number of weight updates K. It claims to be the first loss prediction approach that explicitly handles varying batch sizes, outperforms Chinchilla's token-based scaling law when extrapolating to compute budgets up to 1000x larger, and enables selection of near-optimal N/B/K configurations under explicit resource constraints such as time, memory, and compute. Experiments reportedly show the model-selected configurations are close to ground-truth optima, with open-source code provided.

Significance. If the extrapolation and optimality claims hold under proper validation, the work would offer a more flexible, batch-size-aware alternative to existing heuristic scaling laws, potentially improving practical hyperparameter selection for large-model training. The open-source implementation and focus on compound constraints are positive contributions that could facilitate follow-up work on loss prediction as a tool for efficient scaling.

major comments (3)
  1. [§4.2 and §5.1] §4.2 and §5.1: The extrapolation experiments to 1000x compute budgets report outperformance over Chinchilla but provide no held-out validation results at intermediate scales (e.g., 10x or 100x) using the same fitted quadratic coefficients; without this, it is impossible to confirm that the quadratic form in (N, B, K) remains accurate outside the fitting regime rather than overfitting small-scale noise.
  2. [§3.1, Eq. (3)–(5)] §3.1, Eq. (3)–(5): The noisy quadratic system is fitted to loss trajectories from smaller runs, yet the manuscript does not specify the train/validation split used for coefficient estimation versus the extrapolation test sets; this leaves open the possibility that reported gains are partly driven by in-sample fitting rather than genuine out-of-distribution prediction.
  3. [Table 3 and §5.3] Table 3 and §5.3: The near-optimal configuration results lack error bars, multiple random seeds, or statistical tests comparing model-selected N/B/K against ground-truth optima; the claim that selections are “close to ground-truth optimal” therefore cannot be assessed for robustness across runs.
minor comments (3)
  1. [§3] The notation for the noise term and the precise definition of the quadratic coefficients (e.g., whether they are constant or depend on B) is introduced without a clear summary table; adding one would improve readability.
  2. [Figure 2] Figure 2 caption does not state the exact compute budgets or number of runs used for the extrapolation curves, making direct comparison with Chinchilla difficult.
  3. [Abstract and §5.1] The abstract states “up to 1000 folds” extrapolation but the main text does not list the precise maximum scale factor achieved in each experiment; this minor inconsistency should be aligned.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful comments, which help improve the clarity and rigor of our work. We address each major comment below.

read point-by-point responses
  1. Referee: [§4.2 and §5.1] §4.2 and §5.1: The extrapolation experiments to 1000x compute budgets report outperformance over Chinchilla but provide no held-out validation results at intermediate scales (e.g., 10x or 100x) using the same fitted quadratic coefficients; without this, it is impossible to confirm that the quadratic form in (N, B, K) remains accurate outside the fitting regime rather than overfitting small-scale noise.

    Authors: We agree that intermediate-scale validation would provide stronger evidence for the generalizability of the noisy quadratic model. In the original experiments, the model was fitted on small-scale runs and directly tested on much larger scales to demonstrate extrapolation. To address this concern, we will include additional held-out predictions at 10x and 100x compute budgets using the same fitted coefficients in the revised manuscript, along with comparisons to ground-truth losses where available. revision: yes

  2. Referee: [§3.1, Eq. (3)–(5)] §3.1, Eq. (3)–(5): The noisy quadratic system is fitted to loss trajectories from smaller runs, yet the manuscript does not specify the train/validation split used for coefficient estimation versus the extrapolation test sets; this leaves open the possibility that reported gains are partly driven by in-sample fitting rather than genuine out-of-distribution prediction.

    Authors: We apologize for the omission. The fitting procedure uses a specific set of small-scale training runs for estimating the quadratic coefficients, with separate larger runs reserved for extrapolation testing. We will clarify this split explicitly in Section 3.1 of the revised manuscript, including details on which runs were used for fitting versus evaluation to ensure transparency. revision: yes

  3. Referee: [Table 3 and §5.3] Table 3 and §5.3: The near-optimal configuration results lack error bars, multiple random seeds, or statistical tests comparing model-selected N/B/K against ground-truth optima; the claim that selections are “close to ground-truth optimal” therefore cannot be assessed for robustness across runs.

    Authors: We acknowledge that reporting variability across seeds would strengthen the results. The current experiments used single runs for each configuration due to computational constraints, but we will add error bars from multiple random seeds in the revised version of Table 3 and Section 5.3, along with a brief statistical comparison where feasible. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the noisy quadratic loss prediction model

full rationale

The paper defines a noisy quadratic system as a functional form for test loss in terms of N, B, and K, fits its parameters on smaller-scale experimental runs, and uses the resulting model to project losses at larger extrapolated compute budgets with variable batch sizes. This is a standard empirical scaling-law construction with independent predictive content; the extrapolation step does not reduce to the fitting inputs by definition or by renaming a fitted quantity as a prediction. No self-citations are load-bearing, no uniqueness theorems are invoked, and the outperformance claim versus Chinchilla is presented as a direct empirical comparison on held-out large-scale regimes. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 0 axioms · 0 invented entities

The central claim rests on the assumption that loss dynamics follow a noisy quadratic form in N, B, K space whose parameters can be fitted once and then used for extrapolation; no independent derivation or external benchmarks are mentioned in the abstract.

free parameters (1)
  • quadratic coefficients and noise parameters
    The noisy quadratic system requires several coefficients that must be fitted to observed loss curves.

pith-pipeline@v0.9.0 · 5444 in / 1192 out tokens · 25169 ms · 2026-05-12T04:35:41.291434+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · 6 internal anchors

  1. [1]

    The American Mathematical Monthly , volume=

    An elementary view of Euler's summation formula , author=. The American Mathematical Monthly , volume=. 1999 , publisher=

  2. [2]

    Training Compute-Optimal Large Language Models

    Training Compute-Optimal Large Language Models , author =. arXiv preprint arXiv:2203.15556 , year =. doi:10.48550/arXiv.2203.15556 , url =

  3. [3]

    2017 , eprint=

    Deep Learning Scaling is Predictable, Empirically , author=. 2017 , eprint=

  4. [4]

    2020 , eprint=

    Scaling Laws for Neural Language Models , author=. 2020 , eprint=

  5. [5]

    2020 , eprint=

    Language Models are Few-Shot Learners , author=. 2020 , eprint=

  6. [6]

    2024 , eprint=

    Reconciling Kaplan and Chinchilla Scaling Laws , author=. 2024 , eprint=

  7. [7]

    arXiv:2404.10102 , year=

    Chinchilla scaling: A replication attempt , author =. arXiv preprint arXiv:2404.10102 , year =

  8. [8]

    2025 , eprint=

    Resolving Discrepancies in Compute-Optimal Scaling of Language Models , author=. 2025 , eprint=

  9. [9]

    DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

    DeepSeek LLM: Scaling open-source language models with longtermism , author =. arXiv preprint arXiv:2401.02954 , year =

  10. [10]

    LLaMA: Open and Efficient Foundation Language Models

    LLaMA: Open and efficient foundation language models , author =. arXiv preprint arXiv:2302.13971 , year =

  11. [11]

    2019 , institution =

    Language Models are Unsupervised Multitask Learners , author =. 2019 , institution =

  12. [12]

    Qwen2.5 Technical Report

    Qwen2.5 Technical Report , author =. arXiv preprint arXiv:2412.15115 , year =

  13. [13]

    2024 , howpublished =

    The. 2024 , howpublished =

  14. [14]

    2019 , eprint=

    Which Algorithmic Choices Matter at Which Batch Sizes? Insights From a Noisy Quadratic Model , author=. 2019 , eprint=

  15. [15]

    GPT-4 Technical Report

    Gpt-4 technical report , author=. arXiv preprint arXiv:2303.08774 , year=

  16. [16]

    Journal of Machine Learning Research , year =

    James Martens , title =. Journal of Machine Learning Research , year =

  17. [17]

    2019 , eprint=

    Measuring the Effects of Data Parallelism on Neural Network Training , author=. 2019 , eprint=

  18. [18]

    2025 , eprint=

    Improved Scaling Laws in Linear Regression via Data Reuse , author=. 2025 , eprint=

  19. [19]

    2025 , eprint=

    Power Lines: Scaling Laws for Weight Decay and Batch Size in LLM Pre-training , author=. 2025 , eprint=

  20. [20]

    2025 , eprint=

    Small Batch Size Training for Language Models: When Vanilla SGD Works, and Why Gradient Accumulation Is Wasteful , author=. 2025 , eprint=

  21. [21]

    An Empirical Model of Large-Batch Training

    An empirical model of large-batch training , author =. arXiv preprint arXiv:1812.06162 , year =

  22. [22]

    Zhang, D

    How does critical batch size scale in pre-training? , author =. arXiv preprint arXiv:2410.21676 , year =

  23. [23]

    2021 , eprint=

    The Depth-to-Width Interplay in Self-Attention , author=. 2021 , eprint=

  24. [24]

    Investigating the Overlooked Hessian Structure: From

    Qian-Yuan Tang and Yufei Gu and Yunfeng Cai and Mingming Sun and Ping Li and zhou Xun and Zeke Xie , booktitle=. Investigating the Overlooked Hessian Structure: From. 2025 , url=

  25. [25]

    2023 , eprint=

    Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling , author=. 2023 , eprint=

  26. [26]

    Transformers: State-of-the-Art Natural Language Processing

    Wolf, Thomas and Debut, Lysandre and Sanh, Victor and Chaumond, Julien and Delangue, Clement and Moi, Anthony and Cistac, Pierric and Rault, Tim and Louf, Remi and Funtowicz, Morgan and Davison, Joe and Shleifer, Sam and von Platen, Patrick and Ma, Clara and Jernite, Yacine and Plu, Julien and Xu, Canwen and Le Scao, Teven and Gugger, Sylvain and Drame, M...

  27. [27]

    Computer Science , year =

    OpenWebText Corpus , author =. Computer Science , year =

  28. [28]

    2014 , eprint=

    One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling , author=. 2014 , eprint=

  29. [29]

    C Users Journal , volume =

    A New Algorithm for Data Compression , author =. C Users Journal , volume =

  30. [30]

    S entence P iece: A simple and language independent subword tokenizer and detokenizer for neural text processing

    SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing , author =. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations , pages =. 2018 , publisher =. doi:10.18653/v1/D18-2012 , url =

  31. [31]

    2022 , eprint=

    A Solvable Model of Neural Scaling Laws , author=. 2022 , eprint=

  32. [32]

    2025 , eprint=

    How Feature Learning Can Improve Neural Scaling Laws , author=. 2025 , eprint=

  33. [33]

    2024 , eprint=

    A Dynamical Model of Neural Scaling Laws , author=. 2024 , eprint=

  34. [34]

    2025 , eprint=

    4+3 Phases of Compute-Optimal Neural Scaling Laws , author=. 2025 , eprint=

  35. [35]

    2025 , eprint=

    Scaling Laws in Linear Regression: Compute, Parameters, and Data , author=. 2025 , eprint=

  36. [36]

    2024 , eprint=

    The Quantization Model of Neural Scaling , author=. 2024 , eprint=

  37. [37]

    2018 , eprint=

    On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport , author=. 2018 , eprint=

  38. [38]

    2020 , eprint=

    Neural Tangent Kernel: Convergence and Generalization in Neural Networks , author=. 2020 , eprint=

  39. [39]

    Explaining neural scaling laws , volume=

    Bahri, Yasaman and Dyer, Ethan and Kaplan, Jared and Lee, Jaehoon and Sharma, Utkarsh , year=. Explaining neural scaling laws , volume=. Proceedings of the National Academy of Sciences , publisher=. doi:10.1073/pnas.2311878121 , number=

  40. [40]

    2025 , eprint=

    Learning quadratic neural networks in high dimensions: SGD dynamics and scaling laws , author=. 2025 , eprint=

  41. [41]

    2025 , eprint=

    Emergence and scaling laws in SGD learning of shallow neural networks , author=. 2025 , eprint=

  42. [42]

    2025 , eprint=

    Scaling Laws for Optimal Data Mixtures , author=. 2025 , eprint=

  43. [43]

    2025 , eprint=

    Scaling Data-Constrained Language Models , author=. 2025 , eprint=

  44. [44]

    2025 , eprint=

    MixMin: Finding Data Mixtures via Convex Minimization , author=. 2025 , eprint=

  45. [45]

    2022 , eprint=

    Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer , author=. 2022 , eprint=

  46. [46]

    2025 , eprint=

    Don't be lazy: CompleteP enables compute-efficient deep transformers , author=. 2025 , eprint=

  47. [47]

    2024 , eprint=

    Scaling Exponents Across Parameterizations and Optimizers , author=. 2024 , eprint=

  48. [48]

    2025 , eprint=

    Scaling Optimal LR Across Token Horizons , author=. 2025 , eprint=

  49. [49]

    2025 , eprint=

    A Multi-Power Law for Loss Curve Prediction Across Learning Rate Schedules , author=. 2025 , eprint=

  50. [50]

    2024 , eprint=

    The Road Less Scheduled , author=. 2024 , eprint=

  51. [51]

    2017 , eprint=

    Adam: A Method for Stochastic Optimization , author=. 2017 , eprint=

  52. [52]

    2019 , eprint=

    Decoupled Weight Decay Regularization , author=. 2019 , eprint=

  53. [53]

    2025 , eprint=

    Dimension-adapted Momentum Outscales SGD , author=. 2025 , eprint=

  54. [54]

    2023 , month = jun, day =

    Edward Rees , title =. 2023 , month = jun, day =

  55. [55]

    Aplikace matematiky , volume =

    Jan Ježek , title =. Aplikace matematiky , volume =. 1988 , url =

  56. [56]

    GitHub repository , year=

    JAX: composable transformations of Python+NumPy programs , author=. GitHub repository , year=

  57. [57]

    2025 , eprint=

    Scaling Collapse Reveals Universal Dynamics in Compute-Optimally Trained Neural Networks , author=. 2025 , eprint=

  58. [58]

    2017 , eprint=

    L2 Regularization versus Batch and Weight Normalization , author=. 2017 , eprint=

  59. [59]

    2019 , eprint=

    Norm matters: efficient and accurate normalization schemes in deep networks , author=. 2019 , eprint=

  60. [60]

    2020 , eprint=

    Reconciling Modern Deep Learning with Traditional Optimization Analyses: The Intrinsic Learning Rate , author=. 2020 , eprint=

  61. [61]

    2025 , eprint=

    Evaluating the Robustness of Chinchilla Compute-Optimal Scaling , author=. 2025 , eprint=

  62. [62]

    Nature Methods , volume =

    SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python , author =. Nature Methods , volume =. 2020 , doi =

  63. [63]

    2024 , eprint=

    Language models scale reliably with over-training and on downstream tasks , author=. 2024 , eprint=

  64. [64]

    2022 , eprint=

    Revisiting Neural Scaling Laws in Language and Vision , author=. 2022 , eprint=

  65. [65]

    Advances in Neural Information Processing Systems 6 , year=

    Learning Curves: Asymptotic Values and Rate of Convergence , author=. Advances in Neural Information Processing Systems 6 , year=