Recognition: 2 Lean theorem links
Predicting Large Model Test Losses with a Noisy Quadratic System
Pith reviewed 2026-05-12 04:35 UTC · model grok-4.3
The pith
A noisy quadratic system predicts the test loss of large models from their size, batch size, and number of weight updates.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We show that the test loss follows a noisy quadratic system in the variables of model size N, batch size B, and number of updates K. Fitting this system to data from smaller-scale runs allows reliable prediction of loss at larger scales, even when batch size is not held constant, and outperforms the Chinchilla loss model for extrapolations up to 1000 times larger compute.
What carries the argument
Noisy quadratic system: a model of training as noisy stochastic optimization on a quadratic loss surface, whose expected risk yields a closed-form prediction of test loss in terms of N, B, and K; the form is fitted to small-scale runs and extrapolated to larger training configurations.
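As a rough schematic of what such a predictor could look like (assuming the E_app, E_bias, E_var decomposition quoted in the Lean-theorem section below; the specific power-law forms and coefficients here are illustrative, not the paper's fitted expressions):

```latex
% A schematic only: the split into approximation, bias, and variance terms follows
% the passage quoted in the Lean-theorem section below; the particular power-law
% dependence on N, B, K is an illustrative assumption, not the paper's fitted form.
\[
  \hat{L}(N, B, K)
  \;=\; L_{\infty}
  \;+\; \underbrace{\frac{a}{N^{\alpha}}}_{E_{\mathrm{app}}:\ \text{finite capacity}}
  \;+\; \underbrace{\frac{b}{K^{\beta}}}_{E_{\mathrm{bias}}:\ \text{incomplete optimization}}
  \;+\; \underbrace{\frac{c}{B\,K^{\gamma}}}_{E_{\mathrm{var}}:\ \text{gradient noise, shrinks with } B}
\]
```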
If this is right
- The model can identify optimal N, B, K triples under constraints on time, memory, and compute (a minimal selection sketch follows this list).
- Selected configurations achieve losses close to the ground-truth best performance.
- It enables better planning for large training runs by forecasting losses without running them.
- Loss prediction is positioned as a scalable alternative to complex heuristic laws.
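A minimal sketch of how such a constrained selection could work, assuming a fitted predictor and rough cost models; every function name, coefficient, and cost formula here is hypothetical and stands in for the paper's actual implementation:

```python
import itertools

def predict_loss(N, B, K):
    """Hypothetical fitted predictor; stands in for the paper's noisy quadratic system."""
    # Made-up coefficients, monotone in N, B, K purely for illustration.
    return 1.8 + 400.0 / N**0.35 + 50.0 / (B * K) ** 0.3

def select_config(N_grid, B_grid, K_grid, max_compute, max_memory, max_steps):
    """Pick the (N, B, K) with the lowest predicted loss under placeholder constraints."""
    best = None
    for N, B, K in itertools.product(N_grid, B_grid, K_grid):
        compute = 6 * N * B * K   # rough FLOP proxy (units here are arbitrary placeholders)
        memory = 16 * N           # rough bytes for weights plus optimizer state
        if compute > max_compute or memory > max_memory or K > max_steps:
            continue
        loss = predict_loss(N, B, K)
        if best is None or loss < best[0]:
            best = (loss, (N, B, K))
    return best

# Example: search a small grid under compute, memory, and wall-clock (step) limits.
print(select_config(
    N_grid=[1e8, 3e8, 1e9, 3e9],
    B_grid=[2**i for i in range(6, 12)],   # 64 .. 2048
    K_grid=[1e4, 3e4, 1e5, 3e5],
    max_compute=1e20, max_memory=1e11, max_steps=5e5,
))
```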
Where Pith is reading between the lines
- If the model holds, it could allow testing many batch size schedules in simulation before committing to full training (a toy simulation follows this list).
- Extensions might include incorporating other variables like learning rate schedules or data quality into the quadratic form.
- Similar systems could be applied to predict not just loss but other metrics like downstream accuracy if correlations are strong.
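Following on the first bullet above, here is a toy version of that idea: stepping candidate batch-size schedules through a per-eigendirection noisy quadratic recursion in the spirit of the classic noisy quadratic model (reference [14] below). The curvature spectrum, noise scales, learning rate, and schedules are all made-up assumptions, not the paper's fitted system.

```python
import numpy as np

def simulate_schedule(batch_schedule, lr=0.1, n_dims=100):
    """Expected risk of SGD on a toy noisy quadratic, stepping through a batch-size schedule.

    Loss = 0.5 * sum_i lam[i] * x[i]^2; each SGD step adds gradient noise with
    per-direction variance c[i] / B, so E[x_i^2] evolves in closed form.
    """
    lam = 1.0 / np.arange(1, n_dims + 1)   # made-up power-law curvature spectrum
    c = lam.copy()                         # noise scale tied to curvature (an assumption)
    ex2 = np.ones(n_dims)                  # E[x_i^2] at initialization
    for B in batch_schedule:
        ex2 = (1.0 - lr * lam) ** 2 * ex2 + lr**2 * c / B
    return 0.5 * np.sum(lam * ex2)

# Compare a constant batch size against a schedule that ramps B up over training.
steps = 2000
constant = [256] * steps
ramped = np.linspace(64, 1024, steps).astype(int)
print("constant B:", simulate_schedule(constant))
print("ramped B:  ", simulate_schedule(ramped))
```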
Load-bearing premise
The fitted noisy quadratic system from small experiments remains valid when batch size is varied and when predicting at scales up to 1000 times larger in compute.
What would settle it
Training a model at a large extrapolated scale using the model's recommended N, B, K and checking whether the measured test loss matches the prediction; a substantial discrepancy would undermine the load-bearing premise.
Original abstract
We introduce a predictive model that estimates the pre-training loss of large models from model size (N), batch size (B) and number of weight updates (K). This is the first loss prediction model that can handle changing batch size. The model outperforms Chinchilla's loss model, a model of the test loss using the batch size and number of tokens, in terms of projecting the loss at extrapolated compute budgets (up to 1000 folds). A natural use of the model is to find optimal N, B, K configurations under explicit and compound resource constraints like time, memory and compute. In our experiments, the model-selected configurations are close to ground-truth optimal. Our work advocates for loss prediction as a better alternative to heuristic-based laws, which are growing in complexity. The implementation is available on https://github.com/chuningxdy/Noisy-Quadratic-System.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a noisy quadratic model for predicting pre-training test loss as a function of model size N, batch size B, and number of weight updates K. It claims to be the first loss prediction approach that explicitly handles varying batch sizes, outperforms Chinchilla's token-based scaling law when extrapolating to compute budgets up to 1000x larger, and enables selection of near-optimal N/B/K configurations under explicit resource constraints such as time, memory, and compute. Experiments reportedly show the model-selected configurations are close to ground-truth optima, with open-source code provided.
Significance. If the extrapolation and optimality claims hold under proper validation, the work would offer a more flexible, batch-size-aware alternative to existing heuristic scaling laws, potentially improving practical hyperparameter selection for large-model training. The open-source implementation and focus on compound constraints are positive contributions that could facilitate follow-up work on loss prediction as a tool for efficient scaling.
major comments (3)
- [§4.2, §5.1] The extrapolation experiments to 1000x compute budgets report outperformance over Chinchilla but provide no held-out validation results at intermediate scales (e.g., 10x or 100x) using the same fitted quadratic coefficients; without this, it is impossible to confirm that the quadratic form in (N, B, K) remains accurate outside the fitting regime rather than overfitting small-scale noise.
- [§3.1, Eq. (3)–(5)] The noisy quadratic system is fitted to loss trajectories from smaller runs, yet the manuscript does not specify the train/validation split used for coefficient estimation versus the extrapolation test sets; this leaves open the possibility that reported gains are partly driven by in-sample fitting rather than genuine out-of-distribution prediction.
- [Table 3, §5.3] The near-optimal configuration results lack error bars, multiple random seeds, or statistical tests comparing model-selected N/B/K against ground-truth optima; the claim that selections are “close to ground-truth optimal” therefore cannot be assessed for robustness across runs.
minor comments (3)
- [§3] The notation for the noise term and the precise definition of the quadratic coefficients (e.g., whether they are constant or depend on B) is introduced without a clear summary table; adding one would improve readability.
- [Figure 2] Figure 2 caption does not state the exact compute budgets or number of runs used for the extrapolation curves, making direct comparison with Chinchilla difficult.
- [Abstract and §5.1] The abstract claims extrapolation “up to 1000 folds”, but the main text does not state the precise maximum scale factor achieved in each experiment; these statements should be aligned.
Simulated Author's Rebuttal
We thank the referee for their thoughtful comments, which help improve the clarity and rigor of our work. We address each major comment below.
Point-by-point responses
Referee: [§4.2, §5.1] The extrapolation experiments to 1000x compute budgets report outperformance over Chinchilla but provide no held-out validation results at intermediate scales (e.g., 10x or 100x) using the same fitted quadratic coefficients; without this, it is impossible to confirm that the quadratic form in (N, B, K) remains accurate outside the fitting regime rather than overfitting small-scale noise.
Authors: We agree that intermediate-scale validation would provide stronger evidence for the generalizability of the noisy quadratic model. In the original experiments, the model was fitted on small-scale runs and directly tested on much larger scales to demonstrate extrapolation. To address this concern, we will include additional held-out predictions at 10x and 100x compute budgets using the same fitted coefficients in the revised manuscript, along with comparisons to ground-truth losses where available (a schematic of such a check follows these responses). Revision: yes.
Referee: [§3.1, Eq. (3)–(5)] The noisy quadratic system is fitted to loss trajectories from smaller runs, yet the manuscript does not specify the train/validation split used for coefficient estimation versus the extrapolation test sets; this leaves open the possibility that reported gains are partly driven by in-sample fitting rather than genuine out-of-distribution prediction.
Authors: We apologize for the omission. The fitting procedure uses a specific set of small-scale training runs for estimating the quadratic coefficients, with separate larger runs reserved for extrapolation testing. We will clarify this split explicitly in Section 3.1 of the revised manuscript, including details on which runs were used for fitting versus evaluation to ensure transparency. Revision: yes.
Referee: [Table 3, §5.3] The near-optimal configuration results lack error bars, multiple random seeds, or statistical tests comparing model-selected N/B/K against ground-truth optima; the claim that selections are “close to ground-truth optimal” therefore cannot be assessed for robustness across runs.
Authors: We acknowledge that reporting variability across seeds would strengthen the results. The current experiments used single runs for each configuration due to computational constraints, but we will add error bars from multiple random seeds in the revised version of Table 3 and Section 5.3, along with a brief statistical comparison where feasible. Revision: yes.
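For concreteness, a minimal sketch of the kind of held-out check promised in the first response above, assuming a generic parametric loss form fitted with SciPy's curve_fit; the functional form, synthetic fitting data, and test scales are placeholders, not the paper's coefficients or runs.

```python
import numpy as np
from scipy.optimize import curve_fit

def loss_form(X, a, alpha, b, beta, L_inf):
    """Placeholder parametric loss in (N, B, K); stands in for the paper's fitted system."""
    N, B, K = X
    return L_inf + a / N**alpha + b / (B * K) ** beta

# Synthetic small-scale "fitting" runs (illustration only).
rng = np.random.default_rng(0)
N_fit = np.array([1e7, 2e7, 5e7, 1e8, 2e8, 5e8])
B_fit = np.array([128.0, 128.0, 256.0, 256.0, 512.0, 512.0])
K_fit = np.array([2e4, 3e4, 3e4, 5e4, 5e4, 8e4])
true_params = (200.0, 0.3, 50.0, 0.2, 1.7)
y_fit = loss_form((N_fit, B_fit, K_fit), *true_params) + 0.01 * rng.normal(size=N_fit.size)

# Fit the coefficients on the small-scale runs only.
params, _ = curve_fit(
    loss_form, (N_fit, B_fit, K_fit), y_fit,
    p0=[150.0, 0.25, 30.0, 0.25, 1.5], bounds=(0.0, np.inf),
)

# Predict at held-out scales (roughly 10x and 100x the fitting compute); in a real
# check these predictions would be compared against measured losses from actual runs.
N_test = np.array([1e9, 1e10])
B_test = np.array([1024.0, 2048.0])
K_test = np.array([1e5, 2e5])
print("held-out predictions:", loss_form((N_test, B_test, K_test), *params))
```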
Circularity Check
No significant circularity in the noisy quadratic loss prediction model
Full rationale
The paper defines a noisy quadratic system as a functional form for test loss in terms of N, B, and K, fits its parameters on smaller-scale experimental runs, and uses the resulting model to project losses at larger extrapolated compute budgets with variable batch sizes. This is a standard empirical scaling-law construction with independent predictive content; the extrapolation step does not reduce to the fitting inputs by definition or by renaming a fitted quantity as a prediction. No self-citations are load-bearing, no uniqueness theorems are invoked, and the outperformance claim versus Chinchilla is presented as a direct empirical comparison on held-out large-scale regimes. The derivation chain therefore remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- quadratic coefficients and noise parameters
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel — tagged unclear
  The relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "We model LLM test losses as the expected risk of stochastic optimization on a quadratic loss surface Q_NQS(w) ... Assumptions 4.1-4.3 on power-law spectra for bias/variance."
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · costAlphaLog_fourth_deriv_at_zero — tagged unclear
  The relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "NQS expression (5) with E_app, E_bias, E_var terms obtained from SGD updates on eigen-directions of H."
What do these tags mean?
- matches — The paper's claim is directly supported by a theorem in the formal canon.
- supports — The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends — The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses — The paper appears to rely on the theorem as machinery.
- contradicts — The paper's claim conflicts with a theorem or certificate in the canon.
- unclear — Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] An elementary view of Euler's summation formula. The American Mathematical Monthly, 1999.
- [2] Training Compute-Optimal Large Language Models. arXiv preprint arXiv:2203.15556. doi:10.48550/arXiv.2203.15556.
- [3] Deep Learning Scaling is Predictable, Empirically. 2017.
- [4]
- [5]
- [6] Reconciling Kaplan and Chinchilla Scaling Laws. 2024.
- [7] Chinchilla Scaling: A Replication Attempt. arXiv preprint arXiv:2404.10102.
- [8] Resolving Discrepancies in Compute-Optimal Scaling of Language Models. 2025.
- [9] DeepSeek LLM: Scaling Open-Source Language Models with Longtermism. arXiv preprint arXiv:2401.02954.
- [10] LLaMA: Open and Efficient Foundation Language Models. arXiv preprint arXiv:2302.13971.
- [11] Language Models are Unsupervised Multitask Learners. 2019.
- [12] Qwen2.5 Technical Report. arXiv preprint arXiv:2412.15115.
- [13]
- [14] Which Algorithmic Choices Matter at Which Batch Sizes? Insights From a Noisy Quadratic Model. 2019.
- [15] GPT-4 Technical Report. arXiv preprint arXiv:2303.08774.
- [16] James Martens. Journal of Machine Learning Research.
- [17] Measuring the Effects of Data Parallelism on Neural Network Training. 2019.
- [18] Improved Scaling Laws in Linear Regression via Data Reuse. 2025.
- [19] Power Lines: Scaling Laws for Weight Decay and Batch Size in LLM Pre-training. 2025.
- [20] Small Batch Size Training for Language Models: When Vanilla SGD Works, and Why Gradient Accumulation Is Wasteful. 2025.
- [21] An Empirical Model of Large-Batch Training. arXiv preprint arXiv:1812.06162.
- [22]
- [23] The Depth-to-Width Interplay in Self-Attention. 2021.
- [24] Qian-Yuan Tang, Yufei Gu, Yunfeng Cai, Mingming Sun, Ping Li, Zhou Xun, and Zeke Xie. Investigating the Overlooked Hessian Structure: From …. 2025.
- [25] Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling. 2023.
- [26] Wolf, Thomas, et al. Transformers: State-of-the-Art Natural Language Processing.
- [27]
- [28] One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling. 2014.
- [29] A New Algorithm for Data Compression. C Users Journal.
- [30] SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2018. doi:10.18653/v1/D18-2012.
- [31]
- [32] How Feature Learning Can Improve Neural Scaling Laws. 2025.
- [33]
- [34] 4+3 Phases of Compute-Optimal Neural Scaling Laws. 2025.
- [35] Scaling Laws in Linear Regression: Compute, Parameters, and Data. 2025.
- [36]
- [37] On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport. 2018.
- [38] Neural Tangent Kernel: Convergence and Generalization in Neural Networks. 2020.
- [39] Bahri, Yasaman, Ethan Dyer, Jared Kaplan, Jaehoon Lee, and Utkarsh Sharma. Explaining Neural Scaling Laws. Proceedings of the National Academy of Sciences. doi:10.1073/pnas.2311878121.
- [40] Learning Quadratic Neural Networks in High Dimensions: SGD Dynamics and Scaling Laws. 2025.
- [41] Emergence and Scaling Laws in SGD Learning of Shallow Neural Networks. 2025.
- [42]
- [43]
- [44] MixMin: Finding Data Mixtures via Convex Minimization. 2025.
- [45] Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer. 2022.
- [46] Don't Be Lazy: CompleteP Enables Compute-Efficient Deep Transformers. 2025.
- [47] Scaling Exponents Across Parameterizations and Optimizers. 2024.
- [48]
- [49] A Multi-Power Law for Loss Curve Prediction Across Learning Rate Schedules. 2025.
- [50]
- [51]
- [52]
- [53]
- [54]
- [55] Jan Ježek. Aplikace matematiky, 1988.
- [56] JAX: Composable Transformations of Python+NumPy Programs. GitHub repository.
- [57] Scaling Collapse Reveals Universal Dynamics in Compute-Optimally Trained Neural Networks. 2025.
- [58] L2 Regularization versus Batch and Weight Normalization. 2017.
- [59] Norm Matters: Efficient and Accurate Normalization Schemes in Deep Networks. 2019.
- [60] Reconciling Modern Deep Learning with Traditional Optimization Analyses: The Intrinsic Learning Rate. 2020.
- [61] Evaluating the Robustness of Chinchilla Compute-Optimal Scaling. 2025.
- [62] SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 2020.
- [63] Language Models Scale Reliably with Over-Training and on Downstream Tasks. 2024.
- [64] Revisiting Neural Scaling Laws in Language and Vision. 2022.
- [65] Learning Curves: Asymptotic Values and Rate of Convergence. Advances in Neural Information Processing Systems 6.
discussion (0)