arxiv: 2605.09154 · v1 · submitted 2026-05-09 · 💻 cs.LG

Recognition: 2 theorem links

· Lean Theorem

Predicting Large Model Test Losses with a Noisy Quadratic System

Chuning Li , Chris J. Maddison

Authors on Pith no claims yet

Pith reviewed 2026-05-12 04:35 UTC · model grok-4.3

classification 💻 cs.LG

keywords loss predictionscaling lawsbatch sizelarge language modelscompute efficiencynoisy quadraticpre-training optimization

0 comments

The pith

A noisy quadratic system predicts the test loss of large models from their size, batch size, and number of weight updates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a predictive model for the pre-training loss of large models using only the model size, batch size, and number of weight updates. This is the first such model that accounts for variations in batch size, and it projects losses more accurately than previous approaches when scaling to much larger compute budgets. Researchers can use it to select the best combination of model size, batch size, and steps that fit within given time, memory, or compute limits. Experiments show the chosen configurations perform close to the true optimum. The work argues that fitting loss curves directly is preferable to building ever-more-complex heuristic scaling laws.

Core claim

We show that the test loss follows a noisy quadratic system in the variables of model size N, batch size B, and number of updates K. Fitting this system to data from smaller-scale runs allows reliable prediction of loss at larger scales, even when batch size is not held constant, and outperforms the Chinchilla loss model for extrapolations up to 1000 times larger compute.

What carries the argument

Noisy quadratic system: a mathematical model that describes loss as a quadratic function of N, B, and K with added noise, used to fit and extrapolate training dynamics.

If this is right

The model can identify optimal N, B, K triples under constraints on time, memory, and compute.
Selected configurations achieve losses close to the ground-truth best performance.
It enables better planning for large training runs by forecasting losses without running them.
Loss prediction is positioned as a scalable alternative to complex heuristic laws.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the model holds, it could allow testing many batch size schedules in simulation before committing to full training.
Extensions might include incorporating other variables like learning rate schedules or data quality into the quadratic form.
Similar systems could be applied to predict not just loss but other metrics like downstream accuracy if correlations are strong.

Load-bearing premise

The fitted noisy quadratic system from small experiments remains valid when batch size is varied and when predicting at scales up to 1000 times larger in compute.

What would settle it

Training a model at a large extrapolated scale using the model's recommended N, B, K and measuring a test loss that differs substantially from the prediction.

Figures

Figures reproduced from arXiv: 2605.09154 by Chris J. Maddison, Chuning Li.

**Figure 1.** Figure 1: The Noisy Quadratic System (NQS) predicted LLM test losses on holdout data, outperforming Chinchilla’s loss model. Shown are “IsoFLOP” slices: data points arranged by their pretraining compute budget (selectively labeled next to the curves, in PetaFLOPs). Top: NQS successfully predicted the loss of LLMs over variations of model size N, holding constant the total FLOPs budget, at compute budgets up to 1000… view at source ↗

**Figure 2.** Figure 2: Chinchilla Method 3 is not a great predictive model. (a) Chinchilla Method 3 fitted on the entire Hoffmann IsoFLOPs dataset describes the data well. (b) Once the dataset is divided into Train/Holdout, the performance on the holdout slices deteriorated. For this figure, we used the original Hoffmann series of LLM data points upon which the Chinchilla method was developed (Besiroglu et al., 2024). First, we … view at source ↗

**Figure 3.** Figure 3: Accounting for the effect of LayerNorm by setting the learning rate γ ∝ 1 ||w||2 helps NQS fit large models trained with small batch sizes. but adjusting for the effect of LayerNorm is necessary for NQS to perform well on smaller batch sizes. From this point on, to experiment with the NQS, it is sufficient to rely on the simplified expressions of (5) and (6). 4.3. Inferring NQS parameters The procedure to… view at source ↗

**Figure 4.** Figure 4: NQS is robust in extrapolation. The x-axis represents gradually removing data points from the training set, starting from those training points with the highest compute budgets, so as to increase the compute gap between the training set and the holdout set. (a) In our experiments, Chinchilla can successfully extrapolate 20 times into higher compute budget. (b) NQS succeeded until the largest training run i… view at source ↗

**Figure 5.** Figure 5: The predictions of NQS can be used to select (N, B, K) under compound resource constraints. Each subplot: an IsoFLOP plane (C = 236 PF) with coordinates (x, y) representing N = x, B = y, K = 2.6 × 1014/(xy). The NQS model is trained on data with FLOP budget up to C = 147 PF. The red diamond is the default configuration, Chinchilla compute optimal model size trained at the Critical Batch Size (McCandlish et… view at source ↗

**Figure 6.** Figure 6: Chinchilla’s extrapolation performance on the original Hoffman dataset. The Hoffman dataset spans a smaller range than our Pythia + OWT2 dataset, and was less suitable for analysis of extrapolation performance. Nevertheless, the emerging trend is consistent with results on our LLM dataset. 13 [PITH_FULL_IMAGE:figures/full_fig_p013_6.png] view at source ↗

**Figure 7.** Figure 7: NQS performance on Llama + LM1B. The figure is based on a series of Llama-like models of sizes up to 0.5B and pre-training compute budgets between 5 PetaFLOPs and 1,000 PetaFLOPs. 14 [PITH_FULL_IMAGE:figures/full_fig_p014_7.png] view at source ↗

**Figure 8.** Figure 8: The NQS is robust to changes in the training set. Shown is a 90% confidence interval around the NQS predictions, constructed using 100 trials, each using a random subset (a 50% subsample) of the training runs to infer the NQS parameters [PITH_FULL_IMAGE:figures/full_fig_p015_8.png] view at source ↗

**Figure 9.** Figure 9: The NQS is robust to changes in the training set: the 90% confidence intervals are tight relative to the range of the predictions. The confidence intervals are constructed using 100 trials, each using a random subset (a 50% subsample) of the training runs to infer the NQS parameters. 15 [PITH_FULL_IMAGE:figures/full_fig_p015_9.png] view at source ↗

read the original abstract

We introduce a predictive model that estimates the pre-training loss of large models from model size (N), batch size (B) and number of weight updates (K). This is the first loss prediction model that can handle changing batch size. The model outperforms Chinchilla's loss model, a model of the test loss using the batch size and number of tokens, in terms of projecting the loss at extrapolated compute budgets (up to 1000 folds). A natural use of the model is to find optimal N, B, K configurations under explicit and compound resource constraints like time, memory and compute. In our experiments, the model-selected configurations are close to ground-truth optimal. Our work advocates for loss prediction as a better alternative to heuristic-based laws, which are growing in complexity. The implementation is available on https://github.com/chuningxdy/Noisy-Quadratic-System.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a workable noisy quadratic model for loss prediction that handles changing batch sizes and supports config optimization under constraints, but the 1000x extrapolation claims rest on unverified assumptions about the form holding far outside the fit range.

read the letter

The main thing to know is that this work models test loss as a noisy quadratic in model size N, batch size B, and update count K. It is the first such predictor that explicitly allows B to vary during training, unlike Chinchilla-style laws that treat batch size as fixed and work in tokens or compute. The authors show the model can be used to pick N, B, K triples that come close to ground-truth optima when resources like memory or wall time are constrained, and they report better extrapolation performance than Chinchilla up to 1000-fold larger budgets. The code is on GitHub, which is helpful for checking the implementation directly. That combination of variable-batch handling and downstream optimization use case is the clearest advance. The experiments apparently confirm that the selected configurations perform well on the runs they tested. The soft spot is the extrapolation step. Fitting the quadratic coefficients and noise terms on smaller runs and then projecting three orders of magnitude further assumes the quadratic surface, constant curvature, and noise structure remain accurate when B increases and total compute grows dramatically. The abstract gives no details on held-out validation at intermediate scales (say 10x or 100x) or on whether the fit parameters were chosen independently of the extrapolation test data. If the quadratic is only locally accurate, the reported gains over Chinchilla could be an artifact of the training domain rather than a general property. This is worth a serious referee for groups working on scaling laws or resource allocation in large-model training. A reader who wants a predictive tool instead of heuristic rules will find the formulation and the optimization examples useful, even if the current evidence for far extrapolation is thin. I would send it to review with a request for clearer validation splits and intermediate-scale checks before accepting the 1000x claims.

Referee Report

3 major / 3 minor

Summary. The paper introduces a noisy quadratic model for predicting pre-training test loss as a function of model size N, batch size B, and number of weight updates K. It claims to be the first loss prediction approach that explicitly handles varying batch sizes, outperforms Chinchilla's token-based scaling law when extrapolating to compute budgets up to 1000x larger, and enables selection of near-optimal N/B/K configurations under explicit resource constraints such as time, memory, and compute. Experiments reportedly show the model-selected configurations are close to ground-truth optima, with open-source code provided.

Significance. If the extrapolation and optimality claims hold under proper validation, the work would offer a more flexible, batch-size-aware alternative to existing heuristic scaling laws, potentially improving practical hyperparameter selection for large-model training. The open-source implementation and focus on compound constraints are positive contributions that could facilitate follow-up work on loss prediction as a tool for efficient scaling.

major comments (3)

[§4.2 and §5.1] §4.2 and §5.1: The extrapolation experiments to 1000x compute budgets report outperformance over Chinchilla but provide no held-out validation results at intermediate scales (e.g., 10x or 100x) using the same fitted quadratic coefficients; without this, it is impossible to confirm that the quadratic form in (N, B, K) remains accurate outside the fitting regime rather than overfitting small-scale noise.
[§3.1, Eq. (3)–(5)] §3.1, Eq. (3)–(5): The noisy quadratic system is fitted to loss trajectories from smaller runs, yet the manuscript does not specify the train/validation split used for coefficient estimation versus the extrapolation test sets; this leaves open the possibility that reported gains are partly driven by in-sample fitting rather than genuine out-of-distribution prediction.
[Table 3 and §5.3] Table 3 and §5.3: The near-optimal configuration results lack error bars, multiple random seeds, or statistical tests comparing model-selected N/B/K against ground-truth optima; the claim that selections are “close to ground-truth optimal” therefore cannot be assessed for robustness across runs.

minor comments (3)

[§3] The notation for the noise term and the precise definition of the quadratic coefficients (e.g., whether they are constant or depend on B) is introduced without a clear summary table; adding one would improve readability.
[Figure 2] Figure 2 caption does not state the exact compute budgets or number of runs used for the extrapolation curves, making direct comparison with Chinchilla difficult.
[Abstract and §5.1] The abstract states “up to 1000 folds” extrapolation but the main text does not list the precise maximum scale factor achieved in each experiment; this minor inconsistency should be aligned.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful comments, which help improve the clarity and rigor of our work. We address each major comment below.

read point-by-point responses

Referee: [§4.2 and §5.1] §4.2 and §5.1: The extrapolation experiments to 1000x compute budgets report outperformance over Chinchilla but provide no held-out validation results at intermediate scales (e.g., 10x or 100x) using the same fitted quadratic coefficients; without this, it is impossible to confirm that the quadratic form in (N, B, K) remains accurate outside the fitting regime rather than overfitting small-scale noise.

Authors: We agree that intermediate-scale validation would provide stronger evidence for the generalizability of the noisy quadratic model. In the original experiments, the model was fitted on small-scale runs and directly tested on much larger scales to demonstrate extrapolation. To address this concern, we will include additional held-out predictions at 10x and 100x compute budgets using the same fitted coefficients in the revised manuscript, along with comparisons to ground-truth losses where available. revision: yes
Referee: [§3.1, Eq. (3)–(5)] §3.1, Eq. (3)–(5): The noisy quadratic system is fitted to loss trajectories from smaller runs, yet the manuscript does not specify the train/validation split used for coefficient estimation versus the extrapolation test sets; this leaves open the possibility that reported gains are partly driven by in-sample fitting rather than genuine out-of-distribution prediction.

Authors: We apologize for the omission. The fitting procedure uses a specific set of small-scale training runs for estimating the quadratic coefficients, with separate larger runs reserved for extrapolation testing. We will clarify this split explicitly in Section 3.1 of the revised manuscript, including details on which runs were used for fitting versus evaluation to ensure transparency. revision: yes
Referee: [Table 3 and §5.3] Table 3 and §5.3: The near-optimal configuration results lack error bars, multiple random seeds, or statistical tests comparing model-selected N/B/K against ground-truth optima; the claim that selections are “close to ground-truth optimal” therefore cannot be assessed for robustness across runs.

Authors: We acknowledge that reporting variability across seeds would strengthen the results. The current experiments used single runs for each configuration due to computational constraints, but we will add error bars from multiple random seeds in the revised version of Table 3 and Section 5.3, along with a brief statistical comparison where feasible. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the noisy quadratic loss prediction model

full rationale

The paper defines a noisy quadratic system as a functional form for test loss in terms of N, B, and K, fits its parameters on smaller-scale experimental runs, and uses the resulting model to project losses at larger extrapolated compute budgets with variable batch sizes. This is a standard empirical scaling-law construction with independent predictive content; the extrapolation step does not reduce to the fitting inputs by definition or by renaming a fitted quantity as a prediction. No self-citations are load-bearing, no uniqueness theorems are invoked, and the outperformance claim versus Chinchilla is presented as a direct empirical comparison on held-out large-scale regimes. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 0 axioms · 0 invented entities

The central claim rests on the assumption that loss dynamics follow a noisy quadratic form in N, B, K space whose parameters can be fitted once and then used for extrapolation; no independent derivation or external benchmarks are mentioned in the abstract.

free parameters (1)

quadratic coefficients and noise parameters
The noisy quadratic system requires several coefficients that must be fitted to observed loss curves.

pith-pipeline@v0.9.0 · 5444 in / 1192 out tokens · 25169 ms · 2026-05-12T04:35:41.291434+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We model LLM test losses as the expected risk of stochastic optimization on a quadratic loss surface Q_NQS(w) ... Assumptions 4.1-4.3 on power-law spectra for bias/variance.
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean costAlphaLog_fourth_deriv_at_zero unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

NQS expression (5) with E_app, E_bias, E_var terms obtained from SGD updates on eigen-directions of H.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · 6 internal anchors

[1]

The American Mathematical Monthly , volume=

An elementary view of Euler's summation formula , author=. The American Mathematical Monthly , volume=. 1999 , publisher=

work page 1999
[2]

Training Compute-Optimal Large Language Models

Training Compute-Optimal Large Language Models , author =. arXiv preprint arXiv:2203.15556 , year =. doi:10.48550/arXiv.2203.15556 , url =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2203.15556
[3]

2017 , eprint=

Deep Learning Scaling is Predictable, Empirically , author=. 2017 , eprint=

work page 2017
[4]

2020 , eprint=

Scaling Laws for Neural Language Models , author=. 2020 , eprint=

work page 2020
[5]

2020 , eprint=

Language Models are Few-Shot Learners , author=. 2020 , eprint=

work page 2020
[6]

2024 , eprint=

Reconciling Kaplan and Chinchilla Scaling Laws , author=. 2024 , eprint=

work page 2024
[7]

arXiv:2404.10102 , year=

Chinchilla scaling: A replication attempt , author =. arXiv preprint arXiv:2404.10102 , year =

work page arXiv
[8]

2025 , eprint=

Resolving Discrepancies in Compute-Optimal Scaling of Language Models , author=. 2025 , eprint=

work page 2025
[9]

DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

DeepSeek LLM: Scaling open-source language models with longtermism , author =. arXiv preprint arXiv:2401.02954 , year =

work page internal anchor Pith review Pith/arXiv arXiv
[10]

LLaMA: Open and Efficient Foundation Language Models

LLaMA: Open and efficient foundation language models , author =. arXiv preprint arXiv:2302.13971 , year =

work page internal anchor Pith review Pith/arXiv arXiv
[11]

2019 , institution =

Language Models are Unsupervised Multitask Learners , author =. 2019 , institution =

work page 2019
[12]

Qwen2.5 Technical Report

Qwen2.5 Technical Report , author =. arXiv preprint arXiv:2412.15115 , year =

work page internal anchor Pith review Pith/arXiv arXiv
[13]

2024 , howpublished =

The. 2024 , howpublished =

work page 2024
[14]

2019 , eprint=

Which Algorithmic Choices Matter at Which Batch Sizes? Insights From a Noisy Quadratic Model , author=. 2019 , eprint=

work page 2019
[15]

GPT-4 Technical Report

Gpt-4 technical report , author=. arXiv preprint arXiv:2303.08774 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[16]

Journal of Machine Learning Research , year =

James Martens , title =. Journal of Machine Learning Research , year =

work page
[17]

2019 , eprint=

Measuring the Effects of Data Parallelism on Neural Network Training , author=. 2019 , eprint=

work page 2019
[18]

2025 , eprint=

Improved Scaling Laws in Linear Regression via Data Reuse , author=. 2025 , eprint=

work page 2025
[19]

2025 , eprint=

Power Lines: Scaling Laws for Weight Decay and Batch Size in LLM Pre-training , author=. 2025 , eprint=

work page 2025
[20]

2025 , eprint=

Small Batch Size Training for Language Models: When Vanilla SGD Works, and Why Gradient Accumulation Is Wasteful , author=. 2025 , eprint=

work page 2025
[21]

An Empirical Model of Large-Batch Training

An empirical model of large-batch training , author =. arXiv preprint arXiv:1812.06162 , year =

work page Pith review arXiv
[22]

Zhang, D

How does critical batch size scale in pre-training? , author =. arXiv preprint arXiv:2410.21676 , year =

work page arXiv
[23]

2021 , eprint=

The Depth-to-Width Interplay in Self-Attention , author=. 2021 , eprint=

work page 2021
[24]

Investigating the Overlooked Hessian Structure: From

Qian-Yuan Tang and Yufei Gu and Yunfeng Cai and Mingming Sun and Ping Li and zhou Xun and Zeke Xie , booktitle=. Investigating the Overlooked Hessian Structure: From. 2025 , url=

work page 2025
[25]

2023 , eprint=

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling , author=. 2023 , eprint=

work page 2023
[26]

Transformers: State-of-the-Art Natural Language Processing

Wolf, Thomas and Debut, Lysandre and Sanh, Victor and Chaumond, Julien and Delangue, Clement and Moi, Anthony and Cistac, Pierric and Rault, Tim and Louf, Remi and Funtowicz, Morgan and Davison, Joe and Shleifer, Sam and von Platen, Patrick and Ma, Clara and Jernite, Yacine and Plu, Julien and Xu, Canwen and Le Scao, Teven and Gugger, Sylvain and Drame, M...

work page doi:10.18653/v1/2020.emnlp-demos.6 2020
[27]

Computer Science , year =

OpenWebText Corpus , author =. Computer Science , year =

work page
[28]

2014 , eprint=

One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling , author=. 2014 , eprint=

work page 2014
[29]

C Users Journal , volume =

A New Algorithm for Data Compression , author =. C Users Journal , volume =

work page
[30]

S entence P iece: A simple and language independent subword tokenizer and detokenizer for neural text processing

SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing , author =. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations , pages =. 2018 , publisher =. doi:10.18653/v1/D18-2012 , url =

work page internal anchor Pith review doi:10.18653/v1/d18-2012 2018
[31]

2022 , eprint=

A Solvable Model of Neural Scaling Laws , author=. 2022 , eprint=

work page 2022
[32]

2025 , eprint=

How Feature Learning Can Improve Neural Scaling Laws , author=. 2025 , eprint=

work page 2025
[33]

2024 , eprint=

A Dynamical Model of Neural Scaling Laws , author=. 2024 , eprint=

work page 2024
[34]

2025 , eprint=

4+3 Phases of Compute-Optimal Neural Scaling Laws , author=. 2025 , eprint=

work page 2025
[35]

2025 , eprint=

Scaling Laws in Linear Regression: Compute, Parameters, and Data , author=. 2025 , eprint=

work page 2025
[36]

2024 , eprint=

The Quantization Model of Neural Scaling , author=. 2024 , eprint=

work page 2024
[37]

2018 , eprint=

On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport , author=. 2018 , eprint=

work page 2018
[38]

2020 , eprint=

Neural Tangent Kernel: Convergence and Generalization in Neural Networks , author=. 2020 , eprint=

work page 2020
[39]

Explaining neural scaling laws , volume=

Bahri, Yasaman and Dyer, Ethan and Kaplan, Jared and Lee, Jaehoon and Sharma, Utkarsh , year=. Explaining neural scaling laws , volume=. Proceedings of the National Academy of Sciences , publisher=. doi:10.1073/pnas.2311878121 , number=

work page doi:10.1073/pnas.2311878121
[40]

2025 , eprint=

Learning quadratic neural networks in high dimensions: SGD dynamics and scaling laws , author=. 2025 , eprint=

work page 2025
[41]

2025 , eprint=

Emergence and scaling laws in SGD learning of shallow neural networks , author=. 2025 , eprint=

work page 2025
[42]

2025 , eprint=

Scaling Laws for Optimal Data Mixtures , author=. 2025 , eprint=

work page 2025
[43]

2025 , eprint=

Scaling Data-Constrained Language Models , author=. 2025 , eprint=

work page 2025
[44]

2025 , eprint=

MixMin: Finding Data Mixtures via Convex Minimization , author=. 2025 , eprint=

work page 2025
[45]

2022 , eprint=

Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer , author=. 2022 , eprint=

work page 2022
[46]

2025 , eprint=

Don't be lazy: CompleteP enables compute-efficient deep transformers , author=. 2025 , eprint=

work page 2025
[47]

2024 , eprint=

Scaling Exponents Across Parameterizations and Optimizers , author=. 2024 , eprint=

work page 2024
[48]

2025 , eprint=

Scaling Optimal LR Across Token Horizons , author=. 2025 , eprint=

work page 2025
[49]

2025 , eprint=

A Multi-Power Law for Loss Curve Prediction Across Learning Rate Schedules , author=. 2025 , eprint=

work page 2025
[50]

2024 , eprint=

The Road Less Scheduled , author=. 2024 , eprint=

work page 2024
[51]

2017 , eprint=

Adam: A Method for Stochastic Optimization , author=. 2017 , eprint=

work page 2017
[52]

2019 , eprint=

Decoupled Weight Decay Regularization , author=. 2019 , eprint=

work page 2019
[53]

2025 , eprint=

Dimension-adapted Momentum Outscales SGD , author=. 2025 , eprint=

work page 2025
[54]

2023 , month = jun, day =

Edward Rees , title =. 2023 , month = jun, day =

work page 2023
[55]

Aplikace matematiky , volume =

Jan Ježek , title =. Aplikace matematiky , volume =. 1988 , url =

work page 1988
[56]

GitHub repository , year=

JAX: composable transformations of Python+NumPy programs , author=. GitHub repository , year=

work page
[57]

2025 , eprint=

Scaling Collapse Reveals Universal Dynamics in Compute-Optimally Trained Neural Networks , author=. 2025 , eprint=

work page 2025
[58]

2017 , eprint=

L2 Regularization versus Batch and Weight Normalization , author=. 2017 , eprint=

work page 2017
[59]

2019 , eprint=

Norm matters: efficient and accurate normalization schemes in deep networks , author=. 2019 , eprint=

work page 2019
[60]

2020 , eprint=

Reconciling Modern Deep Learning with Traditional Optimization Analyses: The Intrinsic Learning Rate , author=. 2020 , eprint=

work page 2020
[61]

2025 , eprint=

Evaluating the Robustness of Chinchilla Compute-Optimal Scaling , author=. 2025 , eprint=

work page 2025
[62]

Nature Methods , volume =

SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python , author =. Nature Methods , volume =. 2020 , doi =

work page 2020
[63]

2024 , eprint=

Language models scale reliably with over-training and on downstream tasks , author=. 2024 , eprint=

work page 2024
[64]

2022 , eprint=

Revisiting Neural Scaling Laws in Language and Vision , author=. 2022 , eprint=

work page 2022
[65]

Advances in Neural Information Processing Systems 6 , year=

Learning Curves: Asymptotic Values and Rate of Convergence , author=. Advances in Neural Information Processing Systems 6 , year=

work page