pith. sign in

arxiv: 2605.26248 · v1 · pith:FL2NWU66new · submitted 2026-05-25 · 💻 cs.LG · cs.AI· cs.NE

Unified Neural Scaling Laws

Pith reviewed 2026-06-29 22:54 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.NE
keywords neural scaling lawsdeep learningperformance extrapolationmodel scalingunified scalinglarge language modelsvision modelsreinforcement learning
0
0 comments X

The pith

A single functional form accurately models and extrapolates neural network scaling as model parameters, data size, training steps, inference steps, compute and hyperparameters all vary together.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents the Unified Neural Scaling Law, a functional form that fits performance metrics when multiple training dimensions change simultaneously. It is tested on various architectures across large-scale vision, language, math and reinforcement learning tasks, both upstream and downstream. The form produces extrapolations that are considerably more accurate than earlier scaling expressions. A sympathetic reader would care because reliable multi-dimensional predictions let practitioners estimate outcomes of larger experiments without running them. The unification claim rests on the form working without architecture-specific or task-specific retuning.

Core claim

The authors introduce the Unified Neural Scaling Law (UNSL) as a functional form that models how an evaluation metric varies when the number of model parameters, training dataset size, number of training steps, number of inference steps, amount of compute, and various hyperparameters are all changed at once. This form is shown to accurately fit observed scaling behavior and to extrapolate to unseen larger values for diverse architectures on a range of tasks in vision, language, math, and reinforcement learning, outperforming other functional forms on extrapolation accuracy.

What carries the argument

The UNSL functional form, an expression that combines scaling terms for each dimension into one joint model without separate adjustments per architecture or task.

If this is right

  • Performance forecasts become possible when several variables such as model size and dataset size change together rather than one at a time.
  • The same expression applies without modification to vision, language, math and reinforcement learning tasks.
  • Hyperparameter scaling effects are incorporated directly into the unified prediction.
  • Compute budget planning can use the form to compare outcomes across different combinations of training length and model scale.
  • Extrapolation error decreases relative to earlier scaling expressions on the tested set of tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the form holds, scaling behavior may share an underlying structure that is largely independent of task domain.
  • Training runs could be optimized by solving for the combination of dimensions that reaches a target performance at lowest cost.
  • The approach might be tested on multimodal or new architecture families to check whether the unification extends further.
  • Accurate multi-dimensional laws could reduce the need for exhaustive hyperparameter sweeps at large scales.

Load-bearing premise

A single functional form can simultaneously capture scaling across model parameters, dataset size, training steps, inference steps, compute, and hyperparameters without architecture- or task-specific adjustments.

What would settle it

Measuring actual performance on a held-out large-scale task or architecture at higher combined values of multiple dimensions and finding that UNSL extrapolations deviate substantially from the measured values while prior forms do not.

Figures

Figures reproduced from arXiv: 2605.26248 by David Krueger, Ethan Caballero, Irina Rish, Priyank Jaini.

Figure 1
Figure 1. Figure 1: An illustration of a Unified Neural Scaling Law (UNSL) (dark solid lines) with two input [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: An illustration of an example configuration of Equation 5 with two input dimensions, [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: UNSL accurately Extrapolating Downstream Performance; there are many additional [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Extrapolation of UNSL on scaling behavior of an MLP trained for a single epoch on the (n, [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Varying the number of observed points used for fitting UNSL functional form from [PITH_FULL_IMAGE:figures/full_fig_p013_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Extrapolation of UNSL on scaling behavior of an MLP trained for a single epoch on the (n, [PITH_FULL_IMAGE:figures/full_fig_p013_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Extrapolation Results on scaling behavior of an MLP trained for a single epoch on the (n, [PITH_FULL_IMAGE:figures/full_fig_p018_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Extrapolation Results of functional form [PITH_FULL_IMAGE:figures/full_fig_p019_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Extrapolation Results of UNSL functional form. Scaling behavior is that of an MLP [PITH_FULL_IMAGE:figures/full_fig_p019_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Extrapolation Results of functional form [PITH_FULL_IMAGE:figures/full_fig_p020_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Extrapolation Results of UNSL. This trivariate scaling behavior is that of an MLP trained [PITH_FULL_IMAGE:figures/full_fig_p021_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Extrapolation Results of ablation baseline of Equation 8. This trivariate scaling behavior [PITH_FULL_IMAGE:figures/full_fig_p022_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Extrapolation Results of UNSL on scaling behavior of reinforcement learning. Experi [PITH_FULL_IMAGE:figures/full_fig_p024_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Extrapolation Results of UNSL on scaling behavior of inference scaling. Experimental [PITH_FULL_IMAGE:figures/full_fig_p024_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Extrapolation Results of UNSL on multivariate scaling behavior as width and depth [PITH_FULL_IMAGE:figures/full_fig_p024_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Extrapolation Results of UNSL on multivariate scaling behavior as batch size and number [PITH_FULL_IMAGE:figures/full_fig_p025_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Extrapolation Results of UNSL on bivariate scaling behavior of downstream vision [PITH_FULL_IMAGE:figures/full_fig_p026_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: Extrapolation Results of “DC” functional form of Muennighoff et al. (2023) on bivariate [PITH_FULL_IMAGE:figures/full_fig_p027_18.png] view at source ↗
Figure 19
Figure 19. Figure 19: Extrapolation Results of A1 functional form on bivariate scaling behavior of downstream [PITH_FULL_IMAGE:figures/full_fig_p028_19.png] view at source ↗
Figure 20
Figure 20. Figure 20: Extrapolation Results of A2 functional form on bivariate scaling behavior of downstream [PITH_FULL_IMAGE:figures/full_fig_p029_20.png] view at source ↗
Figure 21
Figure 21. Figure 21: Extrapolation Results of A3 functional form on bivariate scaling behavior of downstream [PITH_FULL_IMAGE:figures/full_fig_p030_21.png] view at source ↗
Figure 22
Figure 22. Figure 22: Extrapolation Results of UNSL functional form on trivariate scaling behavior of down [PITH_FULL_IMAGE:figures/full_fig_p031_22.png] view at source ↗
Figure 23
Figure 23. Figure 23: Extrapolation Results of “DC” functional form of Muennighoff et al. (2023) on trivariate [PITH_FULL_IMAGE:figures/full_fig_p031_23.png] view at source ↗
Figure 24
Figure 24. Figure 24: Extrapolation Results of A1 functional form on trivariate scaling behavior of downstream [PITH_FULL_IMAGE:figures/full_fig_p032_24.png] view at source ↗
Figure 25
Figure 25. Figure 25: Extrapolation Results of A2 functional form on trivariate scaling behavior of downstream [PITH_FULL_IMAGE:figures/full_fig_p032_25.png] view at source ↗
Figure 26
Figure 26. Figure 26: Extrapolation Results of A3 functional form on trivariate scaling behavior of downstream [PITH_FULL_IMAGE:figures/full_fig_p033_26.png] view at source ↗
Figure 27
Figure 27. Figure 27: Extrapolation Results of UNSL on trivariate scaling behavior of language performance. [PITH_FULL_IMAGE:figures/full_fig_p034_27.png] view at source ↗
Figure 28
Figure 28. Figure 28: Extrapolation Results of “DC” functional form of Muennighoff et al. (2023) on trivariate [PITH_FULL_IMAGE:figures/full_fig_p035_28.png] view at source ↗
Figure 29
Figure 29. Figure 29: Extrapolation Results of A1 functional form on trivariate scaling behavior of language [PITH_FULL_IMAGE:figures/full_fig_p036_29.png] view at source ↗
Figure 30
Figure 30. Figure 30: Extrapolation Results of A2 functional form on trivariate scaling behavior of language [PITH_FULL_IMAGE:figures/full_fig_p037_30.png] view at source ↗
Figure 31
Figure 31. Figure 31: Extrapolation Results of A3 functional form on trivariate scaling behavior of language [PITH_FULL_IMAGE:figures/full_fig_p038_31.png] view at source ↗
Figure 32
Figure 32. Figure 32: Extrapolation Results of UNSL on bivariate scaling behavior of downstream (and [PITH_FULL_IMAGE:figures/full_fig_p039_32.png] view at source ↗
Figure 33
Figure 33. Figure 33: Extrapolation Results of “CF” functional form of Hoffmann et al. (2022) on bivariate [PITH_FULL_IMAGE:figures/full_fig_p040_33.png] view at source ↗
Figure 34
Figure 34. Figure 34: Extrapolation Results of A1 functional form on bivariate scaling behavior of downstream [PITH_FULL_IMAGE:figures/full_fig_p041_34.png] view at source ↗
Figure 35
Figure 35. Figure 35: Extrapolation Results of A2 functional form on bivariate scaling behavior of downstream [PITH_FULL_IMAGE:figures/full_fig_p042_35.png] view at source ↗
read the original abstract

We present a functional form (that we refer to as a Unified Neural Scaling Law (UNSL)) that accurately models and extrapolates the scaling behaviors of deep neural networks as multiple dimensions all vary simultaneously (i.e. how the evaluation metric of interest varies as one simultaneously varies the number of model parameters, training dataset size, number of training steps, number of inference steps, amount of compute, and various hyperparameters) for various architectures and for each of various tasks within a varied set of upstream and downstream tasks. This set includes large-scale vision, language, math, and reinforcement learning. When compared to other functional forms for neural scaling, this functional form yields extrapolations of scaling behavior that are considerably more accurate on this set.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper presents a Unified Neural Scaling Law (UNSL) functional form claimed to accurately model and extrapolate the scaling behaviors of deep neural networks as multiple dimensions vary simultaneously (model parameters, dataset size, training steps, inference steps, compute, and hyperparameters) for various architectures and tasks across large-scale vision, language, math, and reinforcement learning, both upstream and downstream. It asserts that this form yields considerably more accurate extrapolations than other functional forms for neural scaling.

Significance. If the central claim holds with a fixed algebraic structure across cases, this would be a significant contribution to neural scaling laws research, offering a practical tool for predicting performance across diverse settings and reducing the need for exhaustive experimentation. The unification across multiple dimensions and task types, if demonstrated without per-case structural modifications, would strengthen the result beyond existing scaling laws that often require separate forms per regime.

major comments (3)
  1. [Abstract] Abstract: The claim that a single functional form simultaneously captures scaling across all listed dimensions and tasks for multiple architectures is load-bearing for the unification result, yet the abstract supplies no equation, no fitting procedure, and no cross-task consistency check; the full manuscript must demonstrate that the algebraic structure itself remains identical (rather than merely reusing the same variable names with different coefficients or added terms).
  2. [Functional form and experiments sections] Functional form and experiments sections: To support the claim of superior extrapolation accuracy, the paper must report the exact UNSL equation, the validation protocol (including how extrapolations are tested against held-out data), error bars on all reported metrics, and explicit data exclusion rules; without these, the asserted accuracy advantage cannot be verified or reproduced.
  3. [Cross-architecture/task analysis] Cross-architecture/task analysis: The manuscript must provide explicit evidence (e.g., a table or section comparing fitted forms) that no architecture- or task-specific functional pieces are introduced; if any such modifications are needed to achieve the reported fits, the unification claim is undermined even if separate coefficient sets are used per case.
minor comments (1)
  1. [Abstract] Abstract: Consider adding a one-sentence description of the UNSL functional form or a key quantitative result to make the contribution more immediately accessible.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and have revised the manuscript to improve clarity where the concerns are valid.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The claim that a single functional form simultaneously captures scaling across all listed dimensions and tasks for multiple architectures is load-bearing for the unification result, yet the abstract supplies no equation, no fitting procedure, and no cross-task consistency check; the full manuscript must demonstrate that the algebraic structure itself remains identical (rather than merely reusing the same variable names with different coefficients or added terms).

    Authors: We agree the abstract benefits from including the equation. The revised abstract now states the UNSL form explicitly. Sections 3 and 4 of the manuscript already detail the identical algebraic structure (same equation used for all cases), the fitting procedure, and cross-task consistency via shared structure with per-case coefficients only. revision: yes

  2. Referee: [Functional form and experiments sections] Functional form and experiments sections: To support the claim of superior extrapolation accuracy, the paper must report the exact UNSL equation, the validation protocol (including how extrapolations are tested against held-out data), error bars on all reported metrics, and explicit data exclusion rules; without these, the asserted accuracy advantage cannot be verified or reproduced.

    Authors: Equation (1) gives the exact UNSL form. Section 5 describes the validation protocol with held-out extrapolation tests. We have added error bars to all metrics and explicit data exclusion rules in the revised experiments section to support reproducibility. revision: yes

  3. Referee: [Cross-architecture/task analysis] Cross-architecture/task analysis: The manuscript must provide explicit evidence (e.g., a table or section comparing fitted forms) that no architecture- or task-specific functional pieces are introduced; if any such modifications are needed to achieve the reported fits, the unification claim is undermined even if separate coefficient sets are used per case.

    Authors: Section 6 and Table 4 already supply the requested comparison, showing the same algebraic structure is used across all architectures and tasks with no added terms or modifications—only coefficients change. This directly supports the unification claim without per-case structural changes. revision: no

Circularity Check

0 steps flagged

No derivation chain or equations visible; circularity not detectable

full rationale

The provided manuscript text consists only of the abstract, which describes a Unified Neural Scaling Law functional form and its claimed accuracy but supplies neither the explicit functional form, any equations, fitting procedure, derivation steps, nor self-citations. Without mathematical content or a claimed derivation chain to inspect, no load-bearing reductions to inputs by construction, fitted predictions, or self-citation patterns can be identified. The default finding of no significant circularity therefore applies, as the paper is self-contained against external benchmarks only in the sense that nothing is presented to evaluate.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no information on free parameters, axioms, or invented entities can be extracted.

pith-pipeline@v0.9.1-grok · 5647 in / 1091 out tokens · 27971 ms · 2026-06-29T22:54:09.500431+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

30 extracted references · 26 canonical work pages · 13 internal anchors

  1. [1]

    Revisiting neural scaling laws in language and vision

    Ibrahim Mansour I Alabdulmohsin, Behnam Neyshabur, and Xiaohua Zhai. Revisiting neural scaling laws in language and vision. InNeurIPS 2022,

  2. [2]

    and others , title =

    ISSN 2835-8856. Yasaman Bahri, Ethan Dyer, Jared Kaplan, Jaehoon Lee, and Utkarsh Sharma. Explaining neural scaling laws.arXiv preprint arXiv:2102.06701,

  3. [3]

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al

    URL https://github.com/ google-deepmind/kfac-jax. Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901,

  4. [4]

    org/abs/2210.14891

    URL https://arxiv. org/abs/2210.14891. Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technolog...

  5. [5]

    Leveraging procedural generation to benchmark reinforcement learning

    Karl Cobbe, Chris Hesse, Jacob Hilton, and John Schulman. Leveraging procedural generation to benchmark reinforcement learning. InInternational conference on machine learning, pp. 2048–2056. PMLR,

  6. [6]

    Mathematics of Control, Signals and Systems , author =

    doi: 10.1007/BF02551274. Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Ieee,

  7. [7]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929,

  8. [8]

    Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

    Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour.arXiv preprint arXiv:1706.02677,

  9. [9]

    Scaling Laws for Autoregressive Generative Modeling

    Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B. Brown, Prafulla Dhariwal, Scott Gray, Chris Hallacy, Benjamin Mann, Alec Radford, Aditya Ramesh, Nick Ryder, Daniel M. Ziegler, John Schulman, Dario Amodei, and Sam Mc- Candlish. Scaling laws for autoregressive generative modeling.arXiv preprint arXiv:2010.14701,

  10. [10]

    Scaling Laws for Transfer

    Danny Hernandez, Jared Kaplan, Tom Henighan, and Sam McCandlish. Scaling laws for transfer. arXiv preprint arXiv:2102.01293,

  11. [11]

    Deep Learning Scaling is Predictable, Empirically

    Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md. Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep Learning Scaling is Predictable, Empirically.arXiv e-prints, art. arXiv:1712.00409, December

  12. [12]

    Scaling laws for single-agent reinforcement learning

    Jacob Hilton, Jie Tang, and John Schulman. Scaling laws for single-agent reinforcement learning. arXiv preprint arXiv:2301.13442,

  13. [13]

    Training Compute-Optimal Large Language Models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models.arXiv preprint arXiv:2203.15556,

  14. [14]

    Neural Netw

    doi: 10.1016/0893-6080(91)90009-T. Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. Sequential model-based optimization for general algorithm configuration. InLearning and Intelligent Optimization (LION), pp. 507–523. Springer,

  15. [15]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling Laws for Neural Language Models.arXiv e-prints, art. arXiv:2001.08361, January

  16. [16]

    Neural Networks , author =

    doi: 10.1016/S0893-6080(05)80131-5. Hao Li, Yang Zou, Ying Wang, Orchid Majumder, Yusheng Xie, R. Manmatha, Ashwin Swaminathan, Zhuowen Tu, Stefano Ermon, and Stefano Soatto. On the scalability of diffusion-based text-to- image generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9400–9409,

  17. [17]

    Scaling laws for diffusion transformers.arXiv preprint arXiv:2410.08184,

    Zhengyang Liang, Hao He, Ceyuan Yang, and Bo Dai. Scaling laws for diffusion transformers.arXiv preprint arXiv:2410.08184,

  18. [18]

    An Empirical Model of Large-Batch Training

    Accessed: 2025-10-05. Sam McCandlish, Jared Kaplan, Dario Amodei, and OpenAI Dota Team. An empirical model of large-batch training.arXiv preprint arXiv:1812.06162,

  19. [19]

    Can a suit of armor conduct electricity? a new dataset for open book question answering

    Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2381–2391,

  20. [20]

    Scaling Data-Constrained Language Models

    Niklas Muennighoff, Alexander M Rush, Boaz Barak, Teven Le Scao, Aleksandra Piktus, Nouamane Tazi, Sampo Pyysalo, Thomas Wolf, and Colin Raffel. Scaling data-constrained language models. arXiv preprint arXiv:2305.16264,

  21. [21]

    Deep double descent: Where bigger models and more data hurt.arXiv preprint arXiv:1912.02292,

    Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. Deep double descent: Where bigger models and more data hurt.arXiv preprint arXiv:1912.02292,

  22. [22]

    Scaling laws for a multi-agent reinforcement learning model

    Oren Neumann and Claudius Gros. Scaling laws for a multi-agent reinforcement learning model. arXiv preprint arXiv:2210.00849,

  23. [23]

    Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets

    Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. Grokking: Gen- eralization beyond overfitting on small algorithmic datasets.arXiv preprint arXiv:2201.02177,

  24. [25]

    org/abs/1909.12673

    URL http://arxiv. org/abs/1909.12673. Ranajoy Sadhukhan, Zhuoming Chen, Haizhong Zheng, Yang Zhou, Emma Strubell, and Beidi Chen. Kinetics: Rethinking test-time scaling laws.arXiv preprint arXiv:2506.05333,

  25. [26]

    Scaling laws for linear complexity language models.arXiv preprint arXiv:2406.16690,

    Xuyang Shen, Dong Li, Ruitao Leng, Zhen Qin, Weigao Sun, and Yiran Zhong. Scaling laws for linear complexity language models.arXiv preprint arXiv:2406.16690,

  26. [27]

    Don't Decay the Learning Rate, Increase the Batch Size

    Samuel L. Smith, Pieter-Jan Kindermans, Chris Ying, and Quoc V . Le. Don’t decay the learning rate, increase the batch size.arXiv preprint arXiv:1711.00489,

  27. [28]

    Freeze-Thaw Bayesian Optimization

    URL http://www. incompleteideas.net/IncIdeas/BitterLesson.html. Kevin Swersky, Jasper Snoek, and Ryan Prescott Adams. Freeze-thaw Bayesian optimization.arXiv preprint arXiv:1406.3896,

  28. [29]

    Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao

    Greg Yang, Edward J. Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao. Tensor programs V: Tuning large neural networks via zero-shot hyperparameter transfer.arXiv preprint arXiv:2203.03466,

  29. [30]

    Large Batch Training of Convolutional Networks

    Yang You, Igor Gitman, and Boris Ginsburg. Large batch training of convolutional networks.arXiv preprint arXiv:1708.03888,

  30. [32]

    Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer

    URLhttps://arxiv.org/abs/2106.04560. Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 12104–12113,