arxiv: 2606.25008 · v1 · pith:WGSDSQMPnew · submitted 2026-06-23 · 💻 cs.LG · cs.CL

Neural Scaling Universality: If Exponents Are Fixed, Time to Understand Coefficients

Pith reviewed 2026-06-25 23:44 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords neural scaling laws · power law exponents · universality class · softmax nonlinearity · representational superposition · transformer layers · scaling coefficients · compute optimal

The pith

The exponents of neural scaling laws are fixed by generic mechanisms, so the key to improvement is understanding the coefficients.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that scaling law exponents are set by broad mechanisms rather than by fine details of models or data. Specifically, loss decays as the one-third power of training time because of the softmax function's strong nonlinearity, inversely with width because of representational superposition, and inversely with depth because transformer layers act like an averaged ensemble. Together these define a fixed-exponent universality class for current large language models. Because the coefficients of these power laws respond to data and architecture choices, they determine the optimal model shape and the compute-optimal frontier. The next steps for better performance therefore lie in analyzing and optimizing those coefficients.

Core claim

The exponents of these power laws are fixed by generic mechanisms: a one-third time scaling due to the strong nonlinearity of Softmax, an inverse width scaling due to representational superposition, and an inverse depth scaling due to ensemble averaging of Transformer layers. These mechanisms are robust to a wide range of data structures and architectural details, placing current large language models in a universality class with fixed exponents. The coefficients, however, are expected to be sensitive to data and architecture details, and directly determine practical quantities such as the optimal model shape and the compute-optimal frontier.
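If the exponents are held fixed, estimating the coefficients becomes an ordinary linear least-squares problem. The sketch below assumes the additive form the paper fits in its appendix (Eq. 14): a time term in tau^(-1/3), a width term in 1/m, a depth term in 1/ell, and a constant L0. The function names, the synthetic data, and the coefficient values are all made up for illustration and are not taken from the paper.

```python
import numpy as np

def design_matrix(tau, m, ell):
    """Columns: tau^(-1/3), 1/m, 1/ell, and a constant for the irreducible loss L0."""
    tau, m, ell = (np.asarray(x, dtype=float) for x in (tau, m, ell))
    return np.column_stack([tau ** (-1.0 / 3.0), 1.0 / m, 1.0 / ell, np.ones_like(tau)])

def fit_coefficients(tau, m, ell, loss):
    """Least-squares estimate of (c_tau, c_m, c_ell, L0) with the exponents held fixed."""
    X = design_matrix(tau, m, ell)
    coef, *_ = np.linalg.lstsq(X, np.asarray(loss, dtype=float), rcond=None)
    return coef

# Synthetic, noise-free data generated from hypothetical coefficients.
rng = np.random.default_rng(0)
true_coef = np.array([5.0, 40.0, 3.0, 1.7])          # hypothetical c_tau, c_m, c_ell, L0
tau = rng.uniform(1e3, 1e6, size=200)                # training time (assumed: steps)
m = rng.choice([256, 512, 1024, 2048], size=200)     # width
ell = rng.choice([4, 8, 16, 32], size=200)           # depth
loss = design_matrix(tau, m, ell) @ true_coef

fitted = fit_coefficients(tau, m, ell, loss)
print(fitted)
```

With real training curves the fit would be noisy, and a robust loss or a fit in log space may be preferable; this sketch only shows why fixing the exponents makes the coefficients the sole unknowns.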

What carries the argument

The universality class of fixed scaling exponents generated by softmax nonlinearity, representational superposition, and transformer layer averaging.

If this is right

  • The coefficients determine the optimal model shape.
  • The coefficients determine the compute-optimal frontier.
  • Near-term performance improvements require understanding coefficients.
  • Current large language models share this universality class with fixed exponents.
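To make the first bullet concrete, here is a minimal sketch of how two coefficients could fix a model shape. It assumes the size-dependent loss is c_m/m + c_ell/ell and that the parameter budget is roughly n_params = k * m**2 * ell with k = 12, a common rule of thumb for non-embedding Transformer parameters. Both are assumptions of this sketch, as are the function names and numbers. Width and depth are treated as continuous.

```python
import numpy as np

def optimal_shape(c_m, c_ell, n_params, k=12.0):
    """Width m and depth ell minimizing c_m/m + c_ell/ell at fixed n_params = k*m^2*ell.

    Substituting ell = n_params / (k*m^2) and setting the derivative in m to zero
    gives m^3 = c_m * n_params / (2 * k * c_ell). k=12 is an assumed constant.
    """
    m = (c_m * n_params / (2.0 * k * c_ell)) ** (1.0 / 3.0)
    ell = n_params / (k * m ** 2)
    return m, ell

def size_loss(c_m, c_ell, m, ell):
    """Width plus depth contribution to the loss."""
    return c_m / m + c_ell / ell

c_m, c_ell, n_params = 40.0, 3.0, 1e9      # hypothetical values
m_star, ell_star = optimal_shape(c_m, c_ell, n_params)

# Brute-force check: scan widths along the same parameter budget.
widths = np.geomspace(64, 65536, 20001)
depths = n_params / (12.0 * widths ** 2)
m_grid = widths[np.argmin(size_loss(c_m, c_ell, widths, depths))]

print(m_star, ell_star, m_grid)
```

Under these assumptions the optimal aspect ratio is ell/m = 2*c_ell/c_m, which depends only on the coefficients, and both optimal width and depth grow as n_params**(1/3).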

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Architectures that modify softmax or avoid superposition might achieve different exponents and better scaling.
  • Systematic variation of data properties could map how coefficients change and suggest better training regimes.
  • Testing the mechanisms in non-transformer models could reveal if the universality class extends beyond current designs.

Load-bearing premise

The generic mechanisms remain robust to a wide range of data structures and architectural details.

What would settle it

Observing different scaling exponents in a model that uses softmax, superposition, and transformer layers on standard data would contradict the fixed-exponents claim.
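One simple way to run such a check on a single training run is to estimate the time exponent from a log-log fit and compare it with -1/3. The sketch below assumes the offset (irreducible loss plus any width and depth terms, which are constant at fixed model size) is known or has been estimated separately. The function name and the synthetic curve are hypothetical.

```python
import numpy as np

def estimate_time_exponent(tau, loss, irreducible_loss):
    """Slope of log(loss - irreducible_loss) against log(tau)."""
    tau = np.asarray(tau, dtype=float)
    excess = np.asarray(loss, dtype=float) - irreducible_loss
    if np.any(excess <= 0):
        raise ValueError("loss must exceed irreducible_loss everywhere")
    slope, _intercept = np.polyfit(np.log(tau), np.log(excess), 1)
    return slope

# Synthetic curve built with a one-third exponent (illustrative values only).
tau = np.geomspace(1e3, 1e6, 50)
synthetic_loss = 1.7 + 5.0 * tau ** (-1.0 / 3.0)
exponent = estimate_time_exponent(tau, synthetic_loss, irreducible_loss=1.7)
print(exponent)
```

A slope that stays clearly away from -1/3 across model families, after accounting for uncertainty in the offset, would be the kind of evidence this section describes. A misestimated offset can bias the slope, so the offset is best fitted jointly in practice.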

Figures

Figures reproduced from arXiv: 2606.25008 by Jeff Gore, Yizhou Liu.

Figure 1: Loss can be decomposed into three leading terms, one due to imperfect training, one due ...
Figure 2: The neural scaling universality class, defined by one-third time scaling, inverse width ...
Figure 3: At the optimal shape, determined by the coefficients, width and depth scaling laws lead ...
Figure 4: At the optimal token-to-parameter ratio or compute-optimal frontier, loss follows a one ...
Figure 5: Neural scaling universality extends the world of scaling behaviors and provides concrete ...
Figure 6: Coefficients and optimal ratios inferred from fitting vary across model families. (a) Time ...
Original abstract

Neural scaling laws describe how pre-training loss decays as power laws with training time, model size, and compute. This position paper argues that the exponents of these power laws are fixed by generic mechanisms: a one-third time scaling due to the strong nonlinearity of Softmax, an inverse width scaling due to representational superposition, and an inverse depth scaling due to ensemble averaging of Transformer layers. These mechanisms are robust to a wide range of data structures and architectural details, placing current large language models in a universality class with fixed exponents. The coefficients, however, are expected to be sensitive to data and architecture details, and directly determine practical quantities such as the optimal model shape and the compute-optimal frontier. We therefore argue that understanding the coefficients is the key to near-term performance improvements, and that a closer examination of the current universality class may reveal pathways to better universality classes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. This position paper argues that the exponents of neural scaling laws are fixed by generic mechanisms: a one-third time scaling due to the strong nonlinearity of Softmax, an inverse width scaling due to representational superposition, and an inverse depth scaling due to ensemble averaging of Transformer layers. These mechanisms are robust to a wide range of data structures and architectural details, placing current large language models in a universality class with fixed exponents. The coefficients are sensitive to data and architecture details and determine practical quantities such as the optimal model shape and the compute-optimal frontier. The paper argues that understanding the coefficients is key to near-term performance improvements and that examining the current universality class may reveal pathways to better ones.

Significance. If the proposed mechanisms hold, the paper offers a conceptual framework that could usefully redirect attention from exponent fitting to coefficient analysis and optimization, with potential implications for model shape selection and compute frontiers. It also raises the prospect of identifying improved universality classes. As a position paper, its significance rests on whether the heuristic arguments stimulate targeted empirical and theoretical follow-up rather than on new derivations or data.

major comments (2)
  1. [Abstract] The claim that the three mechanisms are 'robust to a wide range of data structures and architectural details' is asserted without explicit first-principles derivations that vary the data distribution, activation function, or architectural components while recovering the same exponents; this invariance is load-bearing for the universality-class statement.
  2. [Time-scaling mechanism section] The one-third exponent is attributed to Softmax nonlinearity via a qualitative argument, but no derivation is supplied that demonstrates the exponent remains unchanged when the nonlinearity is altered or when the loss landscape statistics are varied.
minor comments (2)
  1. The manuscript would benefit from a short table or explicit list contrasting the three mechanisms with the scaling variables (time, width, depth) they are claimed to govern.
  2. Notation for the coefficients (prefactors) could be introduced more formally when they are first distinguished from the exponents.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on this position paper. The feedback highlights important points about the strength of the universality claims. We address each major comment below and will make revisions to clarify the heuristic nature of the arguments while preserving the paper's intent to stimulate further research.

  1. Referee: [Abstract] The claim that the three mechanisms are 'robust to a wide range of data structures and architectural details' is asserted without explicit first-principles derivations that vary the data distribution, activation function, or architectural components while recovering the same exponents; this invariance is load-bearing for the universality-class statement.

    Authors: We agree that the abstract asserts robustness without supplying explicit first-principles derivations that test invariance under changes to data distributions, activation functions, or architectural components. As a position paper, the claims rest on heuristic reasoning drawn from known properties of the mechanisms rather than formal proofs of universality. We will revise the abstract to state that the mechanisms are proposed as generic on the basis of qualitative arguments and prior literature, and we will add a dedicated paragraph in the discussion section that explicitly flags the absence of such derivations and calls for targeted theoretical and empirical follow-up to test the claimed invariance. revision: yes

  2. Referee: [Time-scaling mechanism section] The one-third exponent is attributed to Softmax nonlinearity via a qualitative argument, but no derivation is supplied that demonstrates the exponent remains unchanged when the nonlinearity is altered or when the loss landscape statistics are varied.

    Authors: The observation is accurate: the time-scaling section offers a qualitative argument connecting the one-third exponent to the strong nonlinearity of Softmax but does not include a derivation establishing that the exponent is invariant when the nonlinearity or loss-landscape statistics are changed. We will revise the section to label the argument explicitly as heuristic, to note the lack of a formal invariance proof, and to outline concrete directions (e.g., controlled ablations or mean-field analyses) that would be needed to verify stability of the exponent under such variations. revision: yes

Circularity Check

0 steps flagged

No circularity: position paper advances qualitative mechanisms without self-referential derivations or fitted inputs renamed as predictions.

The manuscript is a position paper whose central argument consists of naming three mechanisms (Softmax nonlinearity for 1/3 time exponent, superposition for inverse-width, layer ensembling for inverse-depth) and asserting their robustness. No equations, parameter fits, or self-citations are supplied in the abstract or described structure that would reduce any claimed exponent to an input by construction. The universality-class statement is therefore an interpretive claim rather than a closed loop of the form 'exponent E is predicted from mechanism M which was itself fitted to E'. Because the text supplies no load-bearing self-citation chain, no ansatz smuggled via prior work, and no renaming of known results as new derivations, the derivation chain does not collapse. This is the expected outcome for a position paper that does not attempt quantitative first-principles derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that the three listed mechanisms are generic and robust without providing independent evidence or derivations in the abstract; no free parameters or invented entities are explicitly introduced.

axioms (1)
  • domain assumption: The listed mechanisms (Softmax nonlinearity, superposition, ensemble averaging) fix the scaling exponents and are robust across data structures and architectures.
    Invoked directly in the abstract to define the universality class.

A. Pythia models

We fit the scaling law

$$L = \frac{c_\tau}{\tau^{1/3}} + \frac{c_m}{m} + \frac{c_\ell}{\ell} + L_0 \tag{14}$$

to loss d...