Smooth Scaling Laws Hide Stepwise Token Learning

Debing Zhang; Fu Guo; Peiru Yang; Pingjie Wang; Zechen Hu

arxiv: 2606.29858 · v1 · pith:UO6RGJ5Anew · submitted 2026-06-29 · 💻 cs.CL

Pith reviewed 2026-06-30 06:05 UTC · model grok-4.3

classification 💻 cs.CL

keywords scaling laws · token learning · language model pretraining · sigmoid fitting · learning time spectrum · validation loss · data distribution

The pith

The distribution of when individual tokens are learned governs the shape of scaling laws in language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Language model loss follows smooth power laws over model and data size, but the paper decomposes this regularity into discrete learning events for contextualized tokens. Fitting each token's loss trajectory to a sigmoid isolates localized transitions and assembles them into a learning-time spectrum. This spectrum quantitatively matches how validation loss changes with training steps, data volume, and model size across more than 100 large runs. The same spectrum can be used to reorder training data, producing an 11 percent faster drop in loss.

Core claim

By fitting sigmoid curves to the loss trajectories of individual contextualized tokens across large-scale pre-training runs, the authors obtain a learning-time spectrum that dominates the scaling-law shape. This spectrum quantitatively reconstructs the validation loss derivative along the training-step T, data-scale D, and model-scale M axes. Reshaping the training distribution according to when tokens become learnable alters the optimization trajectory and yields faster validation-loss reduction.

What carries the argument

The learning-time spectrum extracted by fitting sigmoids to individual token loss trajectories.
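
As a rough illustration of what such a fit could look like, the sketch below fits a decreasing sigmoid to synthetic per-token loss trajectories and accumulates the fitted loss drops by midpoint. Everything in it is an assumption made for illustration rather than the paper's procedure: the logistic-in-log-step form, the grid search, the function names `sigmoid_step`, `fit_sigmoid`, and `learning_time_spectrum`, and the synthetic trajectories.

```python
import math

def sigmoid_step(t, t0, w):
    """Decreasing logistic in log-step space: near 1 before t0, near 0 after."""
    z = (math.log(t) - math.log(t0)) / w
    return 1.0 / (1.0 + math.exp(z))

def fit_sigmoid(steps, losses, t0_grid, w_grid):
    """Grid-search midpoint t0 and width w; for each pair, solve the
    asymptote parameters (floor a, drop b) by closed-form least squares."""
    n = len(steps)
    mean_y = sum(losses) / n
    best = None
    for t0 in t0_grid:
        for w in w_grid:
            s = [sigmoid_step(t, t0, w) for t in steps]
            mean_s = sum(s) / n
            var_s = sum((si - mean_s) ** 2 for si in s)
            if var_s < 1e-12:
                continue  # sigmoid is flat on this step range; skip
            cov = sum((si - mean_s) * (yi - mean_y) for si, yi in zip(s, losses))
            b = cov / var_s
            a = mean_y - b * mean_s
            sse = sum((a + b * si - yi) ** 2 for si, yi in zip(s, losses))
            if best is None or sse < best[0]:
                best = (sse, t0, w, a, b)
    sse, t0, w, a, b = best
    return {"t0": t0, "w": w, "floor": a, "drop": b, "sse": sse}

def learning_time_spectrum(fits):
    """Total fitted loss drop accumulated at each fitted midpoint."""
    spectrum = {}
    for f in fits:
        spectrum[f["t0"]] = spectrum.get(f["t0"], 0.0) + f["drop"]
    return sorted(spectrum.items())

# Synthetic, noise-free trajectories (hypothetical tokens, not real data):
# (true midpoint, true width, floor, drop)
steps = [2 ** k for k in range(15)]
true_params = [(64.0, 0.3, 1.0, 4.0), (64.0, 0.5, 0.5, 2.0), (1024.0, 0.3, 2.0, 3.0)]
t0_grid = [2 ** (k / 2) for k in range(29)]
w_grid = [0.1, 0.2, 0.3, 0.5, 1.0]

fits = []
for t0, w, floor, drop in true_params:
    losses = [floor + drop * sigmoid_step(t, t0, w) for t in steps]
    fits.append(fit_sigmoid(steps, losses, t0_grid, w_grid))

spectrum = learning_time_spectrum(fits)
for midpoint, mass in spectrum:
    print(f"midpoint={midpoint:g}  total drop={mass:.3f}")
```

A grid search with a closed-form inner solve is used only to keep the sketch dependency-free; on real, noisy trajectories a nonlinear least-squares routine would be the more usual choice.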

If this is right

The spectrum reconstructs validation loss derivatives along the T, D, and M axes.
Reshaping the training distribution by token learnability times changes the optimization trajectory.
The approach produces an 11 percent faster reduction in validation loss.
Scaling laws are governed primarily by the distribution of token-level learning times rather than by a heavy-tailed difficulty spectrum.
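
A toy simulation can show how many abrupt per-token transitions could add up to a smooth power law. The sketch below is an idealization, not the paper's model: each token is assumed to drop from unit loss to zero instantly at its learning time, and the learning times are drawn from a Pareto distribution chosen purely for illustration, with a hypothetical tail exponent `alpha`. Under those assumptions the aggregate loss at step t is the fraction of tokens not yet learned, which decays as t to the power of minus `alpha`.

```python
import math
import random

def aggregate_loss(learning_times, step):
    """Fraction of tokens not yet learned at `step` (unit loss drop each)."""
    return sum(1 for tau in learning_times if tau > step) / len(learning_times)

def loglog_slope(xs, ys):
    """Ordinary least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den

rng = random.Random(0)
alpha = 0.5  # hypothetical tail exponent of the learning-time distribution
learning_times = [rng.paretovariate(alpha) for _ in range(100_000)]

steps = [2 ** k for k in range(1, 11)]
losses = [aggregate_loss(learning_times, t) for t in steps]
slope = loglog_slope(steps, losses)
print(f"fitted log-log slope: {slope:.3f} (idealized value: {-alpha})")
```

Each individual token here changes in a single step, yet the averaged curve is smooth on a log-log plot, which is the qualitative picture the title refers to.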

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The spectrum could be used to design data curricula that prioritize tokens according to their measured learnability times.
If token interference remains low, the same fitting procedure might reveal stepwise structure in other sequence modeling domains.
The method supplies a concrete way to test whether apparent smoothness in other scaling phenomena likewise hides discrete per-element transitions.
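
One way such a curriculum might be expressed is as a step-dependent sampling weight. The sketch below is entirely hypothetical: the Gaussian-in-log-time weighting, the `bandwidth` parameter, and the function name `sampling_weights` are assumptions for illustration, and neither the paper nor this review specifies how the reweighting was actually done.

```python
import math

def sampling_weights(learning_times, current_step, bandwidth=1.0):
    """Normalized weights that peak at tokens whose fitted learning time is
    close to `current_step` on a log scale (hypothetical scheme)."""
    raw = [
        math.exp(-0.5 * ((math.log(tau) - math.log(current_step)) / bandwidth) ** 2)
        for tau in learning_times
    ]
    total = sum(raw)
    return [r / total for r in raw]

# Hypothetical fitted learning times (in training steps) for four tokens.
learning_times = [10.0, 100.0, 1000.0, 10000.0]
weights_early = sampling_weights(learning_times, current_step=100.0)
weights_late = sampling_weights(learning_times, current_step=10000.0)
print([round(w, 4) for w in weights_early])
print([round(w, 4) for w in weights_late])
```

Early in training the weight concentrates on the token with learning time 100; late in training it moves to the token with learning time 10000.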

Load-bearing premise

Individual token loss trajectories can be modeled accurately and independently as sigmoids that capture true localized learning events without substantial interference from other tokens or context changes.

What would settle it

New scaling runs in which the derivatives reconstructed from the measured spectrum fail to match the observed changes in validation loss along the T, D, or M axes would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.29858 by Debing Zhang, Fu Guo, Peiru Yang, Pingjie Wang, Zechen Hu.

**Figure 2.** Representative token-level loss trajectories across three independent runs and the corre…

**Figure 3.** A. Empirical validation loss on the step axis and the aggregate loss implied by token-wise sigmoid fits for a 1.5B model trained on 300B tokens. Both exhibit a clear power-law regime after the initial transient stage. B. Learning pulses estimated from tokens grouped into 40 bins by learning time; after grouping by learning time, the pulse shapes remain highly similar across bins. C. Empirical learning-time…

**Figure 4.**

**Figure 5.** A. Training-loss trajectories from a 60-run model-scale sweep spanning 0.42 to 5.00 GFLOPs/token, used to construct the model-scale frontier. B. Frontier validation loss on the M axis together with its sigmoid fit; the resulting envelope again follows a clear power law. C. Learning-time spectrum, empirical loss derivative, sigmoid-implied derivative, power-law fit, and spectrum-dominated reconstruction on…

**Figure 6.** Intervening on the training distribution using the measured learning-time signal.

**Figure 7.** Alignment between token learning time and three macroscopic data regularities.

**Figure 8.** Synthetic learning dynamics under the uniform-…

**Figure 9.** Token-level loss trajectories on uniform-…

**Figure 10.** Scatter plots of token-level sigmoid parameters against learning time on uniform-…

**Figure 11.** Similarity of learning pulses across learning times on uniform-…

**Figure 12.** Reconstruction of synthetic aggregate dynamics from token-level sigmoid fits.

**Figure 13.** Overview of synthetic learning dynamics under the power-law…

Original abstract

Language model loss follows remarkably regular scaling laws over model and data size, yet it remains unclear why the aggregate loss should exhibit a power-law form. Existing explanations often attribute this regularity to a heavy-tailed spectrum of pattern difficulty in natural language, but this view has not been directly validated at token-level granularity in large-scale real-data training. We present a token-level framework that decomposes scaling laws into localized learning events of individual contextualized tokens. By fitting token loss trajectories with sigmoids, we show that token learning is concentrated in localized transitions, giving rise to a learning-time spectrum that dominates the scaling-law shape. Across more than one hundred pre-training runs on large and diverse real-language corpora with modern LLM architectures, scaling up to 6B parameters and 300B training tokens, the measured learning-time spectrum quantitatively reconstructs the validation loss derivative along the training-step $T$, data-scale $D$, and model-scale $M$ axes. We further show that the same signal is actionable: by reshaping the training distribution according to when tokens become learnable, we alter the optimization trajectory and achieve 11% faster validation-loss reduction. These results provide direct empirical evidence that scaling laws are governed primarily by the distribution of token-level learning times, and that this distribution can be used not only to explain scaling behavior but also to improve training performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Token-level sigmoid fits and an 11% training speedup are the concrete parts; the derivative reconstruction looks like it could follow from the math of averaging the fits.

There are two things to know. First, the authors fit sigmoids to per-token loss trajectories across more than 100 runs up to 6B parameters, extract a spectrum of learning times, and claim that this spectrum quantitatively accounts for how validation loss changes with steps, data, and model size. Second, they report that reweighting the training distribution according to those times yields an 11% faster drop in validation loss.

The scale of the experiments and the intervention are the parts that stand on their own. A hundred-plus runs on real corpora with modern architectures is enough to make the token-level observations worth looking at, and the 11% result is an external test that does not depend on the fits reproducing the curves they were fitted to.

The reconstruction claim is the softer spot. Validation loss is the average of the contextualized token losses, so once sigmoids are fitted to those trajectories their average will recover the aggregate curve and its derivatives by construction. That makes the reported matches along the T, D, and M axes less surprising and less diagnostic of a governing mechanism. The concern is sharper for the data-scale and model-scale axes, where any scale-dependent interference could be absorbed into the per-token parameters without being isolated. To carry more weight here, the paper would need to show that the spectrum predicts held-out behavior or that the sigmoid fits leave little residual structure.

The assumption that individual token losses behave as clean, independent sigmoids also has to hold in the presence of the usual context and batch interactions, which the abstract does not address in detail.

This is for people who work on scaling-law explanations or on making pretraining more efficient. The experimental volume and the practical result are enough to justify sending it to referees, even if the explanatory part needs tighter controls against the summation artifact.

Referee Report

2 major / 2 minor

Summary. The paper claims that fitting sigmoids to per-token loss trajectories across >100 pretraining runs (up to 6B parameters, 300B tokens) extracts a 'learning-time spectrum' that dominates scaling-law shape. This spectrum is asserted to quantitatively reconstruct validation-loss derivatives along the T, D, and M axes, and reshaping the training distribution according to token learnability yields an 11% faster validation-loss reduction.

Significance. If the reconstruction supplies evidence independent of the fitting process itself, the work would supply a token-granular, mechanistic account of why aggregate scaling laws are smooth and power-law-like, backed by unusually large-scale empirical coverage and an actionable downstream experiment. The scale of the experimental campaign is a clear strength.

major comments (2)

[Abstract / reconstruction claim] Abstract and reconstruction results: because validation loss is defined as the mean of the contextualized token losses, and each trajectory is independently fit to a sigmoid, summation of the fitted sigmoids necessarily recovers the aggregate loss curve and its derivative by algebraic construction. The manuscript must demonstrate that the spectrum extracted from one subset of runs or tokens predicts the derivatives on held-out scales, architectures, or data regimes without refitting, rather than merely reproducing the curves from which the spectrum was derived.
[actionable experiment] The 11% improvement experiment: the claim that reshaping the training distribution according to the spectrum alters the optimization trajectory requires explicit controls showing that the gain is not explained by changes in token frequency, total compute, or post-hoc selection of easier tokens. Without these, the result does not yet isolate the spectrum as the causal mechanism.

minor comments (2)

[Abstract / results] The abstract and results sections should report fit quality metrics (R², residual distributions), error bars on the spectrum parameters, and any correction for multiple testing or post-hoc token selection.
[methods] Notation for the per-token sigmoid parameters (midpoint, slope, asymptotes) should be introduced once and used consistently when discussing the spectrum.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which highlight important aspects of our claims regarding the reconstruction and the causal interpretation of the distribution-reshaping experiment. We address each major comment below and indicate the revisions we will make.

Referee: [Abstract / reconstruction claim] Abstract and reconstruction results: because validation loss is defined as the mean of the contextualized token losses, and each trajectory is independently fit to a sigmoid, summation of the fitted sigmoids necessarily recovers the aggregate loss curve and its derivative by algebraic construction. The manuscript must demonstrate that the spectrum extracted from one subset of runs or tokens predicts the derivatives on held-out scales, architectures, or data regimes without refitting, rather than merely reproducing the curves from which the spectrum was derived.

Authors: We agree that, for any fixed set of fitted trajectories, the mean of the sigmoids algebraically recovers the observed aggregate loss and its derivative. The manuscript's reconstruction claim is that the extracted learning-time spectrum (i.e., the distribution of sigmoid midpoints and widths) is the dominant factor shaping the observed derivatives when T, D, or M is varied. To address the concern about independence from the fitting process, we will add a new subsection with cross-validation experiments: the spectrum parameters will be estimated from a subset of runs (or tokens) at smaller scales and then used, without refitting individual trajectories, to predict the shape of the loss derivatives on held-out larger-scale runs. These results will be reported in the revised manuscript. revision: yes
Referee: [actionable experiment] The 11% improvement experiment: the claim that reshaping the training distribution according to the spectrum alters the optimization trajectory requires explicit controls showing that the gain is not explained by changes in token frequency, total compute, or post-hoc selection of easier tokens. Without these, the result does not yet isolate the spectrum as the causal mechanism.

Authors: We accept that the current presentation of the 11% result does not sufficiently isolate the spectrum as the causal factor. In the revision we will add matched-control experiments that (i) preserve the same token-frequency distribution and total compute budget while applying a non-learnability-based reweighting, (ii) compare against a baseline that simply up-weights tokens that happen to be easier under the original distribution, and (iii) report training curves with identical optimizer settings and data ordering except for the spectrum-derived weights. These controls will be included to demonstrate that the observed acceleration is attributable to the timing information in the spectrum rather than frequency or selection artifacts. revision: yes

Circularity Check

1 step flagged

Fitted sigmoid spectrum reconstructs aggregate loss derivative by direct summation

specific steps

fitted input called prediction [Abstract]
"By fitting token loss trajectories with sigmoids, we show that token learning is concentrated in localized transitions, giving rise to a learning-time spectrum that dominates the scaling-law shape. [...] the measured learning-time spectrum quantitatively reconstructs the validation loss derivative along the training-step $T$, data-scale $D$, and model-scale $M$ axes."

Validation loss equals the mean of the contextualized token losses. Once sigmoids are fitted to the individual trajectories, their average directly yields the aggregate loss curve (and its derivative) by summation. The reported 'quantitative reconstruction' therefore holds by construction from the same fitted quantities rather than providing independent confirmation that the distribution of learning times governs the scaling laws.

full rationale

The paper extracts a learning-time spectrum by fitting sigmoids to per-token loss trajectories, then claims this spectrum quantitatively reconstructs the validation-loss derivatives. Because validation loss is defined as the mean of those same token losses, averaging the fitted sigmoids necessarily recovers the aggregate curve and its derivative by algebraic construction. This matches the 'fitted_input_called_prediction' pattern: the reconstruction is statistically forced once the per-token fits are performed and does not supply independent evidence that the spectrum governs scaling-law shape. The 11% training-distribution experiment supplies limited external grounding, but the core reconstruction step remains tied to the fitted inputs, producing partial circularity.
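
The algebra behind this objection can be written out. The notation below is introduced here for illustration and is not taken from the paper: $N$ validation tokens, per-token loss $\ell_i(t)$ along one axis $t$ (standing in for $T$, $D$, or $M$), fitted sigmoid $\hat{\ell}_i(t)$ with floor $a_i$, drop $b_i$, midpoint $\tau_i$, and width $w_i$, and fit residual $r_i(t)$.

```latex
\begin{align}
L(t) &= \frac{1}{N}\sum_{i=1}^{N} \ell_i(t), \qquad
\ell_i(t) = \hat{\ell}_i(t) + r_i(t), \\
\hat{\ell}_i(t) &= a_i + b_i\,\sigma\!\left(\frac{\tau_i - t}{w_i}\right), \qquad
\sigma(z) = \frac{1}{1 + e^{-z}}, \\
L(t) - \frac{1}{N}\sum_{i=1}^{N} \hat{\ell}_i(t) &= \frac{1}{N}\sum_{i=1}^{N} r_i(t), \\
-\frac{d}{dt}\,\frac{1}{N}\sum_{i=1}^{N} \hat{\ell}_i(t)
  &= \frac{1}{N}\sum_{i=1}^{N} \frac{b_i}{w_i}\,
     \sigma'\!\left(\frac{\tau_i - t}{w_i}\right).
\end{align}
```

If the residuals $r_i$ are small on average, the mean of the fitted sigmoids matches $L(t)$ and its derivative regardless of what mechanism produced the trajectories. The non-trivial question is whether a reduced summary, such as the distribution of the $\tau_i$ alone, still predicts the derivative on data that was not used for the fits.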

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The framework rests on fitting independent sigmoids to each token's loss trajectory, introducing many free parameters per token and treating the resulting spectrum as explanatory without an independent generative model of why learning times follow the observed distribution.

free parameters (1)

per-token sigmoid parameters (midpoint, slope, asymptotes)
Each token's loss trajectory is fit independently with a sigmoid, yielding multiple fitted values per token that are then aggregated into the spectrum.

axioms (1)

domain assumption: Token losses evolve independently and can be modeled as isolated sigmoid transitions without significant cross-token interference during training.
Invoked when decomposing aggregate loss into per-token events and fitting sigmoids.

invented entities (1)

learning-time spectrum · no independent evidence
purpose: To explain and reconstruct scaling-law derivatives from token-level events.
Derived directly from the collection of fitted sigmoid midpoints; no independent falsifiable prediction outside the fitted data is stated.

pith-pipeline@v0.9.1-grok · 5772 in / 1499 out tokens · 28506 ms · 2026-06-30T06:05:48.132135+00:00 · methodology

Reference graph

Works this paper leans on

41 extracted references · 13 canonical work pages · 4 internal anchors

[1]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2001
[2]

Training compute-optimal large language models

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. InProceedings of the 36th International Conference on Neural Information Processing Systems, pages 30016–30030, 2022

2022
[3]

Language models scale reliably with over-training and on downstream tasks

Samir Yitzhak Gadre, Georgios Smyrnis, Vaishaal Shankar, Suchin Gururangan, Mitchell Wortsman, Rulin Shao, Jean Mercat, Alex Fang, Jeffrey Li, Sedrick Keh, et al. Language models scale reliably with over-training and on downstream tasks. InThe Thirteenth International Conference on Learning Representations, 2025

2025
[4]

Hutter, Learning curve theory, arXiv preprint arXiv:2102.04074 (2021)

Marcus Hutter. Learning curve theory.arXiv preprint arXiv:2102.04074, 2021

work page arXiv 2021
[5]

The quantization model of neural scaling.Advances in Neural Information Processing Systems, 36:28699–28722, 2023

Eric Michaud, Ziming Liu, Uzay Girit, and Max Tegmark. The quantization model of neural scaling.Advances in Neural Information Processing Systems, 36:28699–28722, 2023

2023
[6]

Roberts, and James Sully

Alexander Maloney, Daniel A. Roberts, and James Sully. A solvable model of neural scaling laws.arXiv preprint arXiv:2210.16859, 2022

work page arXiv 2022
[7]

Neural scaling laws rooted in the data distribution.arXiv preprint arXiv:2412.07942, 2024

Ari Brill. Neural scaling laws rooted in the data distribution.arXiv preprint arXiv:2412.07942, 2024

work page arXiv 2024
[8]

Learning curves theory for hierarchi- cally compositional data with power-law distributed features

Francesco Cagnetta, Hyunmo Kang, and Matthieu Wyart. Learning curves theory for hierarchi- cally compositional data with power-law distributed features. InInternational Conference on Machine Learning, pages 6149–6164. PMLR, 2025

2025
[9]

Zipf’s word frequency law in natural language: A critical review and future directions.Psychonomic bulletin & review, 21(5):1112–1130, 2014

Steven T Piantadosi. Zipf’s word frequency law in natural language: A critical review and future directions.Psychonomic bulletin & review, 21(5):1112–1130, 2014

2014
[10]

Zipf’s law revisited: Spoken dialog, linguistic units, parameters, and the principle of least effort.Psychonomic Bulletin & Review, 30(1):77–101, 2023

Guido M Linders and Max M Louwerse. Zipf’s law revisited: Spoken dialog, linguistic units, parameters, and the principle of least effort.Psychonomic Bulletin & Review, 30(1):77–101, 2023

2023
[11]

Yian Zhang, Alex Warstadt, Xiaocheng Li, and Samuel Bowman. When do you need billions of words of pretraining data? InProceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing (volume 1: long papers), pages 1112–1125, 2021. 10

2021
[12]

On the spectral bias of neural networks

Nasim Rahaman, Aristide Baratin, Devansh Arpit, Felix Draxler, Min Lin, Fred Hamprecht, Yoshua Bengio, and Aaron Courville. On the spectral bias of neural networks. InInternational conference on machine learning, pages 5301–5310. PMLR, 2019

2019
[13]

Curriculum learning for natural language understanding

Benfeng Xu, Licheng Zhang, Zhendong Mao, Quan Wang, Hongtao Xie, and Yongdong Zhang. Curriculum learning for natural language understanding. InProceedings of the 58th annual meeting of the association for computational linguistics, pages 6095–6104, 2020

2020
[14]

Doremi: Optimizing data mixtures speeds up language model pretraining.Advances in Neural Information Processing Systems, 36: 69798–69818, 2023

Sang Michael Xie, Hieu Pham, Xuanyi Dong, Nan Du, Hanxiao Liu, Yifeng Lu, Percy S Liang, Quoc V Le, Tengyu Ma, and Adams Wei Yu. Doremi: Optimizing data mixtures speeds up language model pretraining.Advances in Neural Information Processing Systems, 36: 69798–69818, 2023

2023
[15]

Efficient online data mixing for language model pre-training.arXiv preprint arXiv:2312.02406, 2023

Alon Albalak, Liangming Pan, Colin Raffel, and William Yang Wang. Efficient online data mixing for language model pre-training.arXiv preprint arXiv:2312.02406, 2023

work page arXiv 2023
[16]

Beyond random sampling: Efficient language model pretraining via curriculum learning

Yang Zhang, Amr Mohamed, Hadi Abdine, Guokan Shang, and Michalis Vazirgiannis. Beyond random sampling: Efficient language model pretraining via curriculum learning. InProceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5776–5794, 2026

2026
[17]

Resolving discrepancies in compute-optimal scaling of language models.Advances in Neural Information Processing Systems, 37:100535–100570, 2024

Tomer Porian, Mitchell Wortsman, Jenia Jitsev, Ludwig Schmidt, and Yair Carmon. Resolving discrepancies in compute-optimal scaling of language models.Advances in Neural Information Processing Systems, 37:100535–100570, 2024

2024
[18]

Revisiting scaling laws for language models: The role of data quality and training strategies

Zhengyu Chen, Siqi Wang, Teng Xiao, Yudong Wang, Shiqi Chen, Xunliang Cai, Junxian He, and Jingang Wang. Revisiting scaling laws for language models: The role of data quality and training strategies. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 23881–23899, 2025

2025
[19]

Functional scaling laws in kernel regression: Loss dynamics and learning rate schedules.arXiv preprint arXiv:2509.19189, 2025

Binghui Li, Fengling Chen, Zixun Huang, Lean Wang, and Lei Wu. Functional scaling laws in kernel regression: Loss dynamics and learning rate schedules.arXiv preprint arXiv:2509.19189, 2025

work page arXiv 2025
[20]

A multi-power law for loss curve prediction across learning rate schedules

Kairong Luo, Haodong Wen, Shengding Hu, Zhenbo Sun, Zhiyuan Liu, Maosong Sun, Kaifeng Lyu, and Wenguang Chen. A multi-power law for loss curve prediction across learning rate schedules. InThe Thirteenth International Conference on Learning Representations, 2025

2025
[21]

Physics of language models: Part 1, learning hierarchical language structures.arXiv preprint arXiv:2305.13673, 2023

Zeyuan Allen-Zhu and Yuanzhi Li. Physics of language models: Part 1, learning hierarchical language structures.arXiv preprint arXiv:2305.13673, 2023

work page arXiv 2023
[22]

On the origin of neural scaling laws: from random graphs to natural language.arXiv preprint arXiv:2601.10684, 2026

Maissam Barkeshli, Alberto Alfarano, and Andrey Gromov. On the origin of neural scaling laws: from random graphs to natural language.arXiv preprint arXiv:2601.10684, 2026

work page arXiv 2026
[23]

Quantifying memorization across neural language models

Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, and Chiyuan Zhang. Quantifying memorization across neural language models. InThe Eleventh International Conference on Learning Representations, 2022

2022
[24]

Pythia: A suite for analyzing large language models across training and scaling

Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International conference on machine learning, pages 2397–2430. PMLR, 2023

2023
[25]

Training trajectories of language models across scales

Mengzhou Xia, Mikel Artetxe, Chunting Zhou, Xi Victoria Lin, Ramakanth Pasunuru, Danqi Chen, Luke Zettlemoyer, and Veselin Stoyanov. Training trajectories of language models across scales. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13711–13738, 2023

2023
[26]

Not all tokens are what you need for pretraining.Advances in Neural Information Processing Systems, 37:29029–29063, 2024

Zhenghao Lin, Zhibin Gou, Yeyun Gong, Xiao Liu, Yelong Shen, Ruochen Xu, Chen Lin, Yujiu Yang, Jian Jiao, Nan Duan, et al. Not all tokens are what you need for pretraining.Advances in Neural Information Processing Systems, 37:29029–29063, 2024. 11

2024
[27]

Tyler A Chang, Zhuowen Tu, and Benjamin K Bergen. Characterizing learning curves during language model pre-training: Learning, forgetting, and stability.Transactions of the Association for Computational Linguistics, 12:1346–1362, 2024

2024
[28]

Language model behavioral phases are consistent across architecture, training data, and scale

James A Michaelov, Roger P Levy, and Ben Bergen. Language model behavioral phases are consistent across architecture, training data, and scale. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

2025
[29]

An exactly solvable model for emergence and scaling laws in the multitask sparse parity problem.Advances in Neural Information Processing Systems, 37:39632–39693, 2024

Yoonsoo Nam, Nayara Fonseca, Seok H Lee, Chris Mingard, and Ard A Louis. An exactly solvable model for emergence and scaling laws in the multitask sparse parity problem.Advances in Neural Information Processing Systems, 37:39632–39693, 2024

2024
[30]

Emergent Abilities of Large Language Models

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models.arXiv preprint arXiv:2206.07682, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[31]

Deep Learning Scaling is Predictable, Empirically

Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep learning scaling is predictable, empirically.arXiv preprint arXiv:1712.00409, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[32]

A hitchhiker’s guide to scaling law estimation

Leshem Choshen, Yang Zhang, and Jacob Andreas. A hitchhiker’s guide to scaling law estimation. InInternational Conference on Machine Learning, pages 10683–10699. PMLR, 2025

2025
[33]

A dynamical model of neural scaling laws

Blake Bordelon, Alexander Atanasov, and Cengiz Pehlevan. A dynamical model of neural scaling laws. InProceedings of the 41st International Conference on Machine Learning, pages 4345–4382, 2024

2024
[34]

DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, et al. Deepseek llm: Scaling open-source language models with longtermism.arXiv preprint arXiv:2401.02954, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[35]

Skill-it! a data-driven skills framework for understanding and training language models.Advances in Neural Information Processing Systems, 36:36000–36040, 2023

Mayee Chen, Nicholas Roberts, Kush Bhatia, Jue Wang, Ce Zhang, Frederic Sala, and Christo- pher Ré. Skill-it! a data-driven skills framework for understanding and training language models.Advances in Neural Information Processing Systems, 36:36000–36040, 2023

2023
[36]

Curriculum learning for small code language models

Marwa Naïr, Kamel Yamani, Lynda Lhadj, and Riyadh Baghdadi. Curriculum learning for small code language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop), pages 390–401, 2024

2024
[37]

Competence-based curriculum learning for neural machine translation

Emmanouil Antonios Platanios, Otilia Stretcu, Graham Neubig, Barnabas Poczos, and Tom Mitchell. Competence-based curriculum learning for neural machine translation. InProceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pages 1162–1...

2019
[38]

Frequency explains the inverse correlation of large language models’ size, training data amount, and surprisal’s fit to reading times

Byung-Doh Oh, Shisen Yue, and William Schuler. Frequency explains the inverse correlation of large language models’ size, training data amount, and surprisal’s fit to reading times. InPro- ceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2644–2663, 2024

2024
[39]

Learning in-context n-grams with transformers: Sub-n-grams are near-stationary points

Aditya Varre, Gizem Yüce, and Nicolas Flammarion. Learning in-context n-grams with transformers: Sub-n-grams are near-stationary points. InInternational Conference on Machine Learning, pages 60924–60963. PMLR, 2025

2025
[40]

Deriving neural scaling laws from the statistics of natural language.arXiv preprint arXiv:2602.07488, 2026

Francesco Cagnetta, Allan Raventós, Surya Ganguli, and Matthieu Wyart. Deriving neural scaling laws from the statistics of natural language.arXiv preprint arXiv:2602.07488, 2026

work page arXiv 2026
[41]

Physics of language models: Part 4.1, architecture design and the magic of canon layers

Zeyuan Allen-Zhu. Physics of language models: Part 4.1, architecture design and the magic of canon layers. arXiv preprint arXiv:2512.17351, 2025

2025

A Appendix

Contents

A Appendix
A.1 Why the token distribution in natural language is power-law
A.2 Alignment Between Learning Time and Macroscopic Data Consensus
...
