arxiv: 2606.29858 · v1 · pith:UO6RGJ5Anew · submitted 2026-06-29 · 💻 cs.CL

Smooth Scaling Laws Hide Stepwise Token Learning

Pith reviewed 2026-06-30 06:05 UTC · model grok-4.3

classification 💻 cs.CL
keywords scaling laws · token learning · language model pretraining · sigmoid fitting · learning time spectrum · validation loss · data distribution

The pith

The distribution of when individual tokens are learned governs the shape of scaling laws in language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Language model loss follows smooth power laws over model and data size, but the paper decomposes this regularity into discrete learning events for contextualized tokens. Fitting each token's loss trajectory to a sigmoid isolates localized transitions and assembles them into a learning-time spectrum. This spectrum quantitatively matches how validation loss changes with training steps, data volume, and model size across more than 100 large runs. The same spectrum can be used to reorder training data, producing an 11 percent faster drop in loss.

Core claim

By fitting sigmoid curves to the loss trajectories of individual contextualized tokens across large-scale pre-training runs, the authors obtain a learning-time spectrum that dominates the scaling-law shape. This spectrum quantitatively reconstructs the validation loss derivative along the training-step T, data-scale D, and model-scale M axes. Reshaping the training distribution according to when tokens become learnable alters the optimization trajectory and yields faster validation-loss reduction.

What carries the argument

The learning-time spectrum extracted by fitting sigmoids to individual token loss trajectories.
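
To make the fitting step concrete, here is a minimal sketch of fitting one token's loss trajectory with a sigmoid in log-step space. Everything in it is an assumption made for illustration: the four-parameter form (two asymptotes, a midpoint, a width), the grid-search fit, the synthetic noise-free trajectory, and the names `sigmoid_loss` and `fit_sigmoid`. The paper's actual parameterization and optimizer are not given on this page.

```python
import math

def sigmoid_loss(step, high, low, midpoint, width):
    """Loss that falls from `high` to `low` around `midpoint` (log-step units)."""
    z = (math.log(step) - midpoint) / width
    return low + (high - low) / (1.0 + math.exp(z))

def fit_sigmoid(steps, losses, midpoints, widths):
    """Grid-search midpoint and width; solve both asymptotes by least squares."""
    n = len(steps)
    y_mean = sum(losses) / n
    best = None
    for mu in midpoints:
        for w in widths:
            # Basis g(t) in [0, 1]: close to 1 before the transition, 0 after.
            g = [1.0 / (1.0 + math.exp((math.log(t) - mu) / w)) for t in steps]
            g_mean = sum(g) / n
            var = sum((gi - g_mean) ** 2 for gi in g)
            if var == 0.0:
                continue  # flat basis: asymptotes are not identifiable
            cov = sum((gi - g_mean) * (yi - y_mean) for gi, yi in zip(g, losses))
            drop = cov / var              # fitted value of (high - low)
            low = y_mean - drop * g_mean
            sse = sum((low + drop * gi - yi) ** 2 for gi, yi in zip(g, losses))
            if best is None or sse < best[0]:
                best = (sse, low + drop, low, mu, w)
    return best[1:]  # (high, low, midpoint, width)

# Hypothetical token: loss drops from 8.0 to 1.5 around log-step 6.0.
steps = [math.exp(0.25 * k) for k in range(1, 45)]
losses = [sigmoid_loss(t, 8.0, 1.5, 6.0, 0.5) for t in steps]

midpoints = [0.5 * k for k in range(1, 23)]
widths = [0.25, 0.5, 1.0, 2.0]
fit = fit_sigmoid(steps, losses, midpoints, widths)
```

Under these assumptions the fitted midpoint plays the role of the token's learning time, and a histogram of midpoints over many tokens would be one way to picture the spectrum. Real trajectories are noisy, so a practical fit would need a continuous optimizer and a fit-quality check.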

If this is right

  • The spectrum reconstructs validation loss derivatives along the T, D, and M axes.
  • Reshaping the training distribution by token learnability times changes the optimization trajectory.
  • The approach produces an 11 percent faster reduction in validation loss.
  • Scaling laws are governed primarily by the distribution of token-level learning times rather than by a heavy-tailed difficulty spectrum.
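
A toy calculation can show how localized per-token transitions add up to a smooth power law. The sketch below is an illustration under stated assumptions, not the paper's procedure: it assumes unit-height token sigmoids with one shared width, and it assumes learning times follow a Pareto distribution with tail exponent `ALPHA`. All names and constants are hypothetical.

```python
import math

ALPHA = 0.5       # assumed tail exponent of the learning-time distribution
WIDTH = 0.2       # assumed shared sigmoid width, in log-step units
N_TOKENS = 20000

# Deterministic Pareto quantiles: P(learning_time > t) = t ** -ALPHA for t >= 1.
log_learning_times = [
    -math.log(1.0 - (i + 0.5) / N_TOKENS) / ALPHA for i in range(N_TOKENS)
]

def aggregate_loss(step):
    """Mean of unit-height token sigmoids: the share of loss not yet learned."""
    x = math.log(step)
    total = sum(1.0 / (1.0 + math.exp((x - m) / WIDTH)) for m in log_learning_times)
    return total / N_TOKENS

def loglog_slope(step_a, step_b):
    """Slope of log(loss) against log(step) between two steps."""
    loss_a, loss_b = aggregate_loss(step_a), aggregate_loss(step_b)
    return (math.log(loss_b) - math.log(loss_a)) / (math.log(step_b) - math.log(step_a))

slope = loglog_slope(10.0, 100.0)
```

In this toy setting each token's curve is a sharp step, yet the aggregate decays with a log-log slope close to `-ALPHA`, because the fraction of tokens still unlearned at step t tracks the tail of the learning-time distribution.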

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The spectrum could be used to design data curricula that prioritize tokens according to their measured learnability times.
  • If token interference remains low, the same fitting procedure might reveal stepwise structure in other sequence modeling domains.
  • The method supplies a concrete way to test whether apparent smoothness in other scaling phenomena likewise hides discrete per-element transitions.

Load-bearing premise

Individual token loss trajectories can be modeled accurately and independently as sigmoids that capture true localized learning events without substantial interference from other tokens or context changes.

What would settle it

New scaling runs in which the derivatives reconstructed from the measured spectrum fail to match the observed changes in validation loss along the T, D, or M axes would falsify the central claim.
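
One way such a test could be scored is sketched below, assuming the observed loss curve and a spectrum-based derivative are both available at the same steps. The finite-difference estimator, the relative-mismatch metric, the synthetic curves, and the function names are all hypothetical choices for illustration; the page does not state the paper's own agreement criterion.

```python
import math

def log_derivative(steps, losses):
    """Central finite-difference estimate of dL / d(log step) at interior points."""
    return [
        (losses[i + 1] - losses[i - 1])
        / (math.log(steps[i + 1]) - math.log(steps[i - 1]))
        for i in range(1, len(steps) - 1)
    ]

def max_relative_mismatch(observed, reconstructed):
    """Largest |observed - reconstructed| relative to |observed|."""
    return max(
        abs(o - r) / abs(o) for o, r in zip(observed, reconstructed) if o != 0.0
    )

# Synthetic "observed" validation loss: power law plus an irreducible floor.
steps = [10 * 2 ** k for k in range(12)]
observed_loss = [5.0 * t ** -0.3 + 1.8 for t in steps]
observed_deriv = log_derivative(steps, observed_loss)

# Two hypothetical reconstructions evaluated at the interior steps.
interior = steps[1:-1]
matching = [-0.3 * 5.0 * t ** -0.3 for t in interior]      # right exponent
mismatched = [-0.5 * 5.0 * t ** -0.5 for t in interior]    # wrong exponent

good = max_relative_mismatch(observed_deriv, matching)
bad = max_relative_mismatch(observed_deriv, mismatched)
```

A reconstruction with the right exponent stays within the finite-difference error, while one with the wrong exponent drifts further from the observed derivative at later steps. Any pass/fail threshold would have to be fixed before looking at the held-out runs.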

Figures

Figures reproduced from arXiv: 2606.29858 by Debing Zhang, Fu Guo, Peiru Yang, Pingjie Wang, Zechen Hu.

Figure 1.
Figure 2. Representative token-level loss trajectories across three independent runs and the corre…
Figure 3. A. Empirical validation loss on the step axis and the aggregate loss implied by token-wise sigmoid fits for a 1.5B model trained on 300B tokens. Both exhibit a clear power-law regime after the initial transient stage. B. Learning pulses estimated from tokens grouped into 40 bins by learning time; after grouping by learning time, the pulse shapes remain highly similar across bins. C. Empirical learning-time…
Figure 4.
Figure 5. A. Training-loss trajectories from a 60-run model-scale sweep spanning 0.42 to 5.00 GFLOPs/token, used to construct the model-scale frontier. B. Frontier validation loss on the M axis together with its sigmoid fit; the resulting envelope again follows a clear power law. C. Learning-time spectrum, empirical loss derivative, sigmoid-implied derivative, power-law fit, and spectrum-dominated reconstruction on …
Figure 6. Intervening on the training distribution using the measured learning-time signal.
Figure 7. Alignment between token learning time and three macroscopic data regularities.
Figure 8. Synthetic learning dynamics under the uniform-…
Figure 9. Token-level loss trajectories on uniform-…
Figure 10. Scatter plots of token-level sigmoid parameters against learning time on uniform-…
Figure 11. Similarity of learning pulses across learning times on uniform-…
Figure 12. Reconstruction of synthetic aggregate dynamics from token-level sigmoid fits.
Figure 13. Overview of synthetic learning dynamics under the power-law…
Original abstract

Language model loss follows remarkably regular scaling laws over model and data size, yet it remains unclear why the aggregate loss should exhibit a power-law form. Existing explanations often attribute this regularity to a heavy-tailed spectrum of pattern difficulty in natural language, but this view has not been directly validated at token-level granularity in large-scale real-data training. We present a token-level framework that decomposes scaling laws into localized learning events of individual contextualized tokens. By fitting token loss trajectories with sigmoids, we show that token learning is concentrated in localized transitions, giving rise to a learning-time spectrum that dominates the scaling-law shape. Across more than one hundred pre-training runs on large and diverse real-language corpora with modern LLM architectures, scaling up to 6B parameters and 300B training tokens, the measured learning-time spectrum quantitatively reconstructs the validation loss derivative along the training-step $T$, data-scale $D$, and model-scale $M$ axes. We further show that the same signal is actionable: by reshaping the training distribution according to when tokens become learnable, we alter the optimization trajectory and achieve 11\% faster validation-loss reduction. These results provide direct empirical evidence that scaling laws are governed primarily by the distribution of token-level learning times, and that this distribution can be used not only to explain scaling behavior but also to improve training performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that fitting sigmoids to per-token loss trajectories across >100 pretraining runs (up to 6B parameters, 300B tokens) extracts a 'learning-time spectrum' that dominates scaling-law shape. This spectrum is asserted to quantitatively reconstruct validation-loss derivatives along the T, D, and M axes, and reshaping the training distribution according to token learnability yields an 11% faster validation-loss reduction.

Significance. If the reconstruction supplies evidence independent of the fitting process itself, the work would supply a token-granular, mechanistic account of why aggregate scaling laws are smooth and power-law-like, backed by unusually large-scale empirical coverage and an actionable downstream experiment. The scale of the experimental campaign is a clear strength.

major comments (2)
  1. [Abstract / reconstruction claim] Abstract and reconstruction results: because validation loss is defined as the mean of the contextualized token losses, and each trajectory is independently fit to a sigmoid, summation of the fitted sigmoids necessarily recovers the aggregate loss curve and its derivative by algebraic construction. The manuscript must demonstrate that the spectrum extracted from one subset of runs or tokens predicts the derivatives on held-out scales, architectures, or data regimes without refitting, rather than merely reproducing the curves from which the spectrum was derived.
  2. [actionable experiment] The 11% improvement experiment: the claim that reshaping the training distribution according to the spectrum alters the optimization trajectory requires explicit controls showing that the gain is not explained by changes in token frequency, total compute, or post-hoc selection of easier tokens. Without these, the result does not yet isolate the spectrum as the causal mechanism.
minor comments (2)
  1. [Abstract / results] The abstract and results sections should report fit quality metrics (R², residual distributions), error bars on the spectrum parameters, and any correction for multiple testing or post-hoc token selection.
  2. [methods] Notation for the per-token sigmoid parameters (midpoint, slope, asymptotes) should be introduced once and used consistently when discussing the spectrum.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which highlight important aspects of our claims regarding the reconstruction and the causal interpretation of the distribution-reshaping experiment. We address each major comment below and indicate the revisions we will make.

  1. Referee: [Abstract / reconstruction claim] Abstract and reconstruction results: because validation loss is defined as the mean of the contextualized token losses, and each trajectory is independently fit to a sigmoid, summation of the fitted sigmoids necessarily recovers the aggregate loss curve and its derivative by algebraic construction. The manuscript must demonstrate that the spectrum extracted from one subset of runs or tokens predicts the derivatives on held-out scales, architectures, or data regimes without refitting, rather than merely reproducing the curves from which the spectrum was derived.

    Authors: We agree that, for any fixed set of fitted trajectories, the mean of the sigmoids algebraically recovers the observed aggregate loss and its derivative. The manuscript's reconstruction claim is that the extracted learning-time spectrum (i.e., the distribution of sigmoid midpoints and widths) is the dominant factor shaping the observed derivatives when T, D, or M is varied. To address the concern about independence from the fitting process, we will add a new subsection with cross-validation experiments: the spectrum parameters will be estimated from a subset of runs (or tokens) at smaller scales and then used, without refitting individual trajectories, to predict the shape of the loss derivatives on held-out larger-scale runs. These results will be reported in the revised manuscript. revision: yes

  2. Referee: [actionable experiment] The 11% improvement experiment: the claim that reshaping the training distribution according to the spectrum alters the optimization trajectory requires explicit controls showing that the gain is not explained by changes in token frequency, total compute, or post-hoc selection of easier tokens. Without these, the result does not yet isolate the spectrum as the causal mechanism.

    Authors: We accept that the current presentation of the 11% result does not sufficiently isolate the spectrum as the causal factor. In the revision we will add matched-control experiments that (i) preserve the same token-frequency distribution and total compute budget while applying a non-learnability-based reweighting, (ii) compare against a baseline that simply up-weights tokens that happen to be easier under the original distribution, and (iii) report training curves with identical optimizer settings and data ordering except for the spectrum-derived weights. These controls will be included to demonstrate that the observed acceleration is attributable to the timing information in the spectrum rather than frequency or selection artifacts. revision: yes

Circularity Check

1 step flagged

Fitted sigmoid spectrum reconstructs aggregate loss derivative by direct summation

specific steps
  1. fitted input called prediction [Abstract]
    "By fitting token loss trajectories with sigmoids, we show that token learning is concentrated in localized transitions, giving rise to a learning-time spectrum that dominates the scaling-law shape. [...] the measured learning-time spectrum quantitatively reconstructs the validation loss derivative along the training-step $T$, data-scale $D$, and model-scale $M$ axes."

    Validation loss equals the mean of the contextualized token losses. Once sigmoids are fitted to the individual trajectories, their average directly yields the aggregate loss curve (and its derivative) by summation. The reported 'quantitative reconstruction' therefore holds by construction from the same fitted quantities rather than providing independent confirmation that the distribution of learning times governs the scaling laws.

full rationale

The paper extracts a learning-time spectrum by fitting sigmoids to per-token loss trajectories, then claims this spectrum quantitatively reconstructs the validation-loss derivatives. Because validation loss is defined as the mean of those same token losses, averaging the fitted sigmoids necessarily recovers the aggregate curve and its derivative by algebraic construction. This matches the 'fitted_input_called_prediction' pattern: the reconstruction is statistically forced once the per-token fits are performed and does not supply independent evidence that the spectrum governs scaling-law shape. The 11% training-distribution experiment supplies limited external grounding, but the core reconstruction step remains tied to the fitted inputs, producing partial circularity.
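
The algebra behind this flag can be written out in a few lines. The symbols are introduced here for illustration and are not taken from the paper: \ell_i(t) is the loss of token i at step t, \hat{s}_i(t) is its fitted sigmoid, r_i(t) is the fit residual, and N is the number of validation tokens.

```latex
\begin{align}
L(t) &= \frac{1}{N}\sum_{i=1}^{N} \ell_i(t),
\qquad \ell_i(t) = \hat{s}_i(t) + r_i(t), \\
\frac{dL}{dt} &= \frac{1}{N}\sum_{i=1}^{N} \frac{d\hat{s}_i}{dt}
  + \frac{1}{N}\sum_{i=1}^{N} \frac{dr_i}{dt}.
\end{align}
```

The first sum is the sigmoid-implied derivative and the second is the mean residual derivative. On the data used for fitting, the two sides agree whenever the residuals are small, so in-sample agreement measures fit quality. Only agreement on runs or tokens that were not used for fitting would count as a prediction.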

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The framework rests on fitting independent sigmoids to each token's loss trajectory, introducing many free parameters per token and treating the resulting spectrum as explanatory without an independent generative model of why learning times follow the observed distribution.

free parameters (1)
  • per-token sigmoid parameters (midpoint, slope, asymptotes)
    Each token's loss trajectory is fit independently with a sigmoid, yielding multiple fitted values per token that are then aggregated into the spectrum.
axioms (1)
  • domain assumption: Token losses evolve independently and can be modeled as isolated sigmoid transitions without significant cross-token interference during training.
    Invoked when decomposing aggregate loss into per-token events and fitting sigmoids.
invented entities (1)
  • learning-time spectrum (no independent evidence)
    purpose: To explain and reconstruct scaling-law derivatives from token-level events.
    Derived directly from the collection of fitted sigmoid midpoints; no independent falsifiable prediction outside the fitted data is stated.

pith-pipeline@v0.9.1-grok · 5772 in / 1499 out tokens · 28506 ms · 2026-06-30T06:05:48.132135+00:00 · methodology
