A Boundary-Layer Mechanism for One-Third Scaling in Online Softmax Classification

Bernd Rosenow; Marcel Kühn; Yoon Thelge

REVIEW · 2 major objections · 3 minor · 37 references

In online softmax classification, only thin boundary layers near decision boundaries remain active at late times, producing generalization error that decays as training time to the minus one third.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric.

T0 review · grok-4.3

2026-05-22 07:00 UTC pith:RSJAPOJU

load-bearing objection: The paper derives a boundary-layer mechanism for the 1/3 scaling in online softmax generalization error.

arxiv 2605.22341 v1 pith:RSJAPOJU submitted 2026-05-21 cs.LG cond-mat.dis-nn

Marcel Kühn, Yoon Thelge, Bernd Rosenow

classification cs.LG cond-mat.dis-nn

keywords: softmax cross-entropy, online learning, teacher-student model, power-law scaling, generalization error, boundary layers, thermodynamic limit, learning-rate schedules

verification ladder T0 review T1 audit T2 compute T3 formal T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper isolates an asymptotic mechanism in a teacher-student model where softmax cross-entropy is used to train hard-label classification. After centering the logits by subtracting their mean, the thermodynamic-limit dynamics reduce to two order parameters: a growing student-teacher alignment D and a residual variance Delta kept nonzero by gradient noise. At late times, examples far from the teacher's decision boundaries are already classified with high confidence and contribute exponentially small gradients; only boundary layers of width proportional to one over D stay active. Solving the resulting closed equations yields a power-law decay of both test loss and generalization error epsilon_g as alpha to the minus one third, where alpha is training time. This scaling is slower than the Bayes-optimal reference of alpha to the minus one for the same model, and the authors show that learning-rate schedules can improve the decay toward alpha to the minus one half.

Core claim

After subtracting the mean logit, the thermodynamic-limit dynamics close in centered variables consisting of a growing centered student-teacher alignment D and the residual student variance Delta. At late times, examples away from teacher decision boundaries contribute exponentially little to the loss and gradients, so only boundary layers of width O(D^{-1}) remain active while noise from fixed-learning-rate online gradient descent maintains nonzero Delta. The late-time solution of these dynamics produces an alpha^{-1/3} power law for both the test loss and the generalization error epsilon_g (one minus test accuracy). Learning-rate schedules can improve the generalization error to an epsilon_g ~ alpha^{-1/2} power law.

What carries the argument

Boundary layers of width O(D^{-1}) that stay active at late times while noise sustains nonzero residual variance Delta in the centered order-parameter dynamics.
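The exponents quoted on this page can be tied together by a short scaling sketch. Two ingredients below are assumptions chosen to be consistent with those exponents, not equations quoted from the paper: a drift of the form dD/dalpha ∝ eta/D^2, and an error set by the ratio sqrt(Delta)/D. The relation Delta ∝ eta is taken from the Figure 3 caption reproduced further down this page.

```latex
\begin{align*}
\frac{dD}{d\alpha} \propto \frac{\eta}{D^{2}}
  \;&\Longrightarrow\; D^{3} \propto \eta\,\alpha
  \;\Longrightarrow\; D \propto \alpha^{1/3}, \\
\epsilon_g \propto \frac{\sqrt{\Delta}}{D},\quad \Delta \propto \eta
  \;&\Longrightarrow\; \epsilon_g \propto \alpha^{-1/3}
  \quad (\eta \text{ fixed}).
\end{align*}
```

Under these assumptions the slow growth of D, not any decay of Delta, sets the one-third exponent at fixed learning rate.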

Load-bearing premise

The thermodynamic-limit dynamics close exactly in centered variables after subtracting the mean logit, so that only alignment D and residual variance Delta matter and off-boundary examples contribute negligibly.

What would settle it

In long-time simulations of the online teacher-student softmax model with fixed learning rate, check whether the measured generalization error follows alpha to the power of negative one third rather than a different exponent.
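A minimal sketch of how such a check could be scripted, assuming the simulation yields arrays of training time and generalization error. The function name `fit_power_law_exponent`, the `tail_fraction` parameter, and the synthetic curve are illustrative assumptions, not code or data from the paper.

```python
import numpy as np

def fit_power_law_exponent(alpha, eps_g, tail_fraction=0.5):
    """Least-squares slope of log(eps_g) versus log(alpha) over the late-time tail."""
    alpha = np.asarray(alpha, dtype=float)
    eps_g = np.asarray(eps_g, dtype=float)
    # Keep only the last `tail_fraction` of the samples to avoid early transients.
    start = int(len(alpha) * (1.0 - tail_fraction))
    slope, _intercept = np.polyfit(np.log(alpha[start:]), np.log(eps_g[start:]), 1)
    return slope

# Synthetic stand-in for a measured learning curve (NOT data from the paper).
alpha = np.logspace(1, 6, 200)
eps_g = 0.8 * alpha ** (-1.0 / 3.0)

exponent = fit_power_law_exponent(alpha, eps_g)
print(f"fitted late-time exponent: {exponent:.3f}")
```

A fitted exponent that stays near -1/3 as the tail window is pushed to larger alpha would support the claim; a systematic drift would count against it.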

If this is right

Both test loss and generalization error epsilon_g decay as alpha^{-1/3} under fixed learning rate.
This scaling is slower than the Bayes-optimal alpha^{-1} for the same teacher-student setup.
Scheduled learning rates can improve the generalization error to an epsilon_g ~ alpha^{-1/2} power law.
Data structure can dominate early transients, but the boundary-layer mechanism governs the asymptotic regime.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same boundary-layer bottleneck may appear in other surrogate losses whenever hard labels are approximated by smooth functions.
If real data possess well-defined decision boundaries, this mechanism could set a lower bound on how fast classification error can improve with compute.
Controlled experiments with whitened features suggest that the scaling is robust once the model enters the late-time regime.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

The paper derives a boundary-layer mechanism for the 1/3 scaling in online softmax generalization error.

The key points are that the paper derives an asymptotic boundary-layer mechanism responsible for the alpha to the minus one third scaling of the generalization error in online softmax classification, and that this rate is slower than the Bayes-optimal one.

The paper's strength is showing how, after centering the logits, the dynamics reduce to a growing alignment D and a residual variance Delta that stays nonzero because of gradient noise. At late times the active examples lie only in thin layers of width one over D around the teacher boundaries, and the balance of drift and diffusion in those layers produces the scaling. The authors also show that changing the learning-rate schedule can move the decay toward alpha to the minus one half. The simulations match the order-parameter evolution and the resulting learning curves. The authors are clear that this is an asymptotic effect that data structure can mask during transients, and they provide controlled experiments to illustrate this.

On the soft spots: the main assumption is that the thermodynamic limit closes in just those centered variables, but the stress test suggests the scaling balance holds without internal contradictions, and the exponential suppression of the bulk contributions is preserved. It is not a flaw, but it is worth keeping in mind that this mechanism is specific to the online, fixed-learning-rate setting with this loss.

This work is for researchers interested in the dynamical origins of scaling laws in machine learning, particularly in teacher-student setups for classification. A reader who wants to understand why certain exponents appear in practice would get value from the concrete derivation. It deserves a serious referee because it offers a falsifiable prediction from the model dynamics and supports it with simulations. I would recommend sending it for peer review.

Referee Report

2 major / 3 minor

Summary. The paper analyzes online softmax cross-entropy training in a teacher-student multi-class classification model. After centering logits, the thermodynamic-limit dynamics close on two order parameters: growing student-teacher alignment D and residual variance Δ. At late times only O(D^{-1})-width boundary layers around the teacher decision boundaries remain active; fixed-learning-rate noise sustains nonzero Δ. This balance produces test loss and generalization error ε_g both scaling as α^{-1/3}, slower than the Bayes-optimal α^{-1} reference. Learning-rate schedules are shown to improve the decay toward α^{-1/2} scaling. Simulations confirm the predicted order-parameter trajectories and learning curves; controlled experiments with correlated inputs illustrate that data structure can dominate transients.

Significance. If the boundary-layer closure and scaling balance hold, the work supplies a concrete, mechanistic origin for a specific power-law exponent that arises directly from the surrogate-loss/hard-label mismatch in online gradient descent. The reduction to two centered variables, the explicit 1/D active-fraction argument, and the resulting α^{-1/3} prediction are falsifiable and complementary to spectral accounts of neural scaling. The demonstration that simple schedules improve the exponent to -1/2 and the discussion of data-structure transients add practical value.

major comments (2)

§3.2, Eq. (18)–(22): the thermodynamic-limit closure in centered variables (D, Δ) is asserted after subtracting the mean logit. The derivation of the drift and diffusion terms for the boundary layer must explicitly show that all higher-order moments and cross-correlations remain sub-leading when the active fraction is O(D^{-1}); otherwise the two-variable reduction is not closed at the order needed for the α^{-1/3} balance.
§4.1, Figure 3: the reported late-time exponent for ε_g is fitted over a limited α window. Because the claimed scaling is asymptotic, the manuscript should include a quantitative check (e.g., local slope versus α or extrapolation to infinite α) that rules out slower transients or crossover to the Bayes-optimal regime within the simulated range.

minor comments (3)

Notation: the symbol Δ is used both for residual variance and for the teacher-student overlap in some intermediate equations; a single consistent definition or explicit distinction would prevent confusion.
Figure 1 caption: the plotted curves are labeled “theory” but the caption does not state whether they are the exact solution of the two-variable ODE or a numerical integration; clarify the source of the solid lines.
Reference list: the discussion of spectral scaling laws cites only a subset of the recent literature; adding the most directly comparable teacher-student analyses would help readers locate the present mechanism within the broader literature.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. We address the points below and have revised the manuscript to incorporate clarifications and additional checks.

Referee: §3.2, Eq. (18)–(22): the thermodynamic-limit closure in centered variables (D, Δ) is asserted after subtracting the mean logit. The derivation of the drift and diffusion terms for the boundary layer must explicitly show that all higher-order moments and cross-correlations remain sub-leading when the active fraction is O(D^{-1}); otherwise the two-variable reduction is not closed at the order needed for the α^{-1/3} balance.

Authors: We appreciate the request for an explicit bound. In the revised §3.2 we add a dedicated paragraph deriving the moment scalings: outside the O(D^{-1}) layer the measure is exponentially small (O(e^{-cD})), while inside the layer the local fields remain O(1) and the width supplies an extra 1/D factor, so that all higher cumulants and cross-correlations are O(1/D) or smaller. These corrections are sub-dominant to the leading drift-diffusion balance that produces the α^{-1/3} scaling, thereby closing the two-variable system at the required order. revision: yes
Referee: §4.1, Figure 3: the reported late-time exponent for ε_g is fitted over a limited α window. Because the claimed scaling is asymptotic, the manuscript should include a quantitative check (e.g., local slope versus α or extrapolation to infinite α) that rules out slower transients or crossover to the Bayes-optimal regime within the simulated range.

Authors: We agree that a direct diagnostic of the asymptotic regime is useful. The revised Figure 3 now includes an inset plotting the local logarithmic slope d log ε_g / d log α versus α; the slope approaches −1/3 at the largest simulated α and shows no systematic drift toward −1. We also add a short table of effective exponents obtained from successive α windows, confirming convergence to the predicted value without detectable crossover in the accessible range. revision: yes
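The local-slope diagnostic discussed in the second exchange above can be sketched as follows. This is an illustrative finite-difference estimate, assuming arrays of training time and generalization error; the function name `local_log_slope` and the synthetic curve with a fast transient are assumptions, not the authors' code or data.

```python
import numpy as np

def local_log_slope(alpha, eps_g):
    """Finite-difference estimate of d log(eps_g) / d log(alpha) at each sample."""
    log_alpha = np.log(np.asarray(alpha, dtype=float))
    log_eps = np.log(np.asarray(eps_g, dtype=float))
    # np.gradient accepts the sample coordinates as its second argument.
    return np.gradient(log_eps, log_alpha)

# Synthetic curve: a fast alpha^(-1) transient on top of an alpha^(-1/3) tail.
# Illustrative only; NOT data from the paper.
alpha = np.logspace(0, 7, 400)
eps_g = 0.5 * alpha ** (-1.0 / 3.0) + 2.0 * alpha ** (-1.0)

slopes = local_log_slope(alpha, eps_g)
print(f"early slope: {slopes[0]:.2f}, late slope: {slopes[-1]:.3f}")
```

Plotting `slopes` against `alpha` on a logarithmic axis shows whether the effective exponent has settled or is still drifting within the simulated window.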

Circularity Check

0 steps flagged

No significant circularity in the derivation

The paper derives the late-time α^{-1/3} scaling for test loss and generalization error from the thermodynamic-limit closure of dynamics in centered variables D (growing alignment) and Δ (residual variance) after subtracting the mean logit. Only boundary layers of width O(D^{-1}) remain active due to exponential suppression of bulk contributions, with fixed-learning-rate noise maintaining nonzero Δ. The scaling follows from integrating the drift over the active fraction and balancing the resulting damping rate against diffusion, without any reduction to fitted parameters, self-definitional loops, or load-bearing self-citations. Simulations are invoked only for support, not as the source of the scaling itself. The analysis is self-contained within the online teacher-student model and positioned as complementary to spectral mechanisms.
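As a numerical illustration of how a drift weighted by a shrinking active fraction leads to slow growth of the alignment, the sketch below integrates a toy equation dD/dalpha = eta / D^2 with forward Euler. The toy drift, the function name `integrate_alignment`, and all parameter values are assumptions made for illustration; they are not the paper's closed equations.

```python
import numpy as np

def integrate_alignment(eta=1.0, d0=1.0, alpha_max=1e6, n_steps=20_000):
    """Forward-Euler integration of the toy drift dD/dalpha = eta / D**2."""
    # Log-spaced grid so that late times are covered with few steps.
    alphas = np.logspace(0, np.log10(alpha_max), n_steps)
    D = np.empty(n_steps)
    D[0] = d0
    for i in range(1, n_steps):
        d_alpha = alphas[i] - alphas[i - 1]
        D[i] = D[i - 1] + d_alpha * eta / D[i - 1] ** 2
    return alphas, D

alphas, D = integrate_alignment()

# Fit the growth exponent of D on the late-time tail only.
tail = alphas > 1e4
growth_exponent = np.polyfit(np.log(alphas[tail]), np.log(D[tail]), 1)[0]
print(f"late-time growth exponent of D: {growth_exponent:.3f}")
```

In this toy model the fitted exponent comes out close to 1/3, matching the closed-form solution D^3 = d0^3 + 3 eta (alpha - 1) of the assumed drift.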

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumptions of the teacher-student model in the thermodynamic limit and the dominance of boundary layers at late times, with no free parameters or invented entities explicitly introduced in the abstract.

axioms (2)

domain assumption Thermodynamic-limit dynamics close in centered variables after subtracting the mean logit
Stated in the abstract as the basis for the dynamics of D and Δ.
domain assumption At late times, only boundary layers of width O(D^{-1}) remain active while noise maintains nonzero Δ
Key assumption leading to the power-law solution.

pith-pipeline@v0.9.0 · 5786 in / 1479 out tokens · 74810 ms · 2026-05-22T07:00:53.932925+00:00 · methodology

Original abstract

Hard-label classification is usually trained with smooth surrogate losses, most prominently softmax cross-entropy. We isolate an asymptotic mechanism by which this mismatch between smooth surrogate and discrete labels produces power-law learning curves in an online teacher-student model. After subtracting the mean logit, the thermodynamic-limit dynamics close in centered variables: a growing centered student-teacher alignment $D$ and the residual student variance $\Delta$. At late times, examples away from teacher decision boundaries are already classified confidently and contribute exponentially little. Only boundary layers of width $O(D^{-1})$ remain active, while the noise of fixed-learning-rate online gradient descent maintains a nonzero $\Delta$. As a function of the training time $\alpha$ the late-time solution yields a $\alpha^{-1/3}$ power law not only for the test loss but also for the generalization error $\epsilon_g$, i.e., one minus test accuracy. This is much slower than the $\alpha^{-1}$ Bayes-optimal reference for the same model. We further show that learning-rate schedules can improve the generalization error towards a $\epsilon_g \sim \alpha^{-1/2}$ power law. Simulations support the predicted order parameter dynamics and learning curves. Controlled experiments with correlated Gaussian inputs and whitened pretrained features show that data structure can dominate transients. Therefore, our result is an asymptotic, complementary mechanism rather than an alternative to spectral explanations of neural scaling laws.

Figures

Figures reproduced from arXiv: 2605.22341 by Bernd Rosenow, Marcel Kühn, Yoon Thelge.

**Figure 1.** Left: The 1/3 law appears not only in the test loss but also in the generalization error ε_g, i.e., one minus test accuracy. Middle: The model captures both the growth of the centered student-teacher alignment D and the rotational alignment to the teacher. Right: Near a teacher decision boundary, the late-time loss is controlled by the student boundary layer of width O(D^{-1}). The generalization error is con…

**Figure 2.** Finite-N validation for fixed learning rates in the K = 3 online teacher–student model. The panels show the generalization error, centered overlap D, and residual variance Δ as functions of macroscopic time α = μ/N. The curves show representative seed trajectories, with envelopes indicating fluctuations across six simulation seeds. Within these fluctuations, the trajectories agree with the predicted power-…

**Figure 3.** Schedule dependence in the K = 3 online teacher–student model. For η(α) ∝ α^{-γ}, the theory predicts ε_g(α) ∼ α^{-(2+γ)/6} for 0 ≤ γ < 1. Increasing γ slows the growth of the centered overlap, D ∝ α^{(1-γ)/3} for γ < 1, but decreases the residual variance, Δ ∝ η(α); the latter effect improves the classification-error exponent. The γ = 1 curve is a borderline case, where the adiabatic approximation for Δ breaks …

**Figure 5.** Controlled departure from isotropic inputs. Inputs are Gaussian with diagonal covariance …

**Figure 6.** Dependence on the number of classes. Fixed-learning-rate simulations for …

**Figure 7.** Whitened pretrained-feature experiment with real labels. This run is included as a qualitative …
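The schedule exponents stated in the Figure 3 caption can be tabulated with exact rational arithmetic. The sketch below encodes only the three scalings quoted in that caption; the function name `schedule_exponents` is hypothetical, and the consistency relation ε_g ∝ sqrt(Δ)/D checked in the loop is an assumption, not a formula quoted from the paper.

```python
from fractions import Fraction

def schedule_exponents(gamma):
    """Exponents from the Figure 3 caption for eta(alpha) ∝ alpha**(-gamma)."""
    gamma = Fraction(gamma)
    if not 0 <= gamma < 1:
        raise ValueError("the caption states the formula only for 0 <= gamma < 1")
    return {
        "eps_g": -(2 + gamma) / 6,  # eps_g ~ alpha**(-(2 + gamma) / 6)
        "D": (1 - gamma) / 3,       # D ∝ alpha**((1 - gamma) / 3)
        "Delta": -gamma,            # Delta ∝ eta(alpha) ∝ alpha**(-gamma)
    }

for g in (Fraction(0), Fraction(1, 2), Fraction(9, 10)):
    e = schedule_exponents(g)
    # Assumed relation eps_g ∝ sqrt(Delta) / D (not quoted from the paper):
    # its exponent is Delta/2 - D, which should equal the eps_g exponent.
    assert e["eps_g"] == e["Delta"] / 2 - e["D"]
    print(f"gamma={g}: eps_g exponent {e['eps_g']}, D exponent {e['D']}")
```

At gamma = 0 this recovers the fixed-learning-rate exponent -1/3, and as gamma approaches 1 the error exponent approaches -1/2 while the growth of D stalls.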

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · 4 internal anchors

[1] Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md. Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep Learning Scaling is Predictable, Empirically. arXiv preprint arXiv:1712.00409, 2017.
[2] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Thomas Hennigan, Eric Noland, Katherine Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karén Simonyan, Erich Elsen, Oriol Vinyals, Jack Rae, and Laurent S... An empirical analysis of compute-optimal large language model training. 2022.
[3] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling Laws for Neural Language Models. arXiv preprint arXiv:2001.08361, 2020.
[4] Jack W. Rae et al. Scaling Language Models: Methods, Analysis & Insights from Training Gopher. arXiv preprint arXiv:2112.11446, 2021.
[5] Blake Bordelon, Abdulkadir Canatar, and Cengiz Pehlevan. Spectrum dependent learning curves in kernel regression and wide neural networks. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 1024–1034. PMLR, 2020.
[6] Abdulkadir Canatar, Blake Bordelon, and Cengiz Pehlevan. Spectral bias and task-model alignment explain generalization in kernel regression and infinitely wide neural networks. Nature Communications, 12(1):2914, 2021.
[7] Yasaman Bahri, Ethan Dyer, Jared Kaplan, Jaehoon Lee, and Utkarsh Sharma. Explaining neural scaling laws. Proceedings of the National Academy of Sciences, 121(27):e2311878121, 2024.
[8] Alexander Maloney, Daniel A. Roberts, and James Sully. A solvable model of neural scaling laws. arXiv preprint arXiv:2210.16859, 2022.
[9] Licong Lin, Jingfeng Wu, Sham M. Kakade, Peter L. Bartlett, and Jason D. Lee. Scaling laws in linear regression: Compute, parameters, and data. Advances in Neural Information Processing Systems, 37, 2024.
[10] Utkarsh Sharma and Jared Kaplan. Scaling laws from the data manifold dimension. Journal of Machine Learning Research, 23(9):1–34, 2022.
[11] Ryotaro Nakada and Masaaki Imaizumi. Adaptive approximation and generalization of deep neural networks with intrinsic dimensionality. Journal of Machine Learning Research, 21(174):1–38, 2020.
[12] Blake Bordelon, Alexander Atanasov, and Cengiz Pehlevan. A dynamical model of neural scaling laws. In International Conference on Machine Learning, 2024.
[13] Blake Bordelon, Alexander Atanasov, and Cengiz Pehlevan. How feature learning can improve neural scaling laws. Journal of Statistical Mechanics: Theory and Experiment, 2025(8):084002, 2025.
[14] Roman Worschech and Bernd Rosenow. Analyzing neural scaling laws in two-layer networks with power-law data spectra. In International Conference on Learning Representations, 2025. Spotlight.
[15] Maissam Barkeshli, Alberto Alfarano, and Andrey Gromov. On the origin of neural scaling laws: From random graphs to natural language. arXiv preprint arXiv:2601.10684, 2026.
[16] Yizhou Liu, Ziming Liu, Cengiz Pehlevan, and Jeff Gore. Universal One-third Time Scaling in Learning Peaked Distributions. arXiv preprint arXiv:2602.03685, 2026.
[17] Elisabetta Cornacchia, Francesca Mignacco, Rodrigo Veiga, Cédric Gerbelot, Bruno Loureiro, and Lenka Zdeborová. Learning curves for the multi-class teacher–student perceptron. Machine Learning: Science and Technology, 4(1):015019, 2023.
[18] Michael Biehl and Peter Riegler. On-line learning with a student-teacher scenario. Europhysics Letters, 28(7):525, 1994.
[19] Manfred Opper and David Haussler. Calculation of the learning curve of Bayes optimal classification algorithm for learning a perceptron with noise. Physical Review Letters, 66(20):2677, 1991.
[20] David Saad and Sara A Solla. Exact solution for on-line learning in multilayer neural networks. Physical Review Letters, 74(21):4337, 1995.
[21] H. S. Seung, H. Sompolinsky, and N. Tishby. Statistical mechanics of learning from examples. Phys. Rev. A, 45:6056–6091, Apr 1992.
[22] Frederieke Richert, Roman Worschech, and Bernd Rosenow. Soft mode in the dynamics of over-realizable online learning for soft committee machines. Physical Review E, 105(5):L052302, 2022.
[23] Madhu S Advani and Andrew M Saxe. High-dimensional dynamics of generalization error in neural networks. Neural Networks, 132:428–446, 2020.
[24] Sebastian Goldt, Madhu Advani, Andrew M Saxe, Florent Krzakala, and Lenka Zdeborová. Dynamics of stochastic gradient descent for two-layer neural networks in the teacher-student setup. In Advances in Neural Information Processing Systems, volume 32, 2019.
[25] Francesca Mignacco, Florent Krzakala, Pierfrancesco Urbani, and Lenka Zdeborová. Dynamical mean-field theory for SGD in high-dimensional classification. In Advances in Neural Information Processing Systems, volume 33, pages 5834–5845, 2020.
[26] Benjamin Aubin, Florent Krzakala, Yue Lu, and Lenka Zdeborová. Generalization error in high-dimensional perceptrons: Approaching Bayes error with convex optimization. Advances in Neural Information Processing Systems, 33:12199–12210, 2020.
[27] Bruno Loureiro, Gabriele Sicuro, Cédric Gerbelot, Alessandro Pacco, Florent Krzakala, and Lenka Zdeborová. Learning curves of generic features maps for realistic datasets with a teacher-student model. In Advances in Neural Information Processing Systems, volume 34, pages 18137–18151, 2021.
[28] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.
[29] Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 32(1):56–134, 2004.
[30] Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.
[31] Anqi Mao, Mehryar Mohri, and Yutao Zhong. A Universal Growth Rate for Learning with Smooth Surrogate Losses. In Advances in Neural Information Processing Systems, volume 37, pages 41670–41708. Curran Associates, Inc., 2024.
[32] Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. In International Conference on Learning Representations, 2018.
[33] Mor Shpigel Nacson, Nathan Srebro, and Daniel Soudry. Stochastic gradient descent on separable data: Exact convergence with a fixed learning rate. In Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, volume 89 of Proceedings of Machine Learning Research, pages 3051–3059. PMLR, 2019.
[34] Yutong Wang and Clayton Scott. Unified binary and multiclass margin-based classification. Journal of Machine Learning Research, 25(143):1–51, 2024.
[35] Hrithik Ravi, Clayton Scott, Daniel Soudry, and Yutong Wang. The implicit bias of gradient descent on separable multiclass data. In Advances in Neural Information Processing Systems, volume 37, 2024.
[36] Maziar Raissi, Paris Perdikaris, and George E Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378:686–707, 2019.
[37] Preetum Nakkiran, Behnam Neyshabur, and Hanie Sedghi. The deep bootstrap framework: Good online learners are good offline generalizers. In International Conference on Learning Representations, 2021.

[1] [1]

Deep Learning Scaling is Predictable, Empirically

Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md. Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep Learning Scaling is Predictable, Empirically. arXiv preprint arXiv:1712.00409, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[2] [2]

An empirical analysis of compute- optimal large language model training

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Thomas Hennigan, Eric Noland, Katherine Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karén Simonyan, Erich Elsen, Oriol Vinyals, Jack Rae, and Laurent S...

work page 2022

[3] [3]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling Laws for Neural Language Models.arXiv preprint arXiv:2001.08361, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2001

[4] [4]

Scaling Language Models: Methods, Analysis & Insights from Training Gopher

Jack W. Rae et al. Scaling Language Models: Methods, Analysis & Insights from Training Gopher.arXiv preprint arXiv:2112.11446, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[5] [5]

Spectrum dependent learning curves in kernel regression and wide neural networks

Blake Bordelon, Abdulkadir Canatar, and Cengiz Pehlevan. Spectrum dependent learning curves in kernel regression and wide neural networks. InProceedings of the 37th International Conference on Machine Learning, volume 119 ofProceedings of Machine Learning Research, pages 1024–1034. PMLR, 2020

work page 2020

[6] [6]

Spectral bias and task-model alignment explain generalization in kernel regression and infinitely wide neural networks.Nature Communications, 12(1): 2914, 2021

Abdulkadir Canatar, Blake Bordelon, and Cengiz Pehlevan. Spectral bias and task-model alignment explain generalization in kernel regression and infinitely wide neural networks.Nature Communications, 12(1): 2914, 2021

work page 2021

[7] [7]

Explaining neural scaling laws.Proceedings of the National Academy of Sciences, 121(27):e2311878121, 2024

Yasaman Bahri, Ethan Dyer, Jared Kaplan, Jaehoon Lee, and Utkarsh Sharma. Explaining neural scaling laws.Proceedings of the National Academy of Sciences, 121(27):e2311878121, 2024

work page 2024

[8] Alexander Maloney, Daniel A. Roberts, and James Sully. A solvable model of neural scaling laws. arXiv preprint arXiv:2210.16859, 2022.

[9] Licong Lin, Jingfeng Wu, Sham M. Kakade, Peter L. Bartlett, and Jason D. Lee. Scaling laws in linear regression: Compute, parameters, and data. Advances in Neural Information Processing Systems, 37, 2024.

[10] Utkarsh Sharma and Jared Kaplan. Scaling laws from the data manifold dimension. Journal of Machine Learning Research, 23(9):1–34, 2022.

[11] Ryotaro Nakada and Masaaki Imaizumi. Adaptive approximation and generalization of deep neural networks with intrinsic dimensionality. Journal of Machine Learning Research, 21(174):1–38, 2020.

[12] Blake Bordelon, Alexander Atanasov, and Cengiz Pehlevan. A dynamical model of neural scaling laws. In International Conference on Machine Learning, 2024.

[13] Blake Bordelon, Alexander Atanasov, and Cengiz Pehlevan. How feature learning can improve neural scaling laws. Journal of Statistical Mechanics: Theory and Experiment, 2025(8):084002, 2025.

[14] Roman Worschech and Bernd Rosenow. Analyzing neural scaling laws in two-layer networks with power-law data spectra. In International Conference on Learning Representations, 2025. Spotlight.

[15] Maissam Barkeshli, Alberto Alfarano, and Andrey Gromov. On the origin of neural scaling laws: From random graphs to natural language. arXiv preprint arXiv:2601.10684, 2026.

[16] Yizhou Liu, Ziming Liu, Cengiz Pehlevan, and Jeff Gore. Universal One-third Time Scaling in Learning Peaked Distributions. arXiv preprint arXiv:2602.03685, 2026.

[17] Elisabetta Cornacchia, Francesca Mignacco, Rodrigo Veiga, Cédric Gerbelot, Bruno Loureiro, and Lenka Zdeborová. Learning curves for the multi-class teacher–student perceptron. Machine Learning: Science and Technology, 4(1):015019, 2023.

[18] Michael Biehl and Peter Riegler. On-line learning with a student-teacher scenario. Europhysics Letters, 28(7):525, 1994.

[19] Manfred Opper and David Haussler. Calculation of the learning curve of Bayes optimal classification algorithm for learning a perceptron with noise. Physical Review Letters, 66(20):2677, 1991.

[20] David Saad and Sara A Solla. Exact solution for on-line learning in multilayer neural networks. Physical Review Letters, 74(21):4337, 1995.

[21] H. S. Seung, H. Sompolinsky, and N. Tishby. Statistical mechanics of learning from examples. Phys. Rev. A, 45:6056–6091, Apr 1992.

[22] Frederieke Richert, Roman Worschech, and Bernd Rosenow. Soft mode in the dynamics of over-realizable online learning for soft committee machines. Physical Review E, 105(5):L052302, 2022.

[23] Madhu S Advani and Andrew M Saxe. High-dimensional dynamics of generalization error in neural networks. Neural Networks, 132:428–446, 2020.

[24] Sebastian Goldt, Madhu Advani, Andrew M Saxe, Florent Krzakala, and Lenka Zdeborová. Dynamics of stochastic gradient descent for two-layer neural networks in the teacher-student setup. In Advances in Neural Information Processing Systems, volume 32, 2019.

[25] Francesca Mignacco, Florent Krzakala, Pierfrancesco Urbani, and Lenka Zdeborová. Dynamical mean-field theory for SGD in high-dimensional classification. In Advances in Neural Information Processing Systems, volume 33, pages 5834–5845, 2020.

[26] Benjamin Aubin, Florent Krzakala, Yue Lu, and Lenka Zdeborová. Generalization error in high-dimensional perceptrons: Approaching Bayes error with convex optimization. Advances in Neural Information Processing Systems, 33:12199–12210, 2020.

[27] Bruno Loureiro, Gabriele Sicuro, Cédric Gerbelot, Alessandro Pacco, Florent Krzakala, and Lenka Zdeborová. Learning curves of generic features maps for realistic datasets with a teacher-student model. In Advances in Neural Information Processing Systems, volume 34, pages 18137–18151, 2021.

[28] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.

[29] Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 32(1):56–134, 2004.

[30] Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.

[31] Anqi Mao, Mehryar Mohri, and Yutao Zhong. A Universal Growth Rate for Learning with Smooth Surrogate Losses. In Advances in Neural Information Processing Systems, volume 37, pages 41670–41708. Curran Associates, Inc., 2024.

[32] Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. In International Conference on Learning Representations, 2018.

[33] Mor Shpigel Nacson, Nathan Srebro, and Daniel Soudry. Stochastic gradient descent on separable data: Exact convergence with a fixed learning rate. In Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, volume 89 of Proceedings of Machine Learning Research, pages 3051–3059. PMLR, 2019.

[34] Yutong Wang and Clayton Scott. Unified binary and multiclass margin-based classification. Journal of Machine Learning Research, 25(143):1–51, 2024.

[35] Hrithik Ravi, Clayton Scott, Daniel Soudry, and Yutong Wang. The implicit bias of gradient descent on separable multiclass data. In Advances in Neural Information Processing Systems, volume 37, 2024.

[36] Maziar Raissi, Paris Perdikaris, and George E Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378:686–707, 2019.

[37] Preetum Nakkiran, Behnam Neyshabur, and Hanie Sedghi. The deep bootstrap framework: Good online learners are good offline generalizers. In International Conference on Learning Representations, 2021.

A Exact centered dynamics for the symmetric K-class model

This appendix gives the derivation of the exact centered closure used in Section 3. Throughout, K is f...