A Boundary-Layer Mechanism for One-Third Scaling in Online Softmax Classification
Pith reviewed 2026-05-22 07:00 UTC · model grok-4.3
The pith
In online softmax classification, only thin boundary layers near decision boundaries remain active at late times, producing generalization error that decays as training time to the minus one third.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
After subtracting the mean logit, the thermodynamic-limit dynamics close in centered variables consisting of a growing centered student-teacher alignment D and the residual student variance Delta. At late times, examples away from teacher decision boundaries contribute exponentially little to the loss and gradients, so only boundary layers of width O(D^{-1}) remain active while noise from fixed-learning-rate online gradient descent maintains nonzero Delta. The late-time solution of these dynamics produces an alpha^{-1/3} power law for both the test loss and the generalization error epsilon_g (one minus test accuracy). Learning-rate schedules can improve the generalization error to an epsilon_g ~ alpha^{-1/2} power law.
What carries the argument
Boundary layers of width O(D^{-1}) that stay active at late times while noise sustains nonzero residual variance Delta in the centered order-parameter dynamics.
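The O(D^{-1}) active fraction can be illustrated with a small numerical sketch. It is a two-class toy and not the paper's centered K-class dynamics: it assumes a teacher field u drawn from a standard normal, a student logit equal to D*u (perfect directional alignment, Delta = 0), and the logistic-loss gradient magnitude sigmoid(-D|u|). The function name `active_mass` and the grid parameters are hypothetical.

```python
import numpy as np


def active_mass(D, u_max=8.0, n_grid=160001):
    """Average gradient magnitude E[sigmoid(-D|u|)] for u ~ N(0, 1).

    Toy assumption: two classes, student logit D*u, label sign(u).
    Examples with |u| >> 1/D are confidently classified and contribute
    exponentially little, so the average is dominated by |u| <~ 1/D.
    """
    u = np.linspace(-u_max, u_max, n_grid)
    du = u[1] - u[0]
    density = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    margin = D * np.abs(u)
    # sigmoid(-margin), written so that large margins underflow to 0
    grad_mag = np.exp(-margin) / (1.0 + np.exp(-margin))
    # Riemann sum on a uniform grid; the integrand vanishes at the ends
    return float(np.sum(density * grad_mag) * du)


# Large-D limit of D * active_mass(D) in this toy: 2 * ln(2) / sqrt(2 * pi)
LIMIT = 2.0 * np.log(2.0) / np.sqrt(2.0 * np.pi)
```

In this toy, multiplying D by ten shrinks the active mass by roughly a factor of ten, which is the 1/D behaviour the argument relies on.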
If this is right
- Both test loss and generalization error epsilon_g decay as alpha^{-1/3} under fixed learning rate.
- This scaling is slower than the Bayes-optimal alpha^{-1} for the same teacher-student setup.
- Scheduled learning rates can improve the generalization error to an epsilon_g ~ alpha^{-1/2} power law.
- Data structure can dominate early transients, but the boundary-layer mechanism governs the asymptotic regime.
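A short worked comparison shows what the three exponents above mean in practice. It assumes only a pure power law epsilon_g proportional to alpha^{-beta}, with no prefactor changes or transients, and asks how much longer training must run to halve the error.

```latex
\epsilon_g(\alpha) \propto \alpha^{-\beta}
\quad\Longrightarrow\quad
\frac{\epsilon_g(\alpha_2)}{\epsilon_g(\alpha_1)} = \frac{1}{2}
\iff
\frac{\alpha_2}{\alpha_1} = 2^{1/\beta}

\beta = \tfrac{1}{3}:\ \alpha_2/\alpha_1 = 8,
\qquad
\beta = \tfrac{1}{2}:\ \alpha_2/\alpha_1 = 4,
\qquad
\beta = 1:\ \alpha_2/\alpha_1 = 2
```

Under these idealized assumptions, halving the error costs 8 times more training at fixed learning rate, 4 times more with the improved schedule exponent, and 2 times more at the Bayes-optimal reference rate.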
Where Pith is reading between the lines
- The same boundary-layer bottleneck may appear in other surrogate losses whenever hard labels are approximated by smooth functions.
- If real data possess well-defined decision boundaries, this mechanism could set a lower bound on how fast classification error can improve with compute.
- Controlled experiments with whitened features suggest that the scaling is robust once the model enters the late-time regime.
Load-bearing premise
The thermodynamic-limit dynamics close exactly in centered variables after subtracting the mean logit, so that only alignment D and residual variance Delta matter and off-boundary examples contribute negligibly.
What would settle it
In long-time simulations of the online teacher-student softmax model with a fixed learning rate, check whether the measured generalization error epsilon_g follows alpha^{-1/3} rather than a different exponent.
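A minimal sketch of such a simulation is given below. It is not the paper's code. It assumes i.i.d. standard Gaussian inputs, a random linear teacher whose argmax gives the hard label, a linear softmax student, logits scaled by 1/sqrt(N), and training time alpha = steps / N. The function name and all default values are hypothetical, and the defaults are far too small to probe the asymptotic regime.

```python
import numpy as np


def simulate_online_softmax(n_features=50, n_classes=3, lr=1.0,
                            alpha_max=40.0, n_checkpoints=10,
                            n_test=2000, seed=0):
    """Online SGD on softmax cross-entropy with hard teacher labels.

    Returns (alphas, errors): alpha = steps / n_features, and errors is
    one minus test accuracy measured on a fixed held-out set.
    """
    rng = np.random.default_rng(seed)
    scale = 1.0 / np.sqrt(n_features)  # assumed 1/sqrt(N) logit scaling
    teacher = rng.standard_normal((n_classes, n_features))
    student = 1e-3 * rng.standard_normal((n_classes, n_features))

    x_test = rng.standard_normal((n_test, n_features))
    y_test = np.argmax(x_test @ teacher.T, axis=1)

    n_steps = int(alpha_max * n_features)
    checkpoints = set(np.linspace(0, n_steps, n_checkpoints, dtype=int).tolist())

    alphas, errors = [], []
    for step in range(n_steps + 1):
        if step in checkpoints:
            pred = np.argmax(x_test @ student.T, axis=1)
            alphas.append(step / n_features)
            errors.append(float(np.mean(pred != y_test)))

        # One fresh example per step (online learning, no reuse)
        x = rng.standard_normal(n_features)
        label = int(np.argmax(teacher @ x))

        logits = scale * (student @ x)
        logits -= logits.max()  # numerical stability only
        probs = np.exp(logits)
        probs /= probs.sum()

        target = np.zeros(n_classes)
        target[label] = 1.0
        # Gradient step on cross-entropy: dL/dW = -(target - probs) x^T * scale
        student += lr * scale * np.outer(target - probs, x)

    return np.array(alphas), np.array(errors)
```

To test the exponent, one would raise `n_features` and `alpha_max` substantially, average over seeds, and fit the slope of log(errors) against log(alphas) only at the largest alpha.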
Original abstract
Hard-label classification is usually trained with smooth surrogate losses, most prominently softmax cross-entropy. We isolate an asymptotic mechanism by which this mismatch between smooth surrogate and discrete labels produces power-law learning curves in an online teacher-student model. After subtracting the mean logit, the thermodynamic-limit dynamics close in centered variables: a growing centered student-teacher alignment $D$ and the residual student variance $\Delta$. At late times, examples away from teacher decision boundaries are already classified confidently and contribute exponentially little. Only boundary layers of width $O(D^{-1})$ remain active, while the noise of fixed-learning-rate online gradient descent maintains a nonzero $\Delta$. As a function of the training time $\alpha$ the late-time solution yields a $\alpha^{-1/3}$ power law not only for the test loss but also for the generalization error $\epsilon_g$, i.e., one minus test accuracy. This is much slower than the $\alpha^{-1}$ Bayes-optimal reference for the same model. We further show that learning-rate schedules can improve the generalization error towards a $\epsilon_g \sim \alpha^{-1/2}$ power law. Simulations support the predicted order parameter dynamics and learning curves. Controlled experiments with correlated Gaussian inputs and whitened pretrained features show that data structure can dominate transients. Therefore, our result is an asymptotic, complementary mechanism rather than an alternative to spectral explanations of neural scaling laws.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper analyzes online softmax cross-entropy training in a teacher-student classification model. After centering logits, the thermodynamic-limit dynamics close on two order parameters: a growing student-teacher alignment D and a residual variance Δ. At late times only O(D^{-1})-width boundary layers around the teacher decision boundaries remain active, while fixed-learning-rate noise sustains nonzero Δ. This balance produces test loss and generalization error ε_g both scaling as α^{-1/3}, slower than the Bayes-optimal α^{-1} reference. Learning-rate schedules are shown to improve the generalization error toward α^{-1/2} scaling. Simulations support the predicted order-parameter trajectories and learning curves; controlled experiments with correlated Gaussian inputs and whitened pretrained features illustrate that data structure can dominate transients.
Significance. If the boundary-layer closure and scaling balance hold, the work supplies a concrete, mechanistic origin for a specific power-law exponent that arises directly from the surrogate-loss/hard-label mismatch in online gradient descent. The reduction to two centered variables, the explicit 1/D active-fraction argument, and the resulting α^{-1/3} prediction are falsifiable and complementary to spectral accounts of neural scaling. The demonstration that simple schedules improve the exponent to -1/2 and the discussion of data-structure transients add practical value.
major comments (2)
- §3.2, Eq. (18)–(22): the thermodynamic-limit closure in centered variables (D, Δ) is asserted after subtracting the mean logit. The derivation of the drift and diffusion terms for the boundary layer must explicitly show that all higher-order moments and cross-correlations remain sub-leading when the active fraction is O(D^{-1}); otherwise the two-variable reduction is not closed at the order needed for the α^{-1/3} balance.
- §4.1, Figure 3: the reported late-time exponent for ε_g is fitted over a limited α window. Because the claimed scaling is asymptotic, the manuscript should include a quantitative check (e.g., local slope versus α or extrapolation to infinite α) that rules out slower transients or crossover to the Bayes-optimal regime within the simulated range.
minor comments (3)
- Notation: the symbol Δ is used both for residual variance and for the teacher-student overlap in some intermediate equations; a single consistent definition or explicit distinction would prevent confusion.
- Figure 1 caption: the plotted curves are labeled “theory” but the caption does not state whether they are the exact solution of the two-variable ODE or a numerical integration; clarify the source of the solid lines.
- Reference list: the discussion of spectral scaling laws cites only a subset of the recent literature; adding the most directly comparable teacher-student analyses would help readers locate the present mechanism within the broader literature.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments. We address the points below and have revised the manuscript to incorporate clarifications and additional checks.
- Referee: §3.2, Eq. (18)–(22): the thermodynamic-limit closure in centered variables (D, Δ) is asserted after subtracting the mean logit. The derivation of the drift and diffusion terms for the boundary layer must explicitly show that all higher-order moments and cross-correlations remain sub-leading when the active fraction is O(D^{-1}); otherwise the two-variable reduction is not closed at the order needed for the α^{-1/3} balance.
  Authors: We appreciate the request for an explicit bound. In the revised §3.2 we add a dedicated paragraph deriving the moment scalings: outside the O(D^{-1}) layer the measure is exponentially small (O(e^{-cD})), while inside the layer the local fields remain O(1) and the width supplies an extra 1/D factor, so that all higher cumulants and cross-correlations are O(1/D) or smaller. These corrections are sub-dominant to the leading drift-diffusion balance that produces the α^{-1/3} scaling, thereby closing the two-variable system at the required order.
  Revision: yes
- Referee: §4.1, Figure 3: the reported late-time exponent for ε_g is fitted over a limited α window. Because the claimed scaling is asymptotic, the manuscript should include a quantitative check (e.g., local slope versus α or extrapolation to infinite α) that rules out slower transients or crossover to the Bayes-optimal regime within the simulated range.
  Authors: We agree that a direct diagnostic of the asymptotic regime is useful. The revised Figure 3 now includes an inset plotting the local logarithmic slope d log ε_g / d log α versus α; the slope approaches −1/3 at the largest simulated α and shows no systematic drift toward −1. We also add a short table of effective exponents obtained from successive α windows, confirming convergence to the predicted value without detectable crossover in the accessible range.
  Revision: yes
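The two diagnostics named in this exchange, the local logarithmic slope and effective exponents from successive α windows, can be sketched as follows. This is an illustrative implementation, not the authors' analysis code; the function names and the equal-count window split are assumptions.

```python
import numpy as np


def local_log_slope(alpha, eps):
    """Pointwise estimate of d log(eps) / d log(alpha) by finite differences."""
    log_a = np.log(np.asarray(alpha, dtype=float))
    log_e = np.log(np.asarray(eps, dtype=float))
    # np.gradient accepts the (possibly non-uniform) sample coordinates
    return np.gradient(log_e, log_a)


def windowed_exponents(alpha, eps, n_windows=4):
    """Least-squares slope of log(eps) vs log(alpha) in successive windows."""
    log_a = np.log(np.asarray(alpha, dtype=float))
    log_e = np.log(np.asarray(eps, dtype=float))
    exponents = []
    for idx in np.array_split(np.arange(log_a.size), n_windows):
        slope, _intercept = np.polyfit(log_a[idx], log_e[idx], 1)
        exponents.append(slope)
    return np.array(exponents)
```

A systematic trend of the windowed exponents with increasing α, rather than a plateau, would indicate that the fit is still inside a transient or a crossover.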
Circularity Check
No significant circularity in the derivation
The paper derives the late-time α^{-1/3} scaling for test loss and generalization error from the thermodynamic-limit closure of dynamics in centered variables D (growing alignment) and Δ (residual variance) after subtracting the mean logit. Only boundary layers of width O(D^{-1}) remain active due to exponential suppression of bulk contributions, with fixed-learning-rate noise maintaining nonzero Δ. The scaling follows from integrating the drift over the active fraction and balancing the resulting damping rate against diffusion, without any reduction to fitted parameters, self-definitional loops, or load-bearing self-citations. Simulations are invoked only for support, not as the source of the scaling itself. The analysis is self-contained within the online teacher-student model and positioned as complementary to spectral mechanisms.
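The step of integrating the drift over the active fraction can be illustrated with a toy calculation. It is not the paper's centered (D, Δ) system, whose equations are not reproduced on this page. The toy assumes two classes, a teacher field u drawn from a standard normal with density p(u), a perfectly aligned student with logit Du, a learning rate \eta, and no diffusion term.

```latex
% Toy drift of the alignment: signal |u| weighted by the gradient magnitude
\frac{dD}{d\alpha}
  = \eta\, \mathbb{E}\!\left[\, |u|\, \sigma(-D|u|) \,\right]
  \approx 2\,\eta\, p(0) \int_0^\infty u\, \sigma(-Du)\, du
  = \frac{2\,\eta\, p(0)}{D^{2}} \int_0^\infty \frac{t}{1+e^{t}}\, dt
  = \frac{c\,\eta}{D^{2}},
\qquad
c = \frac{\pi^{2}}{6\sqrt{2\pi}}

% Integrating the resulting ODE
D^{3}(\alpha) = D_0^{3} + 3\,c\,\eta\,\alpha
\quad\Longrightarrow\quad
D \propto \alpha^{1/3},
\qquad
D^{-1} \propto \alpha^{-1/3}
```

Under these assumptions the boundary-layer width D^{-1} shrinks as α^{-1/3}, which matches the exponent quoted above. How the diffusion that sustains Δ enters the balance is specific to the paper's derivation and is not captured by this toy.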
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Thermodynamic-limit dynamics close in centered variables after subtracting the mean logit
- domain assumption: At late times, only boundary layers of width O(D^{-1}) remain active while noise maintains nonzero Δ