
arxiv: 2605.22341 · v1 · pith:RSJAPOJUnew · submitted 2026-05-21 · 💻 cs.LG · cond-mat.dis-nn

A Boundary-Layer Mechanism for One-Third Scaling in Online Softmax Classification

Pith reviewed 2026-05-22 07:00 UTC · model grok-4.3

classification 💻 cs.LG cond-mat.dis-nn
keywords softmax cross-entropy · online learning · teacher-student model · power-law scaling · generalization error · boundary layers · thermodynamic limit · learning-rate schedules

The pith

In online softmax classification, only thin boundary layers near decision boundaries remain active at late times, producing generalization error that decays as training time to the minus one third.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper isolates an asymptotic mechanism in a teacher-student model in which softmax cross-entropy is used to train on hard labels. After centering the logits by subtracting their mean, the thermodynamic-limit dynamics reduce to two order parameters: a growing student-teacher alignment D and a residual variance Delta kept nonzero by gradient noise. At late times, examples far from the teacher's decision boundaries are already classified with high confidence and contribute exponentially small gradients; only boundary layers of width proportional to one over D stay active. Solving the resulting closed equations yields a power-law decay of both test loss and generalization error epsilon_g as alpha to the minus one third, where alpha is training time. This scaling is slower than the Bayes-optimal reference of alpha to the minus one for the same model, and the authors show that learning-rate schedules can improve the generalization error towards a faster alpha to the minus one half decay.
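To make the exponential suppression concrete, here is a minimal sketch that evaluates the softmax cross-entropy gradient with respect to the logits, softmax(z) minus the one-hot label, for logits of the form z = D * t. The toy fields t, the helper name `logit_grad_norm`, and the choice K = 3 are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the result is unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def logit_grad_norm(D, margin, K=3):
    """Norm of the cross-entropy gradient with respect to the logits.

    Hypothetical toy setup: the winning class leads every other class
    by `margin` in the unscaled field, and the logits are D times that.
    """
    t = np.zeros(K)
    t[0] = margin                  # class 0 wins by `margin`
    p = softmax(D * t)
    y = np.zeros(K)
    y[0] = 1.0                     # hard label = winning class
    return np.linalg.norm(p - y)

for D in (1.0, 10.0, 100.0):
    near = logit_grad_norm(D, margin=1.0 / D)   # inside a layer of width ~1/D
    far = logit_grad_norm(D, margin=1.0)        # an O(1) distance away
    print(f"D={D:6.1f}  near-boundary grad={near:.3e}  far grad={far:.3e}")
```

In this toy the gradient inside a layer of width 1/D stays of order one for every D, while the gradient at a fixed O(1) margin shrinks roughly like exp(-D).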

Core claim

After subtracting the mean logit, the thermodynamic-limit dynamics close in centered variables consisting of a growing centered student-teacher alignment D and the residual student variance Delta. At late times, examples away from teacher decision boundaries contribute exponentially little to the loss and gradients, so only boundary layers of width O(D^{-1}) remain active while noise from fixed-learning-rate online gradient descent maintains nonzero Delta. The late-time solution of these dynamics produces an alpha^{-1/3} power law for both the test loss and the generalization error epsilon_g (one minus test accuracy). Learning-rate schedules can improve the generalization error to an epsilon_g ~ alpha^{-1/2} power law.

What carries the argument

Boundary layers of width O(D^{-1}) that stay active at late times while noise sustains nonzero residual variance Delta in the centered order-parameter dynamics.
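A hedged scaling sketch of how a layer of width O(D^{-1}) can lead to a one-third exponent at fixed learning rate eta. The drift law for D is an assumption, chosen because it reproduces the scaling D ∝ alpha^{(1-gamma)/3} quoted in the Figure 3 caption when eta ∝ alpha^{-gamma}; it is not an equation taken from the paper. The relation epsilon_g ∝ sqrt(Delta)/D is likewise an assumed geometric estimate of the student-teacher misalignment angle.

```latex
% ASSUMPTION: late-time drift of the alignment. Only a fraction ~ 1/D of
% examples is active, and the net push per unit time is taken as eta / D^2.
\frac{dD}{d\alpha} \simeq c\,\frac{\eta}{D^{2}}
\quad\Longrightarrow\quad
D(\alpha)^{3} \simeq 3c\,\eta\,\alpha
\quad\Longrightarrow\quad
D \propto \alpha^{1/3} .

% Fixed learning rate: the residual variance stays at a constant set by eta.
\Delta \propto \eta = \text{const} ,
\qquad
\epsilon_g \propto \frac{\sqrt{\Delta}}{D} \propto \alpha^{-1/3} ,
\qquad
\text{test loss} \propto \text{active fraction} \propto \frac{1}{D} \propto \alpha^{-1/3} .
```

With c an unspecified positive constant, both observables inherit their exponent from the growth of D alone, which is why the test loss and the generalization error share the same one-third law in this sketch.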

If this is right

  • Both test loss and generalization error epsilon_g decay as alpha^{-1/3} under fixed learning rate.
  • This scaling is slower than the Bayes-optimal alpha^{-1} for the same teacher-student setup.
  • Scheduled learning rates can improve the generalization error to an epsilon_g ~ alpha^{-1/2} power law.
  • Data structure can dominate early transients, but the boundary-layer mechanism governs the asymptotic regime.
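The exponents in this list can be cross-checked against the schedule-dependent scalings quoted in the Figure 3 caption. The sketch below is plain exponent bookkeeping with exact fractions. The function name `schedule_exponents` is hypothetical, and the relation epsilon_g ∝ sqrt(Delta)/D used for the consistency check is an assumption that is not stated in the source.

```python
from fractions import Fraction

def schedule_exponents(gamma):
    """Exponents quoted in the Figure 3 caption for eta(alpha) ~ alpha**(-gamma).

    The quoted scalings hold for 0 <= gamma < 1. Exact fractions are
    returned so the bookkeeping is free of rounding error.
    """
    gamma = Fraction(gamma)
    if not (0 <= gamma < 1):
        raise ValueError("the quoted scalings hold for 0 <= gamma < 1")
    d_exp = (1 - gamma) / 3        # D ~ alpha**d_exp
    delta_exp = -gamma             # Delta ~ eta ~ alpha**(-gamma)
    eps_exp = -(2 + gamma) / 6     # eps_g ~ alpha**eps_exp
    return d_exp, delta_exp, eps_exp

for g in (Fraction(0), Fraction(1, 4), Fraction(1, 2), Fraction(3, 4)):
    d_exp, delta_exp, eps_exp = schedule_exponents(g)
    # ASSUMPTION (not from the source): eps_g ~ sqrt(Delta) / D.
    implied = delta_exp / 2 - d_exp
    print(f"gamma={g}: D exp={d_exp}, Delta exp={delta_exp}, "
          f"eps_g exp={eps_exp}, implied by sqrt(Delta)/D={implied}")
```

At gamma = 0 the bookkeeping gives the fixed-rate exponent -1/3, and as gamma approaches 1 the error exponent approaches -1/2, matching the two regimes listed above.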

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same boundary-layer bottleneck may appear in other surrogate losses whenever hard labels are approximated by smooth functions.
  • If real data possess well-defined decision boundaries, this mechanism could set a lower bound on how fast classification error can improve with compute.
  • Controlled experiments with whitened features suggest that the scaling is robust once the model enters the late-time regime.

Load-bearing premise

The thermodynamic-limit dynamics close exactly in centered variables after subtracting the mean logit, so that only alignment D and residual variance Delta matter and off-boundary examples contribute negligibly.
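The centering step itself relies on a standard property: softmax probabilities do not change when the same constant is added to every logit of an example. The short sketch below checks only that invariance on random logits; it says nothing about whether the order-parameter dynamics close, which is the actual premise. The array shapes and seed are arbitrary choices.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)    # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 3))                           # 5 examples, K = 3 classes
centered = logits - logits.mean(axis=-1, keepdims=True)    # subtract the mean logit

# Softmax is invariant under a per-example shift of all logits, so the
# class probabilities (and hence the cross-entropy loss) are unchanged.
same_probs = np.allclose(softmax(logits), softmax(centered))
sums_to_zero = np.allclose(centered.sum(axis=-1), 0.0)
print(same_probs, sums_to_zero)
```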

What would settle it

In long-time simulations of the online teacher-student softmax model with fixed learning rate, check whether the measured generalization error follows alpha to the power of negative one third rather than a different exponent.
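A minimal sketch of such a simulation, assuming isotropic Gaussian inputs, a random Gaussian teacher, a zero-initialised linear student, and 1/sqrt(N) scalings of the logits and of the update. All of these choices, the default values, and the function name `simulate` are assumptions for illustration; the paper's exact setup may differ. A run this short and this small is not expected to reach the asymptotic regime and is not claimed to reproduce the one-third exponent.

```python
import numpy as np

def simulate(N=100, K=3, eta=1.0, alpha_max=50, n_test=2000, n_eval=10, seed=0):
    """Online softmax cross-entropy SGD on hard teacher labels.

    Returns macroscopic times alpha = mu / N and the estimated
    generalization error (one minus test accuracy) at those times.
    """
    rng = np.random.default_rng(seed)
    teacher = rng.normal(size=(K, N))
    student = np.zeros((K, N))
    x_test = rng.normal(size=(n_test, N))
    y_test = np.argmax(x_test @ teacher.T, axis=1)

    steps = alpha_max * N
    # Log-spaced evaluation steps between alpha = 1 and alpha = alpha_max.
    grid = np.round(np.geomspace(N, steps, n_eval)).astype(int)
    eval_at = set(int(v) for v in np.unique(grid))

    alphas, errors = [], []
    for mu in range(1, steps + 1):
        x = rng.normal(size=N)                 # fresh example at every step
        y = np.argmax(teacher @ x)             # hard teacher label
        z = student @ x / np.sqrt(N)           # student logits
        p = np.exp(z - z.max())
        p /= p.sum()
        grad = p.copy()
        grad[y] -= 1.0                         # softmax(z) - onehot(y)
        student -= (eta / np.sqrt(N)) * np.outer(grad, x)
        if mu in eval_at:
            pred = np.argmax(x_test @ student.T, axis=1)
            alphas.append(mu / N)
            errors.append(np.mean(pred != y_test))
    return np.array(alphas), np.array(errors)

alphas, errors = simulate()
for a, e in zip(alphas, errors):
    print(f"alpha={a:7.2f}  eps_g={e:.3f}")
```

To test the claim one would push `alpha_max` and `N` much higher, average over seeds, and fit the exponent only in the late-time window.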

Figures

Figures reproduced from arXiv: 2605.22341 by Bernd Rosenow, Marcel Kühn, Yoon Thelge.

Figure 1: Left: The 1/3 law appears not only in the test loss but also in the generalization error ϵ_g, i.e., one minus test accuracy. Middle: The model captures both the growth of the centered student-teacher alignment D and the rotational alignment to the teacher. Right: Near a teacher decision boundary, the late-time loss is controlled by the student boundary layer of width O(D^{-1}). The generalization error is con…
Figure 2: Finite-N validation for fixed learning rates in the K = 3 online teacher–student model. The panels show the generalization error, centered overlap D, and residual variance ∆ as functions of macroscopic time α = µ/N. The curves show representative seed trajectories, with envelopes indicating fluctuations across six simulation seeds. Within these fluctuations, the trajectories agree with the predicted power-…
Figure 3: Schedule dependence in the K = 3 online teacher–student model. For η(α) ∝ α^{−γ}, the theory predicts ϵ_g(α) ∼ α^{−(2+γ)/6} for 0 ≤ γ < 1. Increasing γ slows the growth of the centered overlap, D ∝ α^{(1−γ)/3} for γ < 1, but decreases the residual variance, ∆ ∝ η(α); the latter effect improves the classification-error exponent. The γ = 1 curve is a borderline case, where the adiabatic approximation for ∆ breaks …
Figure 5: Controlled departure from isotropic inputs. Inputs are Gaussian with diagonal covariance …
Figure 6: Dependence on the number of classes. Fixed-learning-rate simulations for …
Figure 7: Whitened pretrained-feature experiment with real labels. This run is included as a qualitative …
read the original abstract

Hard-label classification is usually trained with smooth surrogate losses, most prominently softmax cross-entropy. We isolate an asymptotic mechanism by which this mismatch between smooth surrogate and discrete labels produces power-law learning curves in an online teacher-student model. After subtracting the mean logit, the thermodynamic-limit dynamics close in centered variables: a growing centered student-teacher alignment $D$ and the residual student variance $\Delta$. At late times, examples away from teacher decision boundaries are already classified confidently and contribute exponentially little. Only boundary layers of width $O(D^{-1})$ remain active, while the noise of fixed-learning-rate online gradient descent maintains a nonzero $\Delta$. As a function of the training time $\alpha$ the late-time solution yields a $\alpha^{-1/3}$ power law not only for the test loss but also for the generalization error $\epsilon_g$, i.e., one minus test accuracy. This is much slower than the $\alpha^{-1}$ Bayes-optimal reference for the same model. We further show that learning-rate schedules can improve the generalization error towards a $\epsilon_g \sim \alpha^{-1/2}$ power law. Simulations support the predicted order parameter dynamics and learning curves. Controlled experiments with correlated Gaussian inputs and whitened pretrained features show that data structure can dominate transients. Therefore, our result is an asymptotic, complementary mechanism rather than an alternative to spectral explanations of neural scaling laws.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper analyzes online softmax cross-entropy training in a multi-class teacher-student classification model. After centering logits, the thermodynamic-limit dynamics close on two order parameters: growing student-teacher alignment D and residual variance Δ. At late times only O(D^{-1})-width boundary layers around the teacher decision boundaries remain active; fixed-learning-rate noise sustains nonzero Δ. This balance produces test loss and generalization error ε_g both scaling as α^{-1/3}, slower than the Bayes-optimal α^{-1} reference. Learning-rate schedules are shown to improve the generalization error towards α^{-1/2} scaling. Simulations support the predicted order-parameter trajectories and learning curves; controlled experiments with correlated inputs illustrate that data structure can dominate transients.

Significance. If the boundary-layer closure and scaling balance hold, the work supplies a concrete, mechanistic origin for a specific power-law exponent that arises directly from the surrogate-loss/hard-label mismatch in online gradient descent. The reduction to two centered variables, the explicit 1/D active-fraction argument, and the resulting α^{-1/3} prediction are falsifiable and complementary to spectral accounts of neural scaling. The demonstration that simple schedules improve the exponent to -1/2 and the discussion of data-structure transients add practical value.

major comments (2)
  1. §3.2, Eq. (18)–(22): the thermodynamic-limit closure in centered variables (D, Δ) is asserted after subtracting the mean logit. The derivation of the drift and diffusion terms for the boundary layer must explicitly show that all higher-order moments and cross-correlations remain sub-leading when the active fraction is O(D^{-1}); otherwise the two-variable reduction is not closed at the order needed for the α^{-1/3} balance.
  2. §4.1, Figure 3: the reported late-time exponent for ε_g is fitted over a limited α window. Because the claimed scaling is asymptotic, the manuscript should include a quantitative check (e.g., local slope versus α or extrapolation to infinite α) that rules out slower transients or crossover to the Bayes-optimal regime within the simulated range.
minor comments (3)
  1. Notation: the symbol Δ is used both for residual variance and for the teacher-student overlap in some intermediate equations; a single consistent definition or explicit distinction would prevent confusion.
  2. Figure 1 caption: the plotted curves are labeled “theory” but the caption does not state whether they are the exact solution of the two-variable ODE or a numerical integration; clarify the source of the solid lines.
  3. Reference list: the discussion of spectral scaling laws cites only a subset of the recent literature; adding the most directly comparable teacher-student analyses would help readers locate the present mechanism within the broader literature.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. We address the points below and have revised the manuscript to incorporate clarifications and additional checks.

  1. Referee: §3.2, Eq. (18)–(22): the thermodynamic-limit closure in centered variables (D, Δ) is asserted after subtracting the mean logit. The derivation of the drift and diffusion terms for the boundary layer must explicitly show that all higher-order moments and cross-correlations remain sub-leading when the active fraction is O(D^{-1}); otherwise the two-variable reduction is not closed at the order needed for the α^{-1/3} balance.

    Authors: We appreciate the request for an explicit bound. In the revised §3.2 we add a dedicated paragraph deriving the moment scalings: outside the O(D^{-1}) layer the measure is exponentially small (O(e^{-cD})), while inside the layer the local fields remain O(1) and the width supplies an extra 1/D factor, so that all higher cumulants and cross-correlations are O(1/D) or smaller. These corrections are sub-dominant to the leading drift-diffusion balance that produces the α^{-1/3} scaling, thereby closing the two-variable system at the required order. revision: yes

  2. Referee: §4.1, Figure 3: the reported late-time exponent for ε_g is fitted over a limited α window. Because the claimed scaling is asymptotic, the manuscript should include a quantitative check (e.g., local slope versus α or extrapolation to infinite α) that rules out slower transients or crossover to the Bayes-optimal regime within the simulated range.

    Authors: We agree that a direct diagnostic of the asymptotic regime is useful. The revised Figure 3 now includes an inset plotting the local logarithmic slope d log ε_g / d log α versus α; the slope approaches −1/3 at the largest simulated α and shows no systematic drift toward −1. We also add a short table of effective exponents obtained from successive α windows, confirming convergence to the predicted value without detectable crossover in the accessible range. revision: yes
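The local-slope diagnostic described in this response can be sketched as follows. The function name `local_log_slope` and the synthetic learning curve (an alpha^{-1/3} law plus a faster-decaying alpha^{-1} transient with arbitrary prefactors) are assumptions for illustration and are not the paper's data.

```python
import numpy as np

def local_log_slope(alpha, eps):
    """Local exponent d log(eps) / d log(alpha) on a possibly non-uniform grid."""
    log_a = np.log(np.asarray(alpha, dtype=float))
    log_e = np.log(np.asarray(eps, dtype=float))
    # np.gradient accepts the sample coordinates as its second argument.
    return np.gradient(log_e, log_a)

# Synthetic curve: asymptotic one-third law plus a transient that dies out.
alpha = np.geomspace(1.0, 1e6, 200)
eps = 0.5 * alpha ** (-1.0 / 3.0) + 2.0 * alpha ** (-1.0)
slope = local_log_slope(alpha, eps)
print(f"early slope={slope[0]:.3f}  late slope={slope[-1]:.3f}")
```

On noisy simulation data the curve would normally be averaged over seeds or smoothed in log-log space before differentiating, since finite differences amplify noise.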

Circularity Check

0 steps flagged

No significant circularity in the derivation


The paper derives the late-time α^{-1/3} scaling for test loss and generalization error from the thermodynamic-limit closure of dynamics in centered variables D (growing alignment) and Δ (residual variance) after subtracting the mean logit. Only boundary layers of width O(D^{-1}) remain active due to exponential suppression of bulk contributions, with fixed-learning-rate noise maintaining nonzero Δ. The scaling follows from integrating the drift over the active fraction and balancing the resulting damping rate against diffusion, without any reduction to fitted parameters, self-definitional loops, or load-bearing self-citations. Simulations are invoked only for support, not as the source of the scaling itself. The analysis is self-contained within the online teacher-student model and positioned as complementary to spectral mechanisms.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumptions of the teacher-student model in the thermodynamic limit and the dominance of boundary layers at late times, with no free parameters or invented entities explicitly introduced in the abstract.

axioms (2)
  • domain assumption Thermodynamic-limit dynamics close in centered variables after subtracting the mean logit
    Stated in the abstract as the basis for the dynamics of D and Δ.
  • domain assumption At late times, only boundary layers of width O(D^{-1}) remain active while noise maintains nonzero Δ
    Key assumption leading to the power-law solution.


