In extensive-width networks, features are recovered sequentially through sharp phase transitions, yielding an effective width k_c that unifies Bayes-optimal generalization error scaling as Θ(k_c d / n).
Explaining neural scaling laws.Proceedings of the National Academy of Sciences, 121(27):e2311878121
3 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 2polarities
background 2representative citing papers
Derives α^{-1/3} scaling for generalization error in online softmax classification from boundary layers in a teacher-student model.
Strong superposition causes neural loss to scale as the inverse of model dimension due to geometric feature overlaps, explaining scaling laws for broad frequency distributions.
citing papers explorer
-
Sharp feature-learning transitions and Bayes-optimal neural scaling laws in extensive-width networks
In extensive-width networks, features are recovered sequentially through sharp phase transitions, yielding an effective width k_c that unifies Bayes-optimal generalization error scaling as Θ(k_c d / n).
-
A Boundary-Layer Mechanism for One-Third Scaling in Online Softmax Classification
Derives α^{-1/3} scaling for generalization error in online softmax classification from boundary layers in a teacher-student model.
-
Superposition Yields Robust Neural Scaling
Strong superposition causes neural loss to scale as the inverse of model dimension due to geometric feature overlaps, explaining scaling laws for broad frequency distributions.