Asymmetric Scaling Laws from Sparse Features

John Sous; Michael Winer

arXiv: 2605.23591 · v1 · pith:JVXMEBYEnew · submitted 2026-05-22 · stat.ML · cond-mat.dis-nn · cs.LG · math.ST · stat.TH

Pith reviewed 2026-05-25 03:14 UTC · model grok-4.3

classification: stat.ML, cond-mat.dis-nn, cs.LG, math.ST, stat.TH

keywords: sparse activations, scaling laws, double descent, interpolation threshold, unobserved features, compute optimal, gradient descent stability

The pith

Sparse activations make test loss dominated by coordinates never seen in training, yielding asymmetric scaling laws.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that in neural networks with sparse activations, the test loss is often controlled by rare features that do not appear in any training example. This creates a bottleneck that is absent from dense models and leads to a double-descent peak near the interpolation threshold. The loss curve follows one scaling exponent in the underparameterized regime and another in the overparameterized regime, with the gap between them set by the degree of sparsity. The analysis also yields a compute-optimal allocation that favors larger datasets over bigger models at fixed compute, a scaling law for the probability that fixed-step gradient descent becomes unstable, and evidence that the sparsity effect persists under nonlinear activations.

Core claim

Test loss is dominated by the population mass on coordinates that remain unobserved during training. This induces a double-descent peak at the interpolation threshold and produces distinct scaling exponents in the under- and overparameterized regimes whose gap is fixed by the sparsity degree.

What carries the argument

Sparse activations leave rare coordinates completely unobserved during training, and these coordinates dominate the asymptotic population loss.
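
To make the unobserved-coordinate mechanism concrete, here is a minimal sketch of how much activation mass stays unseen after D training examples. It assumes, purely for illustration, that coordinate k fires independently with probability p_k = k^-(1 + alpha); that profile, the function names, and the truncation at `num_coords` are all hypothetical and are not taken from the paper.

```python
import math

def expected_unobserved_mass(num_samples, alpha, num_coords=200_000):
    """Expected activation mass on coordinates never active in training.

    Assumption (hypothetical): coordinate k is active independently with
    probability p_k = k**-(1 + alpha) in each example.
    """
    total = 0.0
    for k in range(1, num_coords + 1):
        p = k ** -(1.0 + alpha)
        # (1 - p)**D is the chance coordinate k never fires in D examples;
        # weighting by p gives its share of test-time activations.
        total += p * (1.0 - p) ** num_samples
    return total

def asymptotic_unobserved_mass(num_samples, alpha):
    """Large-D approximation of the sum above, replacing it by an integral."""
    a = 1.0 + alpha
    return math.gamma(1.0 - 1.0 / a) / a * num_samples ** -(1.0 - 1.0 / a)

for d in (1_000, 4_000, 16_000):
    print(d, expected_unobserved_mass(d, alpha=1.0),
          asymptotic_unobserved_mass(d, alpha=1.0))
```

Under this assumed profile the unseen mass decays only as a power of D, which is the kind of slow-closing bottleneck the claim describes; the actual exponents in the paper may differ.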

If this is right

The loss exhibits a double-descent peak near the interpolation threshold.
Two distinct scaling exponents govern the loss curve, separated by a gap set by sparsity.
A compute-optimal frontier favors increasing dataset size over model capacity.
The probability that fixed-step gradient descent becomes unstable follows its own scaling law.
The sparsity-induced effect holds under nonlinear activations.
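
The compute-optimal consequence in the list above can be sketched with a toy calculation. The additive loss a·N^-alpha_n + b·D^-alpha_d, the budget C = N·D, and the exponent values below are assumptions chosen for illustration; the paper's own loss function and compute model may differ.

```python
def compute_optimal_split(compute, alpha_n, alpha_d, a=1.0, b=1.0):
    """Minimize a*N**-alpha_n + b*D**-alpha_d subject to N*D = compute.

    Both the additive loss form and the budget C = N*D are assumptions
    made for this sketch, not expressions quoted from the paper.
    """
    exponent = 1.0 / (alpha_n + alpha_d)
    # Setting the derivative in N to zero gives
    # N**(alpha_n + alpha_d) = (alpha_n*a / (alpha_d*b)) * C**alpha_d.
    n_opt = (alpha_n * a / (alpha_d * b)) ** exponent * compute ** (alpha_d * exponent)
    d_opt = compute / n_opt
    loss = a * n_opt ** -alpha_n + b * d_opt ** -alpha_d
    return n_opt, d_opt, loss

# Hypothetical exponents: the loss falls faster in N than in D.
for c in (1e6, 1e9, 1e12):
    print(c, compute_optimal_split(c, alpha_n=2.0, alpha_d=1.0))
```

In this toy, N grows as C^(alpha_d/(alpha_n+alpha_d)) and D as C^(alpha_n/(alpha_n+alpha_d)), so whenever the data exponent is the smaller of the two, most of a growing budget goes to data.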

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This suggests that covering rare features in data collection could be more important than previously thought for scaling performance.
The model may apply to other sparse data domains like language or images where certain combinations are rare.
Practitioners might adjust training to explicitly handle or estimate the unobserved mass to mitigate the bottleneck.

Load-bearing premise

That the sparsity produces coordinates which stay completely unobserved in the entire training set and that their contribution dominates the population loss.

What would settle it

Finding a sparse dataset where the error contribution from never-observed coordinates does not set the scaling behavior or where the two exponents do not appear with the predicted gap.

Figures

Figures reproduced from arXiv: 2605.23591 by John Sous, Michael Winer.

**Figure 1.** Phase diagram in (α1, α2). Solid black lines mark the boundary beyond which the model is well defined. Within it we identify three regimes: symmetric one-exponent scaling, asymmetric two-exponent scaling, and a GD-failure regime with its own scaling law. Interpretation: High-Dimensional Representations with Rare but Informative Features. Sparse activations of x can be understood as rare but highly informa…

**Figure 2.** Scaling collapse and double descent. We plot the rescaled loss ℓ_Bayes · N^{α_N} as a function of (D^{α_D}/N^{α_N})^{1/α_N}, across a wide range of values for D, N, for fixed (α1 = 1.0, α2 = 0.3). All curves collapse onto a single universal function S_{α1,α2}, verifying a universal scale-invariant structure consistent with the predicted scaling law. A sharp peak emerges near ξ_crit consistent with the double descent pheno…

**Figure 3.** Empirical scaling of loss with compute. Each curve corresponds to test loss ℓ versus total compute C for a fixed model size N, with C varied by sweeping D, for fixed (α1 = 1.0, α2 = 0.3). The dashed line denotes the predicted compute-optimal scaling ℓ*(C) ∼ C^{−α_C}. The empirical loss converges toward this frontier at high compute, just past the double descent spike for each curve demonstrating agreement …

**Figure 4.** Asymmetry persists under nonlinearity. Test loss scaling under the ReLU feature map ϕ = σ(ux), trained with Nesterov + adaptive restart; (α1, α2) = (1.0, 0.3), 5 seeds. Top: sparse; bottom: dense. Left: N-sweep at D = 50,000; right: D-sweep at N = 8000. A single exponent α ≈ 1.5 (dashed lines, jointly fitted with separate intercepts) describes the sparse N-, dense N-, and dense D-sweeps. The sparse D-swee…

**Figure 5.** Two-exponent scaling validates linear theory. Test loss scaling under the linear feature map ϕ(x) = ux, computed via the closed-form min-norm least-squares solution; (α1, α2) = (1.0, 0.3), 20 seeds. Top: sparse; bottom: dense. Left: N-sweep at D = 50,000; right: D-sweep at N = 16,000. A single exponent α ≈ 2.0 (dashed lines, jointly fitted with separate intercepts) describes the sparse N-, dense N-, and de…

**Figure 6.** Asymmetry persists under nonlinearity (closed-form). Test loss scaling under the ReLU feature map ϕ(x) = σ(ux), computed via the closed-form min-norm least-squares solution; (α1, α2) = (1.0, 0.3), 20 seeds. Top: sparse; bottom: dense. Left: N-sweep at D = 50,000; right: D-sweep at N = 16,000. A single exponent α ≈ 1.25 (dashed lines, jointly fitted with separate intercepts) describes the sparse N-, dense…
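
The rescaling behind the Figure 2 collapse can be sketched as follows. `collapse_coordinates` applies the axis transformation named in that caption; `toy_loss` is a hypothetical additive stand-in for measured losses (it is not the paper's universal function), used only to show that points with different (N, D) but the same rescaled x land on the same y.

```python
def collapse_coordinates(loss, n, d, alpha_n, alpha_d):
    """Map one (N, D, loss) measurement onto the rescaled axes of Figure 2."""
    x = (d ** alpha_d / n ** alpha_n) ** (1.0 / alpha_n)
    y = loss * n ** alpha_n
    return x, y

def toy_loss(n, d, alpha_n, alpha_d):
    # Hypothetical stand-in for a measured loss; real curves would also
    # show the double-descent peak that this toy form lacks.
    return n ** -alpha_n + d ** -alpha_d

# Two very different (N, D) pairs chosen to share the same rescaled x.
alpha_n, alpha_d = 2.0, 1.0  # illustrative exponents, not fitted values
for n, d in ((10, 400), (100, 40_000)):
    print(n, d, collapse_coordinates(toy_loss(n, d, alpha_n, alpha_d),
                                     n, d, alpha_n, alpha_d))
```

For the toy form, y = 1 + x^-alpha_n exactly, so the collapse is trivial here; with real measurements the same helper would be applied to each (N, D, loss) triple before plotting.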

Original abstract

We introduce a model for neural scaling laws under sparse activations. In the model, test loss is often dominated by rare coordinates that are never observed in the training input. This mechanism induces a novel bottleneck absent from dense models. We derive the asymptotic population loss in both the underparameterized and overparameterized regimes, and show that the loss exhibits a double-descent peak near the interpolation threshold -- where the number of parameters is just sufficient to fit the training data -- resulting in a loss curve governed by two distinct scaling exponents -- one for the overparameterized regime and one for the underparameterized regime -- with a gap determined by the degree of sparsity. Additionally, we derive a compute-optimal frontier that favors increasing dataset size over model capacity under fixed compute budgets. We also analyze gradient-descent dynamics and identify a scaling law for the probability that fixed-step gradient descent becomes unstable. We further show that the sparsity-induced effect persists under nonlinear activations.
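
As background for the instability result mentioned at the end of the abstract, the sketch below shows the standard stability condition for fixed-step gradient descent on a quadratic loss: the iterates shrink only if the step size is below 2 divided by the largest curvature. The curvature values are hypothetical, and how the paper turns this condition into a probability is not shown in this excerpt.

```python
def run_gd(curvatures, lr, steps=200):
    """Fixed-step gradient descent on 0.5 * sum(c_i * w_i**2), from w_i = 1."""
    weights = [1.0] * len(curvatures)
    for _ in range(steps):
        # The gradient of 0.5 * c * w**2 with respect to w is c * w.
        weights = [w - lr * c * w for w, c in zip(weights, curvatures)]
    return max(abs(w) for w in weights)

def is_stable(curvatures, lr):
    # Each coordinate is multiplied by (1 - lr * c) per step, so every
    # coordinate shrinks only if lr < 2 / max(c) (for positive c).
    return lr * max(curvatures) < 2.0

curvatures = [0.1, 1.0, 4.0]  # hypothetical Hessian eigenvalues
for lr in (0.4, 0.6):         # just below and just above 2 / 4
    print(lr, is_stable(curvatures, lr), run_gd(curvatures, lr))
```

With a fixed step size, a single unusually large curvature is enough to cross the threshold, which is one plausible route by which instability becomes a probabilistic event.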

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Sparse unobserved coordinates drive asymmetric scaling exponents and a data-favoring compute frontier, but the result rests on a strict zero-probability assumption.

The key takeaway is that this paper derives distinct scaling exponents for sparse feature models by positing that rare coordinates never seen in training dominate the test loss. This creates an asymmetry between under- and over-parameterized regimes, a double-descent peak, and a compute frontier that prefers scaling data over parameters. The new element is the explicit mechanism tying sparsity to the exponent gap and the bottleneck from unobserved mass.

They derive the asymptotic population loss in both regimes, analyze the interpolation threshold behavior, and extend the analysis to gradient descent instability and nonlinear cases. That gives a coherent story for how sparsity alters the usual scaling picture. The derivations look internally consistent on the stated assumptions. The compute-optimal frontier and the stability scaling law are concrete outputs that could be checked against simulations.

The main limitation is the assumption that certain coordinates have zero probability of observation and that this term dominates the loss. If real data has small but positive probability for those coordinates, or if the model can mitigate their error through other parameters, the predicted gap and the data-favoring frontier may shrink or disappear. The paper does not report checks that relax the strict zero-probability condition while holding other elements fixed.

This work is aimed at researchers focused on scaling laws and sparse representations. It deserves peer review because the central claim is new, the math is presented in enough detail to evaluate, and the topic matters for how we allocate compute in large models.

Referee Report

2 major / 1 minor

Summary. The paper introduces a model for neural scaling laws under sparse activations where test loss is dominated by rare coordinates never observed in training inputs. This induces a novel bottleneck. The authors derive asymptotic population loss in underparameterized and overparameterized regimes, showing double-descent near the interpolation threshold with two distinct scaling exponents whose gap depends on sparsity degree. They also derive a compute-optimal frontier favoring dataset size over capacity, analyze gradient-descent instability scaling, and show the effect persists under nonlinear activations.

Significance. If the derivations hold under the stated sparsity model, the work provides a mechanistic explanation for asymmetric scaling and double descent absent in dense models, along with concrete predictions for compute allocation. The closed-form asymptotics and GD dynamics analysis would be notable strengths for the scaling-laws literature.

major comments (2)

[Abstract] Abstract and introduction: The central claim that the loss curve is governed by two distinct scaling exponents with a sparsity-determined gap rests on the assumption that unobserved coordinates (with zero training probability) dominate population loss in both regimes. The skeptic note correctly identifies that relaxing this to small positive probability could eliminate the dominance and thus the gap; the manuscript does not appear to contain a robustness derivation or simulation relaxing the strict zero-probability condition while holding other elements fixed.
[Abstract] The derivation of the compute-optimal frontier and the GD instability scaling law inherits the same unobserved-mass dominance assumption. Without an explicit statement of the generative process (e.g., the precise support of the coordinate distribution) or a check that indirect parameter sharing cannot reduce error on rare coordinates, it is unclear whether the reported exponents remain load-bearing when the model is misspecified relative to real data.

minor comments (1)

The abstract states that asymptotic losses and exponents are derived, yet the provided text contains no equations, assumptions list, or validation steps; the full manuscript should include these in a dedicated theory section with numbered equations for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment of the work's significance and for the detailed comments on our modeling assumptions. Our results are derived under an explicit sparse-activation model in which certain coordinates have exactly zero training probability; this is the source of the novel bottleneck and the two distinct scaling exponents. We address each major comment below and will incorporate clarifications in the revision.

Referee: [Abstract] Abstract and introduction: The central claim that the loss curve is governed by two distinct scaling exponents with a sparsity-determined gap rests on the assumption that unobserved coordinates (with zero training probability) dominate population loss in both regimes. The skeptic note correctly identifies that relaxing this to small positive probability could eliminate the dominance and thus the gap; the manuscript does not appear to contain a robustness derivation or simulation relaxing the strict zero-probability condition while holding other elements fixed.

Authors: The zero-probability assumption is not incidental but defines the model we analyze; the skeptic note in the manuscript already flags this as the origin of the asymmetric exponents. Our derivations isolate the effect of this extreme sparsity regime, which produces a bottleneck absent from dense models. We do not claim the two-exponent gap persists under small positive probabilities, as that would constitute a different generative process reverting toward standard scaling. In revision we will expand the discussion of the skeptic note to state the precise conditions under which the reported gap holds. revision: partial
Referee: [Abstract] The derivation of the compute-optimal frontier and the GD instability scaling law inherits the same unobserved-mass dominance assumption. Without an explicit statement of the generative process (e.g., the precise support of the coordinate distribution) or a check that indirect parameter sharing cannot reduce error on rare coordinates, it is unclear whether the reported exponents remain load-bearing when the model is misspecified relative to real data.

Authors: Section 2 defines the generative process: coordinates are drawn from a distribution whose support is finite, with a sparse subset assigned zero probability under the training measure but positive probability under the population measure. Because each coordinate has its own dedicated parameter in the linear case, indirect sharing cannot affect error on truly unobserved coordinates. The compute-optimal and GD-instability results are therefore load-bearing inside this model. We will add an explicit one-sentence statement of the generative process to the abstract and introduction. revision: partial
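
The dedicated-parameter argument in the response above can be illustrated with a small sketch: in a linear model with one weight per coordinate, a coordinate that never fires in training receives no gradient, so its weight never leaves its initial value. The data, the function name `sgd_linear`, and the hyperparameters are hypothetical and chosen only for illustration.

```python
def sgd_linear(examples, dim, lr=0.1, epochs=50):
    """Plain SGD on squared error for y ~ sum_k w[k] * x[k], with w = 0 at start.

    Each example is (active, y): a dict {coordinate: value} and a target.
    """
    w = [0.0] * dim
    for _ in range(epochs):
        for active, y in examples:
            err = sum(w[k] * v for k, v in active.items()) - y
            for k, v in active.items():
                # Only coordinates active in this example get a gradient.
                w[k] -= lr * err * v
    return w

# Hypothetical data: the target weights every coordinate by 1, but
# coordinate 2 never fires in the training set.
train = [({0: 1.0}, 1.0), ({1: 1.0}, 1.0), ({0: 1.0, 1: 1.0}, 2.0)]
w = sgd_linear(train, dim=3)
print(w)  # w[2] is still exactly 0.0 because it was never updated

# A test input that activates only the unseen coordinate is predicted as 0.
test_prediction = sum(w[k] * v for k, v in {2: 1.0}.items())
print(test_prediction)
```

This only demonstrates the linear, one-parameter-per-coordinate case that the response describes; it says nothing about models where parameters are shared across coordinates.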

Circularity Check

0 steps flagged

No circularity: derivations follow directly from model assumptions without reduction to inputs

The paper introduces an explicit generative model with fixed zero-probability coordinates and derives asymptotic population loss, double-descent shape, and scaling exponents as mathematical consequences of that model. No equations or text in the abstract or description indicate self-definitional steps, fitted parameters renamed as predictions, or load-bearing self-citations. The claimed results are consequences of the stated sparsity assumptions rather than tautological restatements, so the derivation chain is self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the ledger is therefore empty pending access to the full derivations.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: We derive the asymptotic population loss... two distinct scaling exponents... gap determined by the degree of sparsity... K(D) = Γ(1−1/(α1+1)) D^{1/(α1+1)}

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Phase diagram in (α1, α2)... Two-exponent scaling α1 > 0
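
The expression K(D) = Γ(1−1/(α1+1)) D^{1/(α1+1)} quoted in the first passage above has a natural reading as the expected number of distinct coordinates observed in D training examples. The derivation below is a sketch under that reading, assuming coordinate k is active independently with probability p_k = k^{-(α1+1)}; neither the interpretation nor the activation profile is confirmed by the excerpt.

```latex
% Assumption (hypothetical): p_k = k^{-(\alpha_1+1)}, independent across
% the D training examples, so coordinate k is seen with prob. ~ 1 - e^{-D p_k}.
\begin{align*}
K(D) &\approx \sum_{k\ge 1}\left(1 - e^{-D p_k}\right)
      \approx \int_0^\infty \left(1 - e^{-D x^{-(\alpha_1+1)}}\right) dx \\
     &= s\, D^{s} \int_0^\infty \left(1 - e^{-u}\right) u^{-s-1}\, du
      \qquad \bigl(u = D x^{-(\alpha_1+1)},\ s = \tfrac{1}{\alpha_1+1}\bigr) \\
     &= s\, D^{s} \cdot \frac{\Gamma(1-s)}{s}
      = \Gamma\!\left(1 - \tfrac{1}{\alpha_1+1}\right) D^{1/(\alpha_1+1)} .
\end{align*}
```

The middle integral converges only for 0 < s < 1, that is for α1 > 0, and it is evaluated by one integration by parts. For α1 = 1 the result reduces to K(D) = sqrt(πD).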

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

86 extracted references · 33 canonical work pages · 9 internal anchors

[1]

Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS) , series =

Deep Sparse Rectifier Neural Networks , author =. Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS) , series =
[2]

Toy Models of Superposition

Toy Models of Superposition , author =. Transformer Circuits Thread , year =. 2209.10652 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Proceedings of the 41st International Conference on Machine Learning , series =

Scaling Laws for Fine-Grained Mixture of Experts , author =. Proceedings of the 41st International Conference on Machine Learning , series =
[4]

Learning Quadratic Neural Networks in High Dimensions:

Ben Arous, G. Learning Quadratic Neural Networks in High Dimensions:. Advances in Neural Information Processing Systems (NeurIPS 2025) , year =. 2508.03688 , archivePrefix =

work page arXiv 2025
[5]

arXiv preprint arXiv:2602.23039 , year =

Dynamics of Neural Scaling Laws in Random Feature Regression with Powerlaw-Distributed Kernel Eigenvalues , author =. arXiv preprint arXiv:2602.23039 , year =. 2602.23039 , archivePrefix =

work page arXiv
[6]

Advances in Neural Information Processing Systems (NeurIPS 2022) , volume =

Learning Sparse Features Can Lead to Overfitting in Neural Networks , author =. Advances in Neural Information Processing Systems (NeurIPS 2022) , volume =. 2022 , eprint =

2022
[7]

Theory of Optimal Learning Rate Schedules and Scaling Laws for a Random Feature Model

Theory of Optimal Learning Rate Schedules and Scaling Laws for a Random Feature Model , author =. arXiv preprint arXiv:2602.04774 , year =. 2602.04774 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv
[8]

arXiv preprint arXiv:2602.07488 , year =

Deriving Neural Scaling Laws from the Statistics of Natural Language , author =. arXiv preprint arXiv:2602.07488 , year =. 2602.07488 , archivePrefix =

work page arXiv
[9]

arXiv preprint arXiv:2601.10684 , year =

On the Origin of Neural Scaling Laws: From Random Graphs to Natural Language , author =. arXiv preprint arXiv:2601.10684 , year =. 2601.10684 , archivePrefix =

work page arXiv
[10]

and Thilak, Vimal , booktitle =

Abnar, Samira and Shah, Harshay and Busbridge, Dan and El-Nouby, Alaaeldin and Susskind, Joshua M. and Thilak, Vimal , booktitle =. Parameters vs
[11]

Scaling Laws for Autoregressive Generative Modeling

Scaling Laws for Autoregressive Generative Modeling , author =. arXiv preprint arXiv:2010.14701 , year =. 2010.14701 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv 2010
[12]

Nature , volume =

Emergence of Simple-Cell Receptive Field Properties by Learning a Sparse Code for Natural Images , author =. Nature , volume =. 1996 , doi =

1996
[13]

Advances in Neural Information Processing Systems (NeurIPS 2024) , volume =

Paquette, Elliot and Paquette, Courtney and Xiao, Lechao and Pennington, Jeffrey , title =. Advances in Neural Information Processing Systems (NeurIPS 2024) , volume =

2024
[14]

Advances in Neural Information Processing Systems (NeurIPS 2024) , volume =

Scaling Laws in Linear Regression: Compute, Parameters, and Data , author =. Advances in Neural Information Processing Systems (NeurIPS 2024) , volume =. 2024 , pages =

2024
[15]

Advances in Neural Information Processing Systems (NeurIPS 2025) , journal =

Improved Scaling Laws in Linear Regression via Data Reuse , author =. Advances in Neural Information Processing Systems (NeurIPS 2025) , journal =. 2025 , eprint =

2025
[16]

Scaling and renormalization in high-dimensional regression

Scaling and Renormalization in High-Dimensional Regression , author =. arXiv preprint arXiv:2405.00592 , year =

work page internal anchor Pith review Pith/arXiv arXiv
[17]

Sardana, Nikhil and Portes, Jacob and Doubov, Sasha and Frankle, Jonathan , booktitle =. Beyond. 2024 , volume =

2024
[18]

Proceedings of the 40th International Conference on Machine Learning , pages =

Data Efficient Neural Scaling Law via Model Reusing , author =. Proceedings of the 40th International Conference on Machine Learning , pages =. 2023 , volume =

2023
[19]

Advances in Neural Information Processing Systems (NeurIPS 2024) , volume =

Resolving Discrepancies in Compute-Optimal Scaling of Language Models , author =. Advances in Neural Information Processing Systems (NeurIPS 2024) , volume =

2024
[20]

Advances in Neural Information Processing Systems (NeurIPS 2023) , volume =

Scaling Data-Constrained Language Models , author =. Advances in Neural Information Processing Systems (NeurIPS 2023) , volume =. 2023 , eprint =

2023
[21]

Journal of Statistical Mechanics: Theory and Experiment , abstract =

Spigler, Stefano and Geiger, Mario and Wyart, Matthieu , title =. Journal of Statistical Mechanics: Theory and Experiment , abstract =. doi:10.1088/1742-5468/abc61d , year =

work page doi:10.1088/1742-5468/abc61d
[22]

Proceedings of the 37th International Conference on Machine Learning , series =

Spectrum Dependent Learning Curves in Kernel Regression and Wide Neural Networks , author =. Proceedings of the 37th International Conference on Machine Learning , series =. 2020 , pdf =

2020
[23]

Spector, Blake Bordelon, Niklas Muennighoff, Mansheej Paul, Cengiz Pehlevan, Christopher Ré, and Aditi Raghunathan

Scaling Laws for Precision , author =. The Thirteenth International Conference on Learning Representations , year =. 2411.04330 , archivePrefix =

work page arXiv
[24]

Scaling Laws for Neural Language Models

Scaling Laws for Neural Language Models , author=. arXiv preprint arXiv:2001.08361 , year=

work page internal anchor Pith review Pith/arXiv arXiv 2001
[25]

Advances in Neural Information Processing Systems (NeurIPS 2022) , volume =

Training Compute-Optimal Large Language Models , author =. Advances in Neural Information Processing Systems (NeurIPS 2022) , volume =. 2022 , eprint =

2022
[26]

Deep Learning Scaling is Predictable, Empirically

Deep Learning Scaling is Predictable, Empirically , author=. arXiv preprint arXiv:1712.00409 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[27]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages =

Scaling Vision Transformers , author =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages =. 2022 , eprint =

2022
[28]

arXiv preprint arXiv:2210.16859 , year=

A Solvable Model of Neural Scaling Laws , author=. arXiv preprint arXiv:2210.16859 , year=

work page arXiv
[29]

arXiv preprint arXiv:2102.06701 , year=

Explaining Neural Scaling Laws , author=. arXiv preprint arXiv:2102.06701 , year=

work page arXiv
[30]

Proceedings of the 41st International Conference on Machine Learning , pages =

A Dynamical Model of Neural Scaling Laws , author =. Proceedings of the 41st International Conference on Machine Learning , pages =. 2024 , volume =

2024
[31]

arXiv preprint arXiv:2312.09194 , year=

Dyson Equation for Correlated Linearizations and Test Error of Random Features Regression , author=. arXiv preprint arXiv:2312.09194 , year=

work page arXiv
[32]

Journal of Statistical Mechanics: Theory and Experiment , volume =

How Feature Learning Can Improve Neural Scaling Laws , author =. Journal of Statistical Mechanics: Theory and Experiment , volume =. 2025 , doi =

2025
[33]

Advances in Neural Information Processing Systems (NeurIPS 2024) , volume =

Dimension-Free Deterministic Equivalents and Scaling Laws for Random Feature Regression , author =. Advances in Neural Information Processing Systems (NeurIPS 2024) , volume =. 2024 , eprint =

2024
[34]

Scaling laws and spectra of shallow neural networks in the feature learning regime.arXiv preprint arXiv:2509.24882,

Scaling Laws and Spectra of Shallow Neural Networks in the Feature Learning Regime , author =. The Fourteenth International Conference on Learning Representations , year =. 2509.24882 , archivePrefix =

work page arXiv
[35]

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Shazeer, Noam and Mirhoseini, Azalia and Maziarz, Krzysztof and Davis, Andy and Le, Quoc V. and Hinton, Geoffrey E. and Dean, Jeff , title =. The Fifth International Conference on Learning Representations , year =. 1701.06538 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv
[36]

Langley , title =

P. Langley , title =. Proceedings of the 17th International Conference on Machine Learning (ICML 2000) , address =. 2000 , pages =

2000
[37]

T. M. Mitchell. The Need for Biases in Learning Generalizations. 1980

1980
[38]

M. J. Kearns , title =
[39]

Machine Learning: An Artificial Intelligence Approach, Vol. I. 1983

1983
[40]

R. O. Duda and P. E. Hart and D. G. Stork. Pattern Classification. 2000

2000
[41]

International Conference on Learning Representations (ICLR) 2020 , year =

A Constructive Prediction of the Generalization Error Across Scales , author =. International Conference on Learning Representations (ICLR) 2020 , year =

2020
[42]

Suppressed for Anonymity , author=
[43]

Newell and P

A. Newell and P. S. Rosenbloom. Mechanisms of Skill Acquisition and the Law of Practice. Cognitive Skills and Their Acquisition. 1981

1981
[44]

A. L. Samuel. Some Studies in Machine Learning Using the Game of Checkers. IBM Journal of Research and Development. 1959

1959
[45]

The jamming transition as a paradigm to understand the loss landscape of deep neural networks

Jamming transition as a paradigm to understand the loss landscape of deep neural networks , author =. Physical Review E , volume =. 2019 , doi =. 1809.09349 , archivePrefix=

work page internal anchor Pith review Pith/arXiv arXiv 2019
[46]

Proceedings of the National Academy of Sciences , volume =

Reconciling modern machine-learning practice and the classical bias--variance trade-off , author =. Proceedings of the National Academy of Sciences , volume =. 2019 , doi =. 1812.11118 , archivePrefix=

work page arXiv 2019
[47]

SIAM Journal on Mathematics of Data Science , volume =

Two models of double descent for weak features , author =. SIAM Journal on Mathematics of Data Science , volume =. 2020 , doi =. 1903.07571 , archivePrefix=

work page arXiv 2020
[48]

The Annals of Statistics , volume =

Surprises in high-dimensional ridgeless least squares interpolation , author =. The Annals of Statistics , volume =. 2022 , doi =. 1903.08560 , archivePrefix=

work page arXiv 2022
[49]

Proceedings of the National Academy of Sciences , volume =

Benign overfitting in linear regression , author =. Proceedings of the National Academy of Sciences , volume =. 2020 , doi =. 1906.11300 , archivePrefix=

work page arXiv 2020
[50]

Physical Review X , volume =

Modelling the influence of data structure on learning in neural networks: the hidden manifold model , author =. Physical Review X , volume =. 2020 , doi =

2020
[51]

Proceedings of the 37th International Conference on Machine Learning , series =

Generalisation error in learning with random features and the hidden manifold model , author =. Proceedings of the 37th International Conference on Machine Learning , series =
[52]

Goldt, Sebastian and Loureiro, Bruno and Reeves, Galen and Krzakala, Florent and M. The. Proceedings of The 33rd International Conference on Algorithmic Learning Theory , series =
[53]

Superposition Yields Robust Neural Scaling

Superposition Yields Robust Neural Scaling , author =. Advances in Neural Information Processing Systems (NeurIPS 2025) , year =. 2505.10465 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv 2025
[54]

Transformer Circuits Thread , year =

Towards Monosemanticity: Decomposing Language Models with Dictionary Learning , author =. Transformer Circuits Thread , year =
[55]

Advances in Neural Information Processing Systems (NeurIPS 2019) , volume =

On the Inductive Bias of Neural Tangent Kernels , author =. Advances in Neural Information Processing Systems (NeurIPS 2019) , volume =. 2019 , eprint =

2019
[56]

Journal of Machine Learning Research , volume =

Breaking the Curse of Dimensionality with Convex Neural Networks , author =. Journal of Machine Learning Research , volume =. 2017 , url =

2017
[57]

Proceedings of the 37th International Conference on Machine Learning , series =

Frequency Bias in Neural Networks for Input of Non-Uniform Density , author =. Proceedings of the 37th International Conference on Machine Learning , series =
[58]

The Annals of Applied Probability , volume =

A Random Matrix Approach to Neural Networks , author =. The Annals of Applied Probability , volume =. 2018 , doi =

2018
[59]

Transformer Circuits Thread , year =

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet , author =. Transformer Circuits Thread , year =
[60]

Doklady Akademii Nauk SSSR , volume=

A method for solving the convex programming problem with convergence rate O(1/k^2) , author=. Doklady Akademii Nauk SSSR , volume=
[61]

Foundations of Computational Mathematics , volume=

Adaptive restart for accelerated gradient schemes , author=. Foundations of Computational Mathematics , volume=. 2015 , doi=

2015
[62]

Advances in Neural Information Processing Systems (NeurIPS 2023) , volume =

Improved Convergence in High Probability of Clipped Gradient Methods with Heavy Tailed Noise , author =. Advances in Neural Information Processing Systems (NeurIPS 2023) , volume =

2023
[63]

Communications on Pure and Applied Mathematics , volume =

The generalization error of random features regression: Precise asymptotics and the double descent curve , author =. Communications on Pure and Applied Mathematics , volume =. 2022 , doi =. 1908.05355 , archivePrefix=

work page arXiv 2022
[64]

Applied and Computational Harmonic Analysis , volume =

Generalization Error of Random Feature and Kernel Methods: Hypercontractivity and Kernel Matrix Concentration , author =. Applied and Computational Harmonic Analysis , volume =. 2022 , doi =

2022
[65]

Journal of Statistical Mechanics: Theory and Experiment , volume =

The Committee Machine: Computational to Statistical Gaps in Learning a Two-Layers Neural Network , author =. Journal of Statistical Mechanics: Theory and Experiment , volume =. 2019 , doi =

2019
[66]

Proceedings of the National Academy of Sciences , volume =

Optimal Errors and Phase Transitions in High-Dimensional Generalized Linear Models , author =. Proceedings of the National Academy of Sciences , volume =. 2019 , doi =

2019
[67]

Advances in Neural Information Processing Systems (NeurIPS 2020) , volume =

When Do Neural Networks Outperform Kernel Methods? , author =. Advances in Neural Information Processing Systems (NeurIPS 2020) , volume =. 2020 , eprint =

2020
[68]

Proceedings of the Thirty Fourth Conference on Learning Theory , series =

Learning with Invariances in Random Features and Kernel Models , author =. Proceedings of the Thirty Fourth Conference on Learning Theory , series =
[69]

Scaling Laws for Precision in High-Dimensional Linear Regression. arXiv preprint arXiv:2602.19241.

[70]

Functional Scaling Laws in Kernel Regression: Loss Dynamics and Learning Rate Schedules. Advances in Neural Information Processing Systems (NeurIPS 2025), 2025. arXiv:2509.19189.

[71]

Statistical Physics of Deep Learning: Optimal Learning of a Multi-Layer Perceptron Near Interpolation. arXiv preprint arXiv:2510.24616.

[72]

Yunwei Ren, Eshaan Nichani, Denny Wu, and Jason D. Lee. Emergence and Scaling Laws in …, 2025.

[73]

Kernel Ridge Regression under Power-Law Data: Spectrum and Generalization. arXiv preprint arXiv:2510.04780.

[74]

Analyzing Neural Scaling Laws in Two-Layer Networks with Power-Law Data Spectra. The Thirteenth International Conference on Learning Representations. arXiv:2410.09005.

[75]

On the Power and Limitations of Random Features for Understanding Neural Networks. Advances in Neural Information Processing Systems (NeurIPS 2019), 2019.

[76]

Power-Law Spectrum of the Random Feature Model. arXiv preprint arXiv:2603.14578.

[77]

Deep Double Descent: Where Bigger Models and More Data Hurt. International Conference on Learning Representations (ICLR). arXiv:1912.02292.

[78]

Random Features for Large-Scale Kernel Machines. Advances in Neural Information Processing Systems (NeurIPS 2007), 2007.

[79]

Universality Laws for High-Dimensional Learning with Random Features. IEEE Transactions on Information Theory, 2023.

[80]

Youngmin Cho and Lawrence K. Saul. Advances in Neural Information Processing Systems 22 (NeurIPS 2009), 2009.
