Asymmetric Scaling Laws from Sparse Features
Pith reviewed 2026-05-25 03:14 UTC · model grok-4.3
The pith
Sparse activations make test loss dominated by coordinates never seen in training, yielding asymmetric scaling laws.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Test loss is dominated by the population mass on coordinates that remain unobserved during training. This induces a double-descent peak at the interpolation threshold and produces distinct scaling exponents in the under- and overparameterized regimes whose gap is fixed by the sparsity degree.
What carries the argument
Because activations are sparse, some rare coordinates are never observed in training, and the population mass on those coordinates dominates the asymptotic population loss.
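A minimal sketch of this mechanism follows. It assumes, hypothetically, that coordinate k activates independently in each training input with probability p_k = k^-(1 + alpha); this page does not state the paper's generative process, and the function names, `alpha`, and `K` are illustrative choices rather than the paper's. The sketch compares the population mass on never-activated coordinates in one simulated training set with its expectation, sum_k w_k (1 - p_k)^D.

```python
import numpy as np


def activation_probs(K, alpha):
    """Hypothetical power-law activation probabilities p_k = k^-(1 + alpha)."""
    k = np.arange(1, K + 1, dtype=float)
    return k ** -(1.0 + alpha)


def unobserved_mass(p, D, rng):
    """Share of population activation mass on coordinates never seen in D samples."""
    # Coordinate k is seen at least once iff a Binomial(D, p_k) count is positive.
    seen = rng.binomial(D, p) > 0
    weights = p / p.sum()
    return float(weights[~seen].sum())


def expected_unobserved_mass(p, D):
    """Expectation of unobserved_mass: sum_k w_k * (1 - p_k)^D."""
    weights = p / p.sum()
    return float(np.sum(weights * (1.0 - p) ** D))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    p = activation_probs(K=100_000, alpha=0.5)
    for D in (100, 1_000, 10_000):
        print(D, unobserved_mass(p, D, rng), expected_unobserved_mass(p, D))
```

Under these assumptions the unobserved mass shrinks as D grows, but only polynomially, which is the kind of slow-decaying term that could dominate a test loss.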
If this is right
- The loss exhibits a double-descent peak near the interpolation threshold.
- Two distinct scaling exponents govern the loss curve, separated by a gap set by sparsity.
- Under a fixed compute budget, the compute-optimal frontier favors increasing dataset size over model capacity.
- The probability that fixed-step gradient descent becomes unstable follows a scaling law.
- The sparsity-induced effect holds under nonlinear activations.
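The gradient-descent item can be made concrete with a standard fact: on a least-squares loss with Hessian H, fixed-step gradient descent with step size eta diverges when eta * lambda_max(H) > 2. The sketch below estimates how often that condition holds over random binary designs. The Bernoulli feature model and all function names are assumptions made for illustration; the paper's own instability law is not reproduced here.

```python
import numpy as np


def sparse_design(D, p, rng):
    """D x K binary design; column k is active with probability p[k] (assumed model)."""
    return (rng.random((D, p.size)) < p).astype(float)


def lambda_max(X):
    """Largest eigenvalue of the least-squares Hessian H = X^T X / D."""
    return np.linalg.norm(X, 2) ** 2 / X.shape[0]


def instability_probability(D, p, eta, trials, rng):
    """Fraction of random designs for which eta * lambda_max(H) > 2."""
    unstable = [
        eta * lambda_max(sparse_design(D, p, rng)) > 2.0 for _ in range(trials)
    ]
    return float(np.mean(unstable))


def run_gd(X, y, eta, steps):
    """Fixed-step gradient descent on (1 / 2D) * ||X w - y||^2, starting from w = 0."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= eta * X.T @ (X @ w - y) / X.shape[0]
    return w
```

Calling `instability_probability` over a range of `D` at fixed `eta` gives an empirical curve that could be compared against a predicted scaling law.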
Where Pith is reading between the lines
- This suggests that covering rare features in data collection could be more important than previously thought for scaling performance.
- The model may apply to other sparse data domains like language or images where certain combinations are rare.
- Practitioners might adjust training to explicitly handle or estimate the unobserved mass to mitigate the bottleneck.
Load-bearing premise
That the sparsity produces coordinates which stay completely unobserved in the entire training set and that their contribution dominates the population loss.
What would settle it
Finding a sparse dataset where the error contribution from never-observed coordinates does not set the scaling behavior or where the two exponents do not appear with the predicted gap.
Original abstract
We introduce a model for neural scaling laws under sparse activations. In the model, test loss is often dominated by rare coordinates that are never observed in the training input. This mechanism induces a novel bottleneck absent from dense models. We derive the asymptotic population loss in both the underparameterized and overparameterized regimes, and show that the loss exhibits a double-descent peak near the interpolation threshold -- where the number of parameters is just sufficient to fit the training data -- resulting in a loss curve governed by two distinct scaling exponents -- one for the overparameterized regime and one for the underparameterized regime -- with a gap determined by the degree of sparsity. Additionally, we derive a compute-optimal frontier that favors increasing dataset size over model capacity under fixed compute budgets. We also analyze gradient-descent dynamics and identify a scaling law for the probability that fixed-step gradient descent becomes unstable. We further show that the sparsity-induced effect persists under nonlinear activations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a model for neural scaling laws under sparse activations where test loss is dominated by rare coordinates never observed in training inputs. This induces a novel bottleneck. The authors derive asymptotic population loss in underparameterized and overparameterized regimes, showing double-descent near the interpolation threshold with two distinct scaling exponents whose gap depends on sparsity degree. They also derive a compute-optimal frontier favoring dataset size over capacity, analyze gradient-descent instability scaling, and show the effect persists under nonlinear activations.
Significance. If the derivations hold under the stated sparsity model, the work provides a mechanistic explanation for asymmetric scaling and double descent absent in dense models, along with concrete predictions for compute allocation. The closed-form asymptotics and GD dynamics analysis would be notable strengths for the scaling-laws literature.
major comments (2)
- [Abstract] Abstract and introduction: The central claim that the loss curve is governed by two distinct scaling exponents with a sparsity-determined gap rests on the assumption that unobserved coordinates (with zero training probability) dominate population loss in both regimes. The skeptic note correctly identifies that relaxing this to small positive probability could eliminate the dominance and thus the gap; the manuscript does not appear to contain a robustness derivation or simulation relaxing the strict zero-probability condition while holding other elements fixed.
- [Abstract] The derivation of the compute-optimal frontier and the GD instability scaling law inherits the same unobserved-mass dominance assumption. Without an explicit statement of the generative process (e.g., the precise support of the coordinate distribution) or a check that indirect parameter sharing cannot reduce error on rare coordinates, it is unclear whether the reported exponents remain load-bearing when the model is misspecified relative to real data.
minor comments (1)
- The abstract states that asymptotic losses and exponents are derived, yet the provided text contains no equations, assumptions list, or validation steps; the full manuscript should include these in a dedicated theory section with numbered equations for reproducibility.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of the work's significance and for the detailed comments on our modeling assumptions. Our results are derived under an explicit sparse-activation model in which certain coordinates have exactly zero training probability; this is the source of the novel bottleneck and the two distinct scaling exponents. We address each major comment below and will incorporate clarifications in the revision.
- Referee: [Abstract] Abstract and introduction: The central claim that the loss curve is governed by two distinct scaling exponents with a sparsity-determined gap rests on the assumption that unobserved coordinates (with zero training probability) dominate population loss in both regimes. The skeptic note correctly identifies that relaxing this to small positive probability could eliminate the dominance and thus the gap; the manuscript does not appear to contain a robustness derivation or simulation relaxing the strict zero-probability condition while holding other elements fixed.
  Authors: The zero-probability assumption is not incidental but defines the model we analyze; the skeptic note in the manuscript already flags this as the origin of the asymmetric exponents. Our derivations isolate the effect of this extreme sparsity regime, which produces a bottleneck absent from dense models. We do not claim the two-exponent gap persists under small positive probabilities, as that would constitute a different generative process reverting toward standard scaling. In revision we will expand the discussion of the skeptic note to state the precise conditions under which the reported gap holds.
  Revision: partial
- Referee: [Abstract] The derivation of the compute-optimal frontier and the GD instability scaling law inherits the same unobserved-mass dominance assumption. Without an explicit statement of the generative process (e.g., the precise support of the coordinate distribution) or a check that indirect parameter sharing cannot reduce error on rare coordinates, it is unclear whether the reported exponents remain load-bearing when the model is misspecified relative to real data.
  Authors: Section 2 defines the generative process: coordinates are drawn from a distribution whose support is finite, with a sparse subset assigned zero probability under the training measure but positive probability under the population measure. Because each coordinate has its own dedicated parameter in the linear case, indirect sharing cannot affect error on truly unobserved coordinates. The compute-optimal and GD-instability results are therefore load-bearing inside this model. We will add an explicit one-sentence statement of the generative process to the abstract and introduction.
  Revision: partial
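The rebuttal's point that a dedicated parameter cannot learn from data in which its coordinate never appears can be checked in a few lines. This is a sketch under assumptions: a linear model with one weight per coordinate, squared loss, and a minimum-norm least-squares fit via `numpy.linalg.lstsq`. The sizes, the choice of unobserved coordinates, and the variable names are illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 50, 8                      # illustrative sizes
X = rng.random((D, K))            # training inputs
unobserved = [5, 7]               # coordinates with zero training probability (assumed)
X[:, unobserved] = 0.0

w_true = rng.normal(size=K)
y = X @ w_true                    # targets only carry signal from observed coordinates

# Gradient of the squared loss at any w is X^T (X w - y) / D; its entries for
# unobserved coordinates are zero because their columns of X are zero.
w = rng.normal(size=K)
grad = X.T @ (X @ w - y) / D

# The minimum-norm least-squares solution assigns zero weight to all-zero columns.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Squared test error on a population input that activates an unobserved coordinate.
x_test = np.zeros(K)
x_test[5] = 1.0
test_error = float((x_test @ (w_hat - w_true)) ** 2)
```

Under these assumptions the error on such a test input equals the squared true weight of the unobserved coordinate, however many training samples are drawn.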
Circularity Check
No circularity: derivations follow directly from model assumptions without reduction to inputs
The paper introduces an explicit generative model with fixed zero-probability coordinates and derives asymptotic population loss, double-descent shape, and scaling exponents as mathematical consequences of that model. No equations or text in the abstract or description indicate self-definitional steps, fitted parameters renamed as predictions, or load-bearing self-citations. The claimed results are consequences of the stated sparsity assumptions rather than tautological restatements, so the derivation chain is self-contained.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
  unclear · Relation between the paper passage and the cited Recognition theorem.
  Passage: We derive the asymptotic population loss... two distinct scaling exponents... gap determined by the degree of sparsity... K(D) = Γ(1−1/(α1+1)) D^{1/(α1+1)}
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking
  unclear · Relation between the paper passage and the cited Recognition theorem.
  Passage: Phase diagram in (α1, α2)... Two-exponent scaling α1 > 0
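The expression K(D) = Γ(1−1/(α1+1)) D^{1/(α1+1)} quoted in the first passage has the form of a standard occupancy asymptotic. The sketch below assumes, since this page defines neither K(D) nor the coordinate distribution, that K(D) is the expected number of distinct coordinates seen in D samples when coordinate k activates independently with probability k^-(α1+1). Under that reading, the exact sum over coordinates approaches the closed form as D grows. The function names and the truncation level `K` are hypothetical.

```python
import math

import numpy as np


def expected_observed(D, alpha1, K=1_000_000):
    """Exact expected count of coordinates seen at least once, truncated at K."""
    k = np.arange(1, K + 1, dtype=float)
    p = k ** -(alpha1 + 1.0)
    # log1p keeps (1 - p)^D accurate for tiny p; p = 1 at k = 1 gives exp(-inf) = 0.
    with np.errstate(divide="ignore"):
        miss = np.exp(D * np.log1p(-p))
    return float(np.sum(1.0 - miss))


def closed_form(D, alpha1):
    """Gamma(1 - 1/(alpha1 + 1)) * D^(1/(alpha1 + 1)), as quoted in the passage."""
    a = alpha1 + 1.0
    return math.gamma(1.0 - 1.0 / a) * D ** (1.0 / a)
```

Comparing `expected_observed(D, alpha1)` with `closed_form(D, alpha1)` for increasing `D` is one way to test whether this reading of K(D) is plausible.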
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.