
arxiv: 2605.17767 · v1 · pith:JGKG3E4Cnew · submitted 2026-05-18 · 📊 stat.ML · cs.LG

Feature Learning in Linear-Width Two-Layer Networks: Two vs. One Step of Gradient Descent

Pith reviewed 2026-05-20 01:25 UTC · model grok-4.3

keywords feature learning · two-layer networks · gradient descent · linear-width regime · spiked random matrix · batch reuse · information exponent

The pith

In the linear-width regime, the weights of a two-layer network after the second gradient step act as a spiked random matrix whose number of outliers is floor(alpha2 / (1/2 - alpha1)).

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper analyzes feature learning in two-layer networks when hidden units, samples, and input dimension all scale proportionally. A single gradient step yields an approximately rank-one update, which captures only one direction and requires the target to have information exponent one. With two steps whose sizes scale as N^alpha1 and N^alpha2 for alpha1, alpha2 in [0, 0.5), the weight matrix acquires a spiked spectrum whose outlier count equals floor(alpha2 / (1/2 - alpha1)). Reusing the same batch across both steps lets the second update align with directions whose information exponent exceeds one, provided the exponents are chosen suitably, whereas independent batches restrict learning to directions with information exponent one.

Core claim

The weights after the second gradient step behave as a spiked random matrix with multiple outliers, the number of which is given by floor(alpha2 / (1/2 - alpha1)), and batch reuse enables the second update to capture directions with information exponent exceeding one when alpha1 and alpha2 are chosen appropriately.
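As a quick arithmetic check of the count stated above, the following Python sketch evaluates floor(alpha2 / (1/2 - alpha1)) exactly. The function name `outlier_count` and the use of exact fractions (to avoid floating-point rounding at integer boundaries) are illustrative choices made here, not something taken from the paper.

```python
from fractions import Fraction
from math import floor


def outlier_count(alpha1, alpha2):
    """Evaluate floor(alpha2 / (1/2 - alpha1)) for exponents in [0, 1/2).

    Pass the exponents as strings (e.g. "0.3") or Fractions so the
    arithmetic is exact. `outlier_count` is a hypothetical helper name.
    """
    a1, a2 = Fraction(alpha1), Fraction(alpha2)
    half = Fraction(1, 2)
    if not (0 <= a1 < half and 0 <= a2 < half):
        raise ValueError("both exponents must lie in [0, 1/2)")
    return floor(a2 / (half - a1))


if __name__ == "__main__":
    # alpha1 = 0.3, alpha2 = 0.4 is the setting quoted in the Figure 3 caption.
    print(outlier_count("0.3", "0.4"))  # 2
```

With alpha1 = 0.3 and alpha2 = 0.4 the ratio is 0.4 / 0.2 = 2, which agrees with the two outlying singular values mentioned in the Figure 3 caption.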

What carries the argument

The spectral characterization of the updated weights as a spiked random matrix, with outlier count controlled by the ratio of the two step-size exponents.

If this is right

  • The number of learned directions is the integer part of alpha2 / (1/2 - alpha1), so it grows as alpha2 increases or as alpha1 approaches 1/2.
  • Batch reuse produces a qualitative improvement over independent batches by unlocking directions with information exponent larger than one.
  • Early training dynamics in overparameterized networks admit a precise spectral description once the linear-width scaling and step-size powers are fixed.
  • The same scaling regime supplies a tractable limit for studying how optimization moves from random initialization to feature alignment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Extending the same analysis to three or more steps would predict how many additional outliers appear under continued power-law step sizes.
  • The batch-reuse distinction may generalize to deeper architectures if analogous step-size scalings are applied layer-wise.
  • Finite-width simulations with moderate proportionality constants could directly count outliers and test whether the floor formula remains predictive before the asymptotic limit.

Load-bearing premise

The derivation assumes the linear-width regime in which hidden neurons, sample size, and input dimension scale proportionally, together with power-law step-size scalings for the two updates.
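To make the premise concrete, here is a minimal sketch of how an experiment could be parameterized so that all three dimensions grow in proportion to N and the step sizes follow the stated power laws. The default ratios n/N = 2 and d/N = 0.7 are read off the Figure 3 caption (n = 40 × 10^3, dX = 14 × 10^3, N = 20 × 10^3); the helper name `linear_width_config` and the constants `c1`, `c2` are assumptions introduced here, since the asymptotic scaling does not fix constant prefactors.

```python
def linear_width_config(N, n_over_N=2.0, d_over_N=0.7,
                        alpha1=0.3, alpha2=0.4, c1=1.0, c2=1.0):
    """Return proportional dimensions and power-law step sizes for width N.

    Hypothetical helper: the ratios default to those implied by the
    Figure 3 caption, and c1, c2 are placeholder constants.
    """
    if not (0.0 <= alpha1 < 0.5 and 0.0 <= alpha2 < 0.5):
        raise ValueError("alpha1 and alpha2 must lie in [0, 0.5)")
    return {
        "N": N,                      # hidden width
        "n": round(n_over_N * N),    # sample size, proportional to N
        "d": round(d_over_N * N),    # input dimension, proportional to N
        "eta1": c1 * N ** alpha1,    # first step size, of order N^alpha1
        "eta2": c2 * N ** alpha2,    # second step size, of order N^alpha2
    }


if __name__ == "__main__":
    print(linear_width_config(20_000))
```

Doubling N doubles n and d while the step sizes grow only polynomially, which is the sense in which the regime is "linear-width".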

What would settle it

Compute the eigenvalues of the weight matrix after exactly two scaled gradient steps on synthetic data with proportional dimensions and verify whether the number of large outliers matches the floor formula while their alignment with the target differs between reused and independent batches.
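The check described above can be sketched in NumPy as follows. Everything about the model here is an assumption made for illustration: a network f(x) = a · tanh(W x) with a frozen second layer, squared loss, Gaussian inputs, a single-index target y = H3(⟨x, beta⟩) as in the Figure 3 caption, and step sizes exactly N^alpha1 and N^alpha2 with unit constants. The paper's own normalization may differ, so this outlines the shape of the experiment and should not be expected to reproduce the predicted outlier count at these small sizes.

```python
import numpy as np


def two_step_weights(n=400, d=140, N=200, alpha1=0.3, alpha2=0.4,
                     reuse_batch=True, seed=0):
    """Two full-batch gradient steps on the first-layer weights W.

    Assumed setup (not taken from the paper): f(x) = a . tanh(W x),
    loss (1/2n) * sum of squared residuals, second layer a kept fixed.
    """
    rng = np.random.default_rng(seed)
    beta = rng.standard_normal(d)
    beta /= np.linalg.norm(beta)                   # unit-norm target direction
    W0 = rng.standard_normal((N, d)) / np.sqrt(d)  # random initialization
    a = rng.standard_normal(N) / np.sqrt(N)

    def sample_batch():
        X = rng.standard_normal((n, d))
        z = X @ beta
        return X, z ** 3 - 3.0 * z                 # H3(z), third Hermite polynomial

    def grad(W, X, y):
        act = np.tanh(X @ W.T)                     # (n, N) hidden activations
        resid = act @ a - y                        # (n,) prediction errors
        back = resid[:, None] * (1.0 - act ** 2) * a[None, :]
        return back.T @ X / len(y)                 # (N, d) gradient w.r.t. W

    X1, y1 = sample_batch()
    W1 = W0 - N ** alpha1 * grad(W0, X1, y1)
    # Reused batch: same (X1, y1); independent batch: a fresh draw.
    X2, y2 = (X1, y1) if reuse_batch else sample_batch()
    W2 = W1 - N ** alpha2 * grad(W1, X2, y2)
    return W0, W1, W2, beta


def top_singular_values_and_alignment(W, beta, k=5):
    """Top-k singular values and |<right singular vector, beta>| for each."""
    _, s, Vt = np.linalg.svd(W, full_matrices=False)
    return s[:k], np.abs(Vt[:k] @ beta)


if __name__ == "__main__":
    for reuse in (True, False):
        _, _, W2, beta = two_step_weights(reuse_batch=reuse)
        s, align = top_singular_values_and_alignment(W2, beta)
        print("reused" if reuse else "independent", s, align)
```

Comparing the two printed rows is the experiment in miniature: the singular values indicate how many values separate from the bulk, and the alignments indicate whether the corresponding right singular vectors point toward the target direction.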

Figures

Figures reproduced from arXiv: 2605.17767 by Behrad Moniri, Hamed Hassani.

Figure 1: Histogram of the singular values of the weight matrix
Figure 2: The number of different directions Λ(α1, α2) learned by the second step of gradient descent, as a function of α1, α2 ∈ [0, 0.5). This theorem shows that after the second step of gradient descent update, the weights are in the Bulk+Spikes phase of training from [MM21], and that the dynamics closely match the early phase training dynamics empirically characterized in prior work (see also [MPM21, WES+23]).
Figure 3: The histogram of the singular values of W1 and W2 with α2 = 0.4, α1 = 0.3. In this setup, we consider the reused batch setting and set M = 1 with g1(z) = H3(z), σ(z) = tanh(z), n = 40 × 10^3, dX = 14 × 10^3 and N = 20 × 10^3. The vertical dashed lines correspond to the outliers and the vertical blue line is the top non-outlying singular value. As predicted by Theorem 4, the spectrum of W2 contains two outly…
Figure 4: The alignment of right singular vectors of the weight matrix, with the target direction in …
Figure 5: Inner product alignment β⋆^⊤ β̂_{2;2} as a function of dX/ξ2n. Simulation 2. In this setting, we consider n = 6 × 10^3 and vary dX between 600 and 12 × 10^3. We set M = 1 with y = H3(Xβ⋆). We compute β⋆^⊤ β̂_{2;2} and average it for each value of dX over 100 trials. We compare the simulation results with the theoretical predictions of Theorem 6.
Abstract

We study feature learning in two-layer neural networks within the linear-width regime, where the number of hidden neurons, sample size, and input dimension scale proportionally. While recent work has analyzed feature learning via a single step of gradient descent, such updates are fundamentally limited: they are approximately rank-one, capturing only a single direction, and require the target function to have an information exponent of one. In this paper, we go beyond one-step updates to provide a full characterization of the features learned during the second step of gradient descent with step-sizes $\eta_1 \asymp N^{\alpha_1}$ and $\eta_2 \asymp N^{\alpha_2}$ for $\alpha_1, \alpha_2 \in [0,0.5)$. We derive a sharp spectral characterization of the updated weights, demonstrating they behave as a spiked random matrix with multiple outliers, each corresponding to a learned direction. We show that the number of these outliers is determined by the scaling parameters $\alpha_1$ and $\alpha_2$ through $\lfloor \frac{\alpha_2}{1/2 - \alpha_1} \rfloor$. Furthermore, by analyzing the alignment between these learned directions and the target function, we identify a qualitative gap between training with independent versus reused batches. While independent batches restrict learning to directions with an information exponent of one, batch reuse enables the second update to capture directions even when the information exponent exceeds one, under the condition that $\alpha_1, \alpha_2$ are chosen properly. This confirms that the benefits of batch reuse, previously observed in finite-width regimes, persist in the high-dimensional linear-width limit. By characterizing these early-phase spectral transitions, our work establishes a tractable mathematical framework for studying optimization and feature learning phenomenology in modern overparameterized networks.
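The abstract relies on the information exponent without defining it. In the usage that is standard in this literature (an assumption about the paper's convention, which this page does not spell out), it is the smallest k ≥ 1 for which the Hermite coefficient E[g(z) He_k(z)] with z ~ N(0, 1) is nonzero. The sketch below estimates it numerically with Gauss-Hermite quadrature; the function names, the truncation degree, and the tolerance are illustrative choices.

```python
import numpy as np
from numpy.polynomial import hermite_e as He


def hermite_coefficients(g, max_degree=6, quad_points=40):
    """Approximate c_k = E[g(z) He_k(z)], z ~ N(0, 1), for k = 0..max_degree.

    Uses probabilists' Gauss-Hermite quadrature, whose weights integrate
    against exp(-z^2 / 2) and are rescaled here to a probability measure.
    """
    x, w = He.hermegauss(quad_points)
    w = w / np.sqrt(2.0 * np.pi)
    coeffs = []
    for k in range(max_degree + 1):
        basis = np.zeros(k + 1)
        basis[k] = 1.0                        # selects He_k in the series
        coeffs.append(np.sum(w * g(x) * He.hermeval(x, basis)))
    return np.array(coeffs)


def information_exponent(g, max_degree=6, tol=1e-8):
    """Smallest k >= 1 with |c_k| > tol, or None if none is found."""
    c = hermite_coefficients(g, max_degree)
    for k in range(1, max_degree + 1):
        if abs(c[k]) > tol:
            return k
    return None


if __name__ == "__main__":
    print(information_exponent(np.tanh))                   # 1
    print(information_exponent(lambda z: z**3 - 3.0 * z))  # 3, i.e. H3
```

The target H3 used in the figure captions has information exponent three, so it falls in the case where, according to the abstract, independent batches do not suffice and batch reuse can help.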

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper examines feature learning in two-layer neural networks in the linear-width regime, where hidden neurons, sample size, and input dimension scale proportionally. It characterizes the weights after a second gradient descent step with step sizes scaling as N^alpha1 and N^alpha2 (alpha1, alpha2 in [0, 0.5)), showing that the updated weights act as a spiked random matrix whose number of outliers is floor(alpha2 / (1/2 - alpha1)). It further identifies a qualitative difference in alignment with the target function depending on whether batches are independent or reused, with reuse enabling capture of directions having information exponent greater than one under suitable choices of alpha1 and alpha2.

Significance. If the asymptotic spectral results hold, the work supplies a precise mathematical framework for early-phase feature learning beyond single-step updates, with an explicit dependence of the number of learned directions on the step-size exponents and a clear distinction for batch reuse. This extends prior one-step analyses in a controlled high-dimensional limit and could inform understanding of optimization phenomenology in overparameterized networks.

major comments (1)
  1. The central spectral characterization and the explicit outlier count floor(alpha2 / (1/2 - alpha1)) are load-bearing for the main claims, yet the abstract and stated results leave the precise perturbation analysis and random-matrix derivation implicit; a dedicated section or appendix deriving this count from the linear-width scaling and the two-step update would strengthen verifiability.
minor comments (2)
  1. Clarify the precise definition of the information exponent early in the introduction, as its usage in the batch-reuse comparison is central to the qualitative gap claimed.
  2. The step-size regime alpha1, alpha2 in [0, 0.5) is stated without discussion of boundary behavior at 0.5; a brief remark on why the upper limit is strict would aid readability.
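On the second minor comment, plain arithmetic on the stated formula already shows how the count behaves near the excluded endpoint. The block below is only that arithmetic; the symbol K is shorthand introduced here, and nothing in it is the paper's own explanation for why the interval is half-open.

```latex
% K is shorthand (introduced here) for the outlier count in the abstract.
K(\alpha_1,\alpha_2) = \left\lfloor \frac{\alpha_2}{1/2-\alpha_1} \right\rfloor,
\qquad \alpha_1,\alpha_2 \in [0, 1/2).

% Fix \alpha_2 = 0.4 and let \alpha_1 increase toward 1/2:
K(0.3, 0.4)  = \lfloor 0.4/0.2  \rfloor = 2, \quad
K(0.4, 0.4)  = \lfloor 0.4/0.1  \rfloor = 4, \quad
K(0.45, 0.4) = \lfloor 0.4/0.05 \rfloor = 8, \quad
K(0.49, 0.4) = \lfloor 0.4/0.01 \rfloor = 40.

% For any fixed \alpha_2 > 0 the denominator vanishes at the endpoint:
\lim_{\alpha_1 \uparrow 1/2} K(\alpha_1,\alpha_2) = \infty .

% At the other extreme, K \ge 1 exactly when \alpha_1 + \alpha_2 \ge 1/2.
```

So the formula stays finite on [0, 0.5) but is unbounded as alpha1 approaches 0.5, and it evaluates to zero whenever alpha1 + alpha2 < 1/2.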

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their positive evaluation and constructive feedback. We address the single major comment below.

Point-by-point responses
  1. Referee: The central spectral characterization and the explicit outlier count floor(alpha2 / (1/2 - alpha1)) are load-bearing for the main claims, yet the abstract and stated results leave the precise perturbation analysis and random-matrix derivation implicit; a dedicated section or appendix deriving this count from the linear-width scaling and the two-step update would strengthen verifiability.

    Authors: We agree that the perturbation analysis underlying the outlier count is central and that its current presentation could be made more self-contained for verifiability. In the revised manuscript we will add a dedicated subsection (placed after the statement of the main spectral result) that derives the floor(alpha2 / (1/2 - alpha1)) count explicitly from the linear-width scaling, the two-step gradient update, and the associated spiked random-matrix perturbation. The derivation will collect the key intermediate lemmas on the covariance structure and eigenvalue perturbation that are currently distributed across the proofs.
    revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained asymptotic analysis

full rationale

The paper derives the spiked random matrix behavior and outlier count floor(alpha2 / (1/2 - alpha1)) via perturbation analysis and high-dimensional limits in the linear-width regime, with step-size scalings eta1 ~ N^alpha1 and eta2 ~ N^alpha2. This is a mathematical characterization from random matrix theory applied to the two-step gradient updates, not a fit to data or a quantity defined circularly from the outputs. The batch-reuse vs. independent-batch distinction follows from analyzing alignments with the target function under the stated scalings, yielding independent content on information exponents >1. No load-bearing self-citations, self-definitional steps, or renamed known results appear in the central claims; the analysis is externally grounded in asymptotic techniques rather than reducing to the paper's own inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claims rest on the linear-width scaling assumption and the specific power-law step-size regimes; these are domain assumptions standard in high-dimensional neural-network theory rather than new entities or fitted constants.

free parameters (1)
  • alpha1 and alpha2
    Exponents in [0, 0.5) that set the scaling of the two step sizes; they directly determine the number of outliers via the floor formula.
axioms (1)
  • domain assumption Linear-width regime: hidden neurons, sample size, and input dimension all scale proportionally with N.
    Invoked throughout the abstract as the setting in which the spectral characterization holds.



