On the Peaking Phenomenon of the Lasso in Model Selection
abstract
I briefly report on some unexpected results that I obtained when optimizing the model parameters of the Lasso. In simulations with a varying observations-to-variables ratio n/p, I typically observe a strong peak in the test error curve at the transition point n/p = 1. This peaking phenomenon is well documented in scenarios that involve the inversion of the sample covariance matrix, and as I illustrate in this note, it is also the source of the peak for the Lasso. The key problem is the parametrization of the Lasso penalty (as used, e.g., in the current R package lars), and I present a solution in terms of a normalized Lasso parameter.
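The peak at n/p = 1 can be illustrated with a small simulation. The following is only a minimal sketch and not the note's own experiment: it uses minimum-norm least squares (a pseudo-inverse, i.e. the covariance-inversion setting the abstract refers to) rather than the Lasso, and the function name `peaking_curve`, the dimension p = 50, the Gaussian design, and the noise level are all illustrative assumptions.

```python
import numpy as np

def peaking_curve(p=50, n_values=(10, 25, 50, 100, 200),
                  n_trials=20, noise_sd=1.0, seed=0):
    """Average excess test error of minimum-norm least squares for several n.

    Assumed setup (not from the note): standard Gaussian covariates and a
    fixed random coefficient vector. For isotropic covariates the excess
    test error equals the squared distance ||beta_hat - beta||^2.
    """
    rng = np.random.default_rng(seed)
    beta = rng.standard_normal(p) / np.sqrt(p)  # true coefficients, norm about 1
    errors = {}
    for n in n_values:
        trial_errors = []
        for _ in range(n_trials):
            X = rng.standard_normal((n, p))
            y = X @ beta + noise_sd * rng.standard_normal(n)
            # The pseudo-inverse gives the least-squares solution for n > p
            # and the minimum-norm interpolating solution for n <= p.
            beta_hat = np.linalg.pinv(X) @ y
            trial_errors.append(np.sum((beta_hat - beta) ** 2))
        errors[n] = float(np.mean(trial_errors))
    return errors

curve = peaking_curve()
for n, err in curve.items():
    print(f"n/p = {n / 50:4.1f}   excess test error = {err:10.3f}")
```

Under these assumptions the error at n = p should be far larger than on either side, because the smallest singular value of a square Gaussian design is close to zero and its inverse dominates the estimate.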
Representative citing papers
Feature Learning in Linear-Width Two-Layer Networks: Two vs. One Step of Gradient Descent
Two steps of gradient descent on first-layer weights in linear-width two-layer networks produce a spiked random matrix with floor(alpha2/(1/2-alpha1)) outliers, each a learned direction, and batch reuse allows capturing directions with information exponent exceeding one.