Two steps of gradient descent on first-layer weights in linear-width two-layer networks produce a spiked random matrix with floor(alpha2/(1/2-alpha1)) outliers, each a learned direction, and batch reuse allows capturing directions with information exponent exceeding one.
arXiv preprint arXiv:2201.04753
3 Pith papers cite this work.
citing papers explorer
- Feature Learning in Linear-Width Two-Layer Networks: Two vs. One Step of Gradient Descent
  Two steps of gradient descent on first-layer weights in linear-width two-layer networks produce a spiked random matrix with floor(alpha2/(1/2-alpha1)) outliers, each a learned direction, and batch reuse allows capturing directions with information exponent exceeding one.
- Spectral Perturbation of the Empirical Fisher Information Matrix under Weight Quantization
  Derives Weyl-based perturbation bounds showing quantization increases the dominant eigenvalue of the empirical FIM up to higher-order terms, with supporting measurements on language models.
- Bayesian Inference with Shaped Deep Non-linear MLPs
  In the LP/N = Θ(1) regime, Bayesian predictive posteriors for deep MLPs equal those of data-dependent kernels to first order, with a criterion identifying data processes that benefit from larger effective depth.