Two steps of gradient descent on first-layer weights in linear-width two-layer networks produce a spiked random matrix with floor(alpha2/(1/2-alpha1)) outliers, each a learned direction, and batch reuse allows capturing directions with information exponent exceeding one.
arXiv preprint arXiv:2407.04600
3 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles: background 1
polarities: background 1
representative citing papers
FLAME condenses ensemble diversity into a single network via modular ensemble simulation and guided mutual learning during training, delivering ensemble-level performance with single-network inference speed on sequential recommendation tasks.
In ridgeless regression with low intrinsic dimension, discrepancy between weak and strong models reduces W2S generalization variance by dim(V_s)/N in the discrepant subspace while inheriting it in the overlap.
citing papers explorer
- Feature Learning in Linear-Width Two-Layer Networks: Two vs. One Step of Gradient Descent
  Two steps of gradient descent on first-layer weights in linear-width two-layer networks produce a spiked random matrix with floor(alpha2/(1/2-alpha1)) outliers, each a learned direction, and batch reuse allows capturing directions with information exponent exceeding one.
- FLAME: Condensing Ensemble Diversity into a Single Network for Efficient Sequential Recommendation
  FLAME condenses ensemble diversity into a single network via modular ensemble simulation and guided mutual learning during training, delivering ensemble-level performance with single-network inference speed on sequential recommendation tasks.
- Discrepancies are Virtue: Weak-to-Strong Generalization through Lens of Intrinsic Dimension
  In ridgeless regression with low intrinsic dimension, discrepancy between weak and strong models reduces W2S generalization variance by dim(V_s)/N in the discrepant subspace while inheriting it in the overlap.