Emergence in non-neural models: grokking modular arithmetic via average gradient outer product
4 Pith papers cite this work.
citing papers explorer
- Feature Learning in Linear-Width Two-Layer Networks: Two vs. One Step of Gradient Descent
  Two steps of gradient descent on first-layer weights in linear-width two-layer networks produce a spiked random matrix with floor(alpha2/(1/2-alpha1)) outliers, each a learned direction, and batch reuse allows capturing directions with information exponent exceeding one.
- The Geometric Structure of Models Learning Sparse Data
  Normal alignment is the rank-one Jacobian structure that lets classifiers minimize loss and maximize local robustness in sparse regimes; the paper proves its optimality and uses it to create GrokAlign and RFAMs.
- xRFM: Accurate, scalable, and interpretable feature learning models for tabular data
  xRFM merges kernel-based feature learning with tree structures for scalable, interpretable tabular modeling and reports top performance on 100 regression and competitive results on 200 classification datasets versus 31 baselines including GBDTs and TabPFNv2.
- Unlearning with Asymmetric Sources: Improved Unlearning-Utility Trade-off with Public Data