Emergence in non-neural models: grokking modular arithmetic via average gradient outer product

Emergence in non-neural models: grokking modular arithmetic via average gradient outer product · 2025 · arXiv 2407.20199

4 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 1 · method: 1

citation-polarity summary

background: 1 · use method: 1

representative citing papers

Feature Learning in Linear-Width Two-Layer Networks: Two vs. One Step of Gradient Descent

stat.ML · 2026-05-18 · unverdicted · novelty 7.0 · 2 refs

Two steps of gradient descent on first-layer weights in linear-width two-layer networks produce a spiked random matrix with floor(alpha2/(1/2-alpha1)) outliers, each a learned direction, and batch reuse allows capturing directions with information exponent exceeding one.

The Geometric Structure of Models Learning Sparse Data

cs.LG · 2026-05-08 · unverdicted · novelty 7.0 · 2 refs

Normal alignment is the rank-one Jacobian structure that lets classifiers minimize loss and maximize local robustness in sparse regimes; the paper proves its optimality and uses it to create GrokAlign and RFAMs.

xRFM: Accurate, scalable, and interpretable feature learning models for tabular data

cs.LG · 2025-08-12 · unverdicted · novelty 6.0

xRFM merges kernel-based feature learning with tree structures for scalable, interpretable tabular modeling and reports top performance on 100 regression and competitive results on 200 classification datasets versus 31 baselines including GBDTs and TabPFNv2.

Unlearning with Asymmetric Sources: Improved Unlearning-Utility Trade-off with Public Data

cs.LG · 2026-05-11

citing papers explorer

Showing 4 of 4 citing papers.

Feature Learning in Linear-Width Two-Layer Networks: Two vs. One Step of Gradient Descent
stat.ML · 2026-05-18 · unverdicted · none · ref 191 · 2 links
The Geometric Structure of Models Learning Sparse Data
cs.LG · 2026-05-08 · unverdicted · none · ref 36 · 2 links
xRFM: Accurate, scalable, and interpretable feature learning models for tabular data
cs.LG · 2025-08-12 · unverdicted · none · ref 27
Unlearning with Asymmetric Sources: Improved Unlearning-Utility Trade-off with Public Data
cs.LG · 2026-05-11 · unreviewed · ref 3
