EGD equalizes gradient speeds across singular directions, eliminating or shortening grokking plateaus on modular addition and sparse parity problems.
David Saad and Sara A Solla
4 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.LG 4representative citing papers
TRM with 7M parameters achieves 45% accuracy on ARC-AGI-1 and 8% on ARC-AGI-2, surpassing most LLMs with under 0.01% of their parameters.
One training example via RLVR boosts LLM math reasoning from 17.6% to 35.7% average across six benchmarks.
A residual from Hankel DMD on Wasserstein-mapped training distributions localizes grokking transitions in modular-addition Transformers with AUROC 0.93 and can precede onset under a sustained-threshold rule.
citing papers explorer
-
Egalitarian Gradient Descent: A Simple Approach to Accelerated Grokking
EGD equalizes gradient speeds across singular directions, eliminating or shortening grokking plateaus on modular addition and sparse parity problems.
-
Less is More: Recursive Reasoning with Tiny Networks
TRM with 7M parameters achieves 45% accuracy on ARC-AGI-1 and 8% on ARC-AGI-2, surpassing most LLMs with under 0.01% of their parameters.
-
Reinforcement Learning for Reasoning in Large Language Models with One Training Example
One training example via RLVR boosts LLM math reasoning from 17.6% to 35.7% average across six benchmarks.
-
Distributional Spectral Diagnostics for Localizing Grokking Transitions
A residual from Hankel DMD on Wasserstein-mapped training distributions localizes grokking transitions in modular-addition Transformers with AUROC 0.93 and can precede onset under a sustained-threshold rule.