A stochastic-geometric model of solution-space topology under Adam derives explicit scaling laws for grokking transition time as a function of learning rate, batch size, and L2 coefficient.
Proceedings of the Forty-Third International Conference on Machine Learning , year =
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
years
2026 2verdicts
UNVERDICTED 2representative citing papers
Larger models succeed on rare and complex tasks by reducing gradient interference from common tasks, allowing rare-task features to accumulate, as shown via synthetic task mixtures and OLMo pretraining from 4M to 4B parameters.
citing papers explorer
-
A Stochastic--Geometric Theory of Scaling Laws in Grokking
A stochastic-geometric model of solution-space topology under Adam derives explicit scaling laws for grokking transition time as a function of learning rate, batch size, and L2 coefficient.
-
Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention
Larger models succeed on rare and complex tasks by reducing gradient interference from common tasks, allowing rare-task features to accumulate, as shown via synthetic task mixtures and OLMo pretraining from 4M to 4B parameters.