The Geometric Structure of Models Learning Sparse Data

Ahmed Imtiaz Humayun; Randall Balestriero; Richard Baraniuk; Thomas Walker; T. Mitchell Roddenberry

arxiv: 2605.08464 · v2 · pith:DRDBSOJUnew · submitted 2026-05-08 · 💻 cs.LG

Thomas Walker, T. Mitchell Roddenberry, Ahmed Imtiaz Humayun, Randall Balestriero, Richard Baraniuk

Pith reviewed 2026-05-19 17:56 UTC · model grok-4.3

classification 💻 cs.LG

keywords: sparse regime · normal alignment · Jacobian alignment · local robustness · grokking · piecewise-affine networks · power diagram · adversarial robustness

The pith

Models succeed on sparse data by making their input-output Jacobians rank-one and perfectly aligned with each training point.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

In data regimes too sparse for the manifold hypothesis to hold, models still learn by using a specific local geometry the authors call normal alignment. The paper proves that classifiers whose input-output Jacobians are exactly rank-one and point directly along the training vectors minimize the training objective under norm constraints while also achieving the greatest possible local robustness as long as the Jacobian remains non-zero. For continuous piecewise-affine networks, this same alignment appears geometrically as centroid alignment inside the power-diagram partitions induced by the network and arises during the feature-learning phase. The authors then introduce GrokAlign, a regularizer that forces normal alignment, and show it speeds up the training dynamics associated with grokking. They further apply the same principle to create Recursive Feature Alignment Machines that improve adversarial robustness on tabular data relative to standard recursive feature machines.

Core claim

Normal-aligned classifiers—those whose input-output Jacobians are rank-one and align perfectly with the training data—minimize the training objective under norm constraints and achieve maximal local robustness under a non-zero Jacobian constraint. In continuous piecewise-affine deep networks, normal alignment manifests as centroid alignment within the network's induced power diagram partition and results from the feature-learning regime.

What carries the argument

Normal alignment: the property that a classifier's input-output Jacobian is rank-one and aligns exactly with each training data vector.
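One way to make this definition measurable is a score that equals 1 exactly when a Jacobian is rank-one and aligned with a data point. The sketch below uses the ratio ‖J x̂‖₂ / ‖J‖_F, where x̂ = x / ‖x‖₂. By the Cauchy-Schwarz inequality applied row by row, this ratio is at most 1, with equality only when every row of J is a multiple of x, that is, J = c xᵀ. The function name `alignment_score` and the choice of this particular ratio are assumptions made for illustration; the page does not state which alignment metric the paper uses.

```python
import math

def alignment_score(jacobian, x):
    """Return ||J x_hat||_2 / ||J||_F, a value in [0, 1].

    Illustrative score (an assumption, not taken from the paper): it equals 1
    exactly when every row of J is a multiple of x, i.e. J = c x^T.
    """
    x_norm = math.sqrt(sum(v * v for v in x))
    frob = math.sqrt(sum(v * v for row in jacobian for v in row))
    if x_norm == 0.0 or frob == 0.0:
        raise ValueError("x and J must both be non-zero")
    x_hat = [v / x_norm for v in x]
    # Response of each output coordinate to a unit step along x.
    response = [sum(r * v for r, v in zip(row, x_hat)) for row in jacobian]
    return math.sqrt(sum(v * v for v in response)) / frob

x = [3.0, 4.0]
c = [1.0, -2.0, 0.5]                                # one coefficient per output
aligned = [[ci * xj for xj in x] for ci in c]       # J = c x^T, rank one
misaligned = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]   # rank two

print(alignment_score(aligned, x))
print(alignment_score(misaligned, x))
```

The aligned Jacobian scores 1 up to floating-point rounding, while the rank-two example scores strictly less than 1.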

If this is right

Normal-aligned classifiers minimize the training objective when subject to norm constraints.
They achieve maximal local robustness whenever the Jacobian is required to be non-zero.
In continuous piecewise-affine networks, normal alignment appears as centroid alignment inside the induced power diagram partitions.
Regularization that induces normal alignment accelerates the training dynamics observed in grokking.
Recursive Feature Alignment Machines display greater adversarial robustness than standard recursive feature machines on tabular data.
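To illustrate what a regularizer that induces normal alignment could look like, the sketch below defines a penalty that is zero exactly when the Jacobian has the form c xᵀ and lowers it by gradient descent. This page does not give GrokAlign's formula, so the penalty `misalignment_penalty`, the linear model f(x) = W x (whose Jacobian is W), the finite-difference gradient, and the step size are all stand-in assumptions, not the authors' method.

```python
def misalignment_penalty(w, x):
    """1 - ||W x||^2 / (||W||_F^2 ||x||^2); zero exactly when W = c x^T.

    Hypothetical stand-in penalty, not the GrokAlign loss from the paper.
    """
    x_sq = sum(v * v for v in x)
    frob_sq = sum(v * v for row in w for v in row)
    resp_sq = sum(sum(r * v for r, v in zip(row, x)) ** 2 for row in w)
    return 1.0 - resp_sq / (frob_sq * x_sq)

def numerical_grad(w, x, eps=1e-6):
    """Central finite-difference gradient of the penalty with respect to W."""
    grad = []
    for i, row in enumerate(w):
        grad_row = []
        for j in range(len(row)):
            hi = [r[:] for r in w]
            lo = [r[:] for r in w]
            hi[i][j] += eps
            lo[i][j] -= eps
            diff = misalignment_penalty(hi, x) - misalignment_penalty(lo, x)
            grad_row.append(diff / (2 * eps))
        grad.append(grad_row)
    return grad

x = [1.0, 0.0]
w = [[1.0, 1.0], [0.5, -1.0]]   # Jacobian of the linear map f(x) = W x
before = misalignment_penalty(w, x)
for _ in range(200):
    g = numerical_grad(w, x)
    # Plain gradient descent on the penalty alone, step size 0.1.
    w = [[wij - 0.1 * gij for wij, gij in zip(w_row, g_row)]
         for w_row, g_row in zip(w, g)]
after = misalignment_penalty(w, x)
print(before, after)
```

In practice such a term would be added to the task loss with a weighting coefficient; it is minimized alone here only to show that descent drives the rows of W toward the direction of x.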

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The geometric mechanism may connect the sudden drop in loss during grokking to the emergence of aligned Jacobians rather than to memorization alone.
The same alignment principle could be tested as an explanation for robustness gains in other sparse or high-dimensional tabular and image settings.
Feature-learning phases in deep networks may be preferred because they naturally generate the rank-one aligned geometry that supports both accuracy and robustness.

Load-bearing premise

Success in the sparse regime is explained by normal alignment rather than by other mechanisms, and this alignment arises specifically from the feature-learning regime in continuous piecewise-affine networks.

What would settle it

A concrete counterexample would be a model trained on sparse data that reaches low training loss and high local robustness while its input-output Jacobians remain either higher-rank or misaligned with the training points.
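Checking for such a counterexample needs a concrete robustness measure. A standard choice for a classifier that is affine near x is the smallest L2 perturbation that changes the predicted class: for top class a and competitor b, that distance is (f_a(x) − f_b(x)) / ‖J_a − J_b‖₂, where J_k is row k of the Jacobian. The sketch below computes the minimum over competitors. Using this radius is an assumption of the example, since the paper's own robustness definition is not reproduced on this page, and the value is exact only while the perturbed point stays in the same affine region.

```python
import math

def local_l2_radius(logits, jacobian):
    """Smallest L2 perturbation that changes the argmax of an affine classifier.

    Assumes f(x + d) = f(x) + J d near x (exact only inside one affine region).
    """
    top = max(range(len(logits)), key=lambda k: logits[k])
    radius = math.inf
    for k, row in enumerate(jacobian):
        if k == top:
            continue
        diff = [a - b for a, b in zip(jacobian[top], row)]
        diff_norm = math.sqrt(sum(v * v for v in diff))
        if diff_norm == 0.0:
            continue  # the gap to class k is constant under this affine model
        radius = min(radius, (logits[top] - logits[k]) / diff_norm)
    return radius

logits = [2.0, 0.0]
jac = [[1.0, 0.0], [-1.0, 0.0]]
print(local_l2_radius(logits, jac))  # 1.0
```

A candidate counterexample would pair a large radius of this kind at the training points with Jacobians that are higher-rank or poorly aligned with those points.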

Figures

Figures reproduced from arXiv: 2605.08464 by Ahmed Imtiaz Humayun, Randall Balestriero, Richard Baraniuk, Thomas Walker, T. Mitchell Roddenberry.

**Figure 1.** Dataset sparsity is a property of the dataset and model. In the left panel, we monitor the normal alignment during deep network training on a subset of MNIST across varying intensities of data augmentation. In the right panel, we train wide residual deep network architectures [19] robustly on different subset sizes of CIFAR10. At the end of training, we monitor the models’ normal alignment. For more experi…

**Figure 2.** A one-hidden-layer transformer training on modular arithmetic exhibits centroid alignment. Here we train a one-layer transformer on a modular arithmetic task. On the left, we show the model’s accuracy on the training and held-out test sets. On the right, we show the centroid alignment between the map from the embedding and the logits of the last token in the context. For more experimental details, see Sect…

**Figure 3.** A deep network with one hidden layer has the capacity to learn a normal-aligned solution for any training set. As the density of the dataset size increases, the irregularity of the deep network — measured by weight norm — increases. In the first and second panels (respectively, third and fourth panels), we depict a training set of size 5 (respectively, 10) along with the level sets of the neurons of a one-…

**Figure 4.** Centroid alignment increases for deep layers of deep networks. Here we obtain robust ResNet18 and ResNet50 models trained on CIFAR10 [18], and consider the centroid alignment of the map from the input space of intermediate layers to the output space.

**Figure 5.** A Gaussian kernel logistic regression model exhibits normal alignment, validating Theorem 1. Here we train a Gaussian kernel logistic regression model on a ten-dimensional classification problem with either six, eight, or ten classes. In the left panel, we monitor the model’s test accuracy. In the middle panel, we monitor the model’s normal alignment. In the right panel, we monitor the model’s effective ra…

**Figure 6.** Optimal classifiers learn solutions with input-output Jacobians that are non-zero. Here, we train a fully connected deep network on a subset of MNIST with 1000 examples across 100 epochs. During training, PGD attacks are applied to the batches, a weight decay of 0.0001 is used, and a Frobenius norm penalty is applied to the loss function with weight γ. In the first panel, we report the model’s accuracy on …

**Figure 7.** GrokAlign is the most effective regularization strategy for inducing normal alignment in deep networks. We compare the regularization strategies of …

Original abstract

The manifold hypothesis (MH) is often used to explain how machine learning can overcome the curse of dimensionality. However, the MH is only applicable in regimes where the training data provides a sufficiently dense sample of the underlying low-dimensional data manifold, or where such a low-dimensional manifold is conceivably present. We describe the regimes where the MH is not applicable as sparse. In this paper, we demonstrate that models succeed in the sparse regime by exploiting a highly structured local geometry, a property we formalize as normal alignment. We prove that normal-aligned classifiers -- whose input-output Jacobians are rank-one and align perfectly with the training data -- minimize the training objective under norm constraints and achieve maximal local robustness under a non-zero Jacobian constraint. For continuous piecewise-affine deep networks, normal alignment manifests geometrically as centroid alignment within the network's induced power diagram partition and results from the feature-learning regime. Motivated by these theoretical insights, we introduce GrokAlign, a regularization strategy that actively induces normal alignment. We demonstrate that GrokAlign significantly accelerates the training dynamics of deep networks relevant to the grokking phenomenon. Furthermore, we apply the principle of normal alignment to Recursive Feature Machines (RFMs) to introduce Recursive Feature Alignment Machines (RFAMs). We show that RFAMs exhibit greater adversarial robustness compared to RFMs when trained on tabular data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper defines normal alignment with clean optimality proofs and turns it into two usable methods, but the claim that this geometry drives sparse-regime success remains more assumed than demonstrated.

The main takeaway is that the authors formalize normal alignment—rank-one input-output Jacobians that line up with the data—and prove it minimizes the training loss under norm constraints while maximizing local robustness when the Jacobian is nonzero. For continuous piecewise-affine networks they tie this to centroid alignment in the induced power-diagram partition that arises during feature learning. They then build GrokAlign, a regularizer that encourages the property and speeds up grokking dynamics, plus RFAMs that improve adversarial robustness over standard RFMs on tabular data.

Referee Report

2 major / 2 minor

Summary. The paper claims that in sparse regimes where the manifold hypothesis fails due to insufficient data density, machine learning models succeed by exploiting 'normal alignment', a geometric property where the input-output Jacobian is rank-one and aligns with the training data. The authors provide proofs that such normal-aligned classifiers minimize the training objective under norm constraints and achieve maximal local robustness under a non-zero Jacobian constraint. For continuous piecewise-affine deep networks, this alignment appears as centroid alignment in the power-diagram partition induced by the feature-learning regime. They propose GrokAlign, a regularization technique to induce normal alignment that accelerates training in grokking contexts, and Recursive Feature Alignment Machines (RFAMs) which demonstrate improved adversarial robustness on tabular data compared to Recursive Feature Machines.

Significance. If the theoretical claims hold and normal alignment is shown to be the primary mechanism, this work provides a new geometric framework for understanding learning in sparse data settings, potentially explaining phenomena like grokking and offering practical tools for faster training and better robustness. The proofs of optimality and robustness, along with the empirical applications to GrokAlign and RFAMs, represent potential contributions to the field of geometric deep learning and implicit bias analysis.

major comments (2)

[§1 and §4 (Introduction and Geometric Analysis)] The central claim that normal alignment explains success in the sparse regime (as opposed to other implicit biases or low-rank structures) is load-bearing but rests on an assumption rather than a necessity argument separating it from correlated effects; the power-diagram description in the feature-learning regime is presented as a manifestation without evidence that it is the driving mechanism.
[§3 (Optimality and Robustness Proofs)] The proofs that normal-aligned classifiers minimize the training objective under norm constraints and achieve maximal local robustness under a non-zero Jacobian constraint are stated to exist, but the manuscript must explicitly show that the rank-one alignment condition does not reduce to a definitional property of the chosen constraints or loss; without the full derivation steps, it is unclear whether the maximality follows directly from the stated assumptions.

minor comments (2)

[Abstract and §1] The abstract and introduction should include a quantitative definition or example distinguishing the 'sparse regime' from dense manifold sampling to make the scope of the claims precise.
[§2] Notation for the input-output Jacobian and its rank-one property should be introduced with an equation early in the theoretical section for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments, which help us strengthen the presentation of our results. We address the major comments below, providing clarifications and indicating planned revisions to the manuscript.

Referee: [§1 and §4 (Introduction and Geometric Analysis)] The central claim that normal alignment explains success in the sparse regime (as opposed to other implicit biases or low-rank structures) is load-bearing but rests on an assumption rather than a necessity argument separating it from correlated effects; the power-diagram description in the feature-learning regime is presented as a manifestation without evidence that it is the driving mechanism.

Authors: Our work establishes normal alignment as a key geometric property that enables success in sparse regimes through optimality proofs under norm constraints. To address the referee's concern regarding separation from other implicit biases, we will add a new subsection in the introduction that contrasts normal alignment with low-rank Jacobian structures and other biases, using both theoretical arguments and simple counterexamples where low-rank but non-aligned models fail to achieve the same robustness. Regarding the power-diagram, we will include additional analysis showing that centroid alignment is not merely a byproduct but arises necessarily from the feature-learning dynamics in continuous piecewise-affine networks, supported by the proofs in Section 3.

Revision: partial

Referee: [§3 (Optimality and Robustness Proofs)] The proofs that normal-aligned classifiers minimize the training objective under norm constraints and achieve maximal local robustness under a non-zero Jacobian constraint are stated to exist, but the manuscript must explicitly show that the rank-one alignment condition does not reduce to a definitional property of the chosen constraints or loss; without the full derivation steps, it is unclear whether the maximality follows directly from the stated assumptions.

Authors: We acknowledge the need for greater transparency in the proof details. The current manuscript outlines the key steps, but to demonstrate that the rank-one alignment is not definitional, we will expand Section 3 with complete derivations. Specifically, we will show through intermediate steps that starting from the norm-constrained optimization problem, the optimality condition implies the alignment without assuming it a priori from the loss function. This will include explicit calculations for both the minimization of the training objective and the robustness maximization.

Revision: yes

Circularity Check

0 steps flagged

No circularity: proofs and geometric descriptions are independent of inputs

full rationale

The paper states it proves that normal-aligned classifiers minimize the training objective under norm constraints and achieve maximal local robustness under a non-zero Jacobian constraint. It further describes that for continuous piecewise-affine networks, normal alignment manifests as centroid alignment in the induced power diagram and results from the feature-learning regime. No equations, fitted parameters, or self-citations are shown reducing these claims to definitions by construction or renaming known results. The derivation chain relies on mathematical proofs and geometric formalizations that stand independently of the target explanations for sparse-regime success.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

The central claims rest on the new definition of normal alignment and the assumption that it is the operative mechanism in sparse regimes; no numerical free parameters are mentioned.

axioms (1)

Domain assumption: The manifold hypothesis applies only when training data provides a sufficiently dense sample of the underlying low-dimensional manifold.
Used in the first sentence to demarcate the sparse regime where the new geometry is needed.

invented entities (3)

normal alignment · no independent evidence
purpose: Formalize the highly structured local geometry that models exploit in sparse regimes
Introduced as the key property whose rank-one Jacobian alignment is proved optimal.

GrokAlign · no independent evidence
purpose: Regularization strategy that actively induces normal alignment
Proposed to accelerate training dynamics relevant to grokking.

Recursive Feature Alignment Machines (RFAMs) · no independent evidence
purpose: Variant of RFMs that incorporates normal alignment for improved adversarial robustness
Constructed by applying the normal-alignment principle to existing RFMs.

pith-pipeline@v0.9.0 · 5782 in / 1451 out tokens · 59047 ms · 2026-05-19T17:56:29.297098+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
refines: Relation between the paper passage and the cited Recognition theorem.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · echoes
Paper passage: Theorem 1. ... under the constraint that ∥J_xi∥_F² + ∥b_xi∥₂² ≤ α ... the classifier which minimizes L is such that J_xi = c_i x_i^T ...

IndisputableMonolith/Cost.lean · Jcost_unit0 / Jcost_pos_of_ne_one · echoes
Paper passage: normal-aligned classifiers ... implement the program of an optimal match filter on the training data

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · costAlphaLog_high_calibrated_iff · refines
Paper passage: centroid of a polytope is the row-sum of its Jacobian ... μ_φ(x) = J_x^T 1
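The centroid formula quoted above, μ_φ(x) = J_x^T 1, can be evaluated directly for a small continuous piecewise-affine network. The sketch below builds the Jacobian of a one-hidden-layer ReLU network in the affine region containing x, sums its rows to obtain the centroid, and reports the cosine similarity between the centroid and x. The weights are made-up toy values, and using cosine similarity as the measure of centroid alignment is an assumption of this example rather than the paper's stated metric.

```python
import math

def relu_net_jacobian(w1, b1, w2, x):
    """Jacobian of f(x) = W2 relu(W1 x + b1) + b2 in the affine region containing x.

    Inside one region the ReLU pattern is fixed, so J = W2 diag(mask) W1.
    The output bias b2 does not affect J and is therefore not an argument.
    """
    pre = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(w1, b1)]
    mask = [1.0 if p > 0.0 else 0.0 for p in pre]
    hidden, dim = len(w1), len(x)
    return [
        [sum(w2[k][h] * mask[h] * w1[h][j] for h in range(hidden)) for j in range(dim)]
        for k in range(len(w2))
    ]

def centroid(jacobian):
    """mu = J^T 1, i.e. the sum of the rows of J."""
    return [sum(column) for column in zip(*jacobian)]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy weights chosen for illustration only.
w1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
x = [1.0, 2.0]

jac = relu_net_jacobian(w1, b1, w2, x)
mu = centroid(jac)
print(mu, cosine(mu, x))
```

Here the third hidden unit is inactive at x, both Jacobian rows equal (1, 1), and the centroid is (2, 2), which is close to but not exactly parallel to x.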

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · 2 internal anchors

[1] Joshua B. Tenenbaum, Vin de Silva, and John C. Langford. A Global Geometric Framework For Nonlinear Dimensionality Reduction. Science, 290(5500), 2000.
[2] Sam T. Roweis and Lawrence K. Saul. Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science, 290(5500), 2000.
[3] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. Mixup: Beyond Empirical Risk Minimization. In International Conference on Learning Representations, 2018.
[4] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Young Joon Yoo. CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features. In IEEE/CVF International Conference on Computer Vision, 2019.
[5] Ryo Takahashi, Takashi Matsubara, and Kuniaki Uehara. Data Augmentation Using Random Image Cropping and Patching for Deep CNNs. IEEE Trans. Cir. and Sys. for Video Technol., 30(9), 2020.
[6] Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random Erasing Data Augmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 2020.
[7] Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. Autoaugment: Learning Augmentation Strategies from Data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
[8] Dan Hendrycks, Norman Mu, Ekin Dogus Cubuk, Barret Zoph, Justin Gilmer, and Balaji Lakshminarayanan. AugMix: A Simple Method to Improve Robustness and Uncertainty Under Data Shift. In International Conference on Learning Representations, 2020.
[9] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation Applied to Handwritten Zip Code Recognition. Neural Computation, 1(4), 1989.
[10] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention Is All You Need. In Advances in Neural Information Processing Systems, 2017.
[11] Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets. arXiv:2201.02177, 2022.
[12] Adityanarayanan Radhakrishnan, Daniel Beaglehole, Parthe Pandit, and Mikhail Belkin. Mechanism for Feature Learning in Neural Networks and Backpropagation-free Machine Learning Models. Science, 383, 2024.
[13] Randall Balestriero and Richard Baraniuk. A Spline Theory of Deep Learning. In Proceedings of the 35th International Conference on Machine Learning. PMLR, 2018.
[14] Randall Balestriero and Richard G Baraniuk. Mad Max: Affine Spline Insights Into Deep Learning. Proceedings of the IEEE, 2020.
[15] Y. Bengio, P. Simard, and P. Frasconi. Learning Long-Term Dependencies with Gradient Descent Is Difficult. IEEE Transactions on Neural Networks, 5(2), 1994.
[16] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the Difficulty of Training Recurrent Neural Networks. In Proceedings of the 30th International Conference on Machine Learning, 2013.
[17] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE, 86(11), 1998.
[18] Alex Krizhevsky and Geoffrey Hinton. Learning Multiple Layers of Features from Tiny Images. Technical report, University of Toronto, 2009.
[19] Sergey Zagoruyko and Nikos Komodakis. Wide Residual Networks. arXiv:1605.07146, 2017.
[20] Randall Balestriero, Romain Cosentino, Behnaam Aazhang, and Richard Baraniuk. The Geometry of Deep Networks: Power Diagram Subdivision. In Neural Information Processing Systems, 2019.
[21] Ahmed Imtiaz Humayun, Randall Balestriero, Guha Balakrishnan, and Richard Baraniuk. SplineCam: Exact Visualization and Characterization of Deep Network Geometry and Decision Boundaries. In IEEE Conference on Computer Vision and Pattern Recognition, 2023.
[22] C. A. Rogers. Packing and Covering. Cambridge University Press, 1964. ISBN 978-0-521-09034-6.
[23] Hiroshi Imai, Masao Iri, and Kazuo Murota. Voronoi Diagram in the Laguerre Geometry and Its Applications. SIAM Journal on Computing, 14(1), 1985.
[24] David Nister and Henrik Stewenius. Scalable Recognition With a Vocabulary Tree. In IEEE Conference on Computer Vision and Pattern Recognition, volume 2. IEEE, 2006.
[25] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural Tangent Kernel: Convergence and Generalization in Neural Networks. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2018.
[26] Lénaïc Chizat, Edouard Oyallon, and Francis Bach. On Lazy Training in Differentiable Programming. In Advances in Neural Information Processing Systems, 2019.
[27] Blake Woodworth, Suriya Gunasekar, Jason D. Lee, Edward Moroshko, Pedro Savarese, Itay Golan, Daniel Soudry, and Nathan Srebro. Kernel and Rich Regimes in Overparametrized Models. In Proceedings of the 33rd Conference on Learning Theory, 2020.
[28] Edward Moroshko, Blake Woodworth, Suriya Gunasekar, Jason D. Lee, Nathan Srebro, and Daniel Soudry. Implicit Bias in Deep Linear Classification: Initialization Scale vs Training Accuracy. In Advances in Neural Information Processing Systems, 2020.
[29] Kaifeng Lyu, Jikai Jin, Zhiyuan Li, Simon Shaolei Du, Jason D. Lee, and Wei Hu. Dichotomy of Early and Late Phase Implicit Biases Can Provably Induce Grokking. In The 12th International Conference on Learning Representations, 2024.
[30] Noa Rubin, Inbar Seroussi, and Zohar Ringel. Grokking as a First Order Phase Transition in Two Layer Networks. In The 12th International Conference on Learning Representations, 2024.
[31] Tanishq Kumar, Blake Bordelon, Samuel J. Gershman, and Cengiz Pehlevan. Grokking as the Transition From Lazy to Rich Training Dynamics. In The Twelfth International Conference on Learning Representations, 2024.
[32] Jaerin Lee, Bong Gyun Kang, Kihoon Kim, and Kyoung Mu Lee. Grokfast: Accelerated Grokking by Amplifying Slow Gradients. arXiv:2405.20233, 2024.
[33] Lucas Prieto, Melih Barsbey, Pedro A. M. Mediano, and Tolga Birdal. Grokking at the Edge of Numerical Stability. In The 13th International Conference on Learning Representations, 2025.
[34] Zhiwei Xu, Zhiyu Ni, Yixin Wang, and Wei Hu. Let Me Grok For You: Accelerating Grokking via Embedding Transfer From a Weaker Model. In The 13th International Conference on Learning Representations, 2025.
[35] Ziming Liu, Eric J. Michaud, and Max Tegmark. Omnigrok: Grokking Beyond Algorithmic Data. In The 11th International Conference on Learning Representations, 2022.
[36] Neil Mallinar, Daniel Beaglehole, Libin Zhu, Adityanarayanan Radhakrishnan, Parthe Pandit, and Mikhail Belkin. Emergence in non-neural models: Grokking modular arithmetic via average gradient outer product. arXiv:2407.20199, 2025.
[37] Manuel Fernández-Delgado, Eva Cernadas, Senén Barro, and Dinani Amorim. Do We Need Hundreds of Classifiers to Solve Real World Classification Problems? Journal of Machine Learning Research, 15(90), 2014.
[38] Nick Erickson, Lennart Purucker, Andrej Tschalzev, David Holzmüller, Prateek Mutalik Desai, David Salinas, and Frank Hutter. TabArena: A Living Benchmark for Machine Learning on Tabular Data. In Proceedings of the 39th Conference on Neural Information Processing Systems, 2025.
[39] Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness May Be at Odds with Accuracy. In International Conference on Learning Representations, 2019.
[40] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards Deep Learning Models Resistant to Adversarial Attacks. In International Conference on Learning Representations, 2018.
[41] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[42] Benjamin Recht, Maryam Fazel, and Pablo A. Parrilo. Guaranteed Minimum-rank Solutions of Linear Matrix Equations via Nuclear Norm Minimization. SIAM Review, 52(3), 2010.
[43] Christopher Scarvelis and Justin Solomon. Nuclear Norm Regularization for Deep Learning. In The 38th Annual Conference on Neural Information Processing Systems, 2024.
[44] Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization. In International Conference on Learning Representations, 2019.
[45] Jiequan Cui, Zhuotao Tian, Zhisheng Zhong, Xiaojun Qi, Bei Yu, and Hanwang Zhang. Decoupled Kullback-Leibler Divergence Loss. In The 38th Annual Conference on Neural Information Processing Systems, 2024.
[46] Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress Measures for Grokking via Mechanistic Interpretability. In The 11th International Conference on Learning Representations, 2022.

A Constructing Normal Aligned Deep Networks

For simplicity, consider each x_i to be unit norm and let f(x) = W_2 σ(W_1 x + b) where W_1 ∈ ℝ^{n×d}, b ∈ ℝ^n...

