pith. machine review for the scientific record.

arxiv: 2605.08464 · v1 · submitted 2026-05-08 · 💻 cs.LG

Recognition: 2 theorem links


The Geometric Structure of Models Learning Sparse Data

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:29 UTC · model grok-4.3

classification 💻 cs.LG
keywords normal alignment · sparse regime · manifold hypothesis · Jacobian · grokking · adversarial robustness · deep networks · power diagram

The pith

Normal-aligned classifiers with rank-one Jacobians minimize training loss and maximize local robustness in sparse regimes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

In regimes where data is too sparse for the manifold hypothesis to hold, models succeed by exploiting normal alignment, a geometric property where the input-output Jacobian is rank-one and aligns exactly with the training points. The paper proves that such normal-aligned classifiers minimize the training objective when subject to norm constraints and achieve the highest possible local robustness as long as the Jacobian remains non-zero. This alignment appears in deep networks as centroid alignment inside the power-diagram partitions induced by feature learning. Motivated by the theory, the authors introduce GrokAlign, a regularizer that forces normal alignment and speeds up grokking, and they derive Recursive Feature Alignment Machines that improve adversarial robustness over standard recursive feature machines on tabular data.

Core claim

Normal-aligned classifiers whose input-output Jacobians are rank-one and align perfectly with the training data minimize the training objective under norm constraints and achieve maximal local robustness under a non-zero Jacobian constraint. For continuous piecewise-affine deep networks, normal alignment manifests geometrically as centroid alignment within the network's induced power diagram partition and results from the feature-learning regime.
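
The summary above never writes the alignment condition down. As a reference point, one natural formalization consistent with the abstract (the notation and the score below are our assumptions, not the paper's verbatim definitions) is:

```latex
% Assumed formalization: f : \mathbb{R}^d \to \mathbb{R}^K is the classifier,
% x_i a training point. Rank-one, data-aligned Jacobian:
J_f(x_i) = u_i\, x_i^{\top} \quad \text{for some } u_i \in \mathbb{R}^K,
% with a per-point alignment score
A(x_i) = \frac{\lVert J_f(x_i)\, x_i \rVert_2}{\lVert J_f(x_i) \rVert_F \, \lVert x_i \rVert_2} \in [0, 1],
% which equals 1 exactly when J_f(x_i) is rank-one with row space spanned by x_i.
% Power diagram cell with centroid c_k and radius r_k (standard definition, cf. [20], [23]):
\omega_k = \{\, x \in \mathbb{R}^d : \lVert x - c_k \rVert^2 - r_k \le \lVert x - c_j \rVert^2 - r_j \;\; \forall j \,\}.
```

On this reading, centroid alignment says that on each cell of the network's induced power diagram the Jacobian direction aligns with that cell's centroid; that is an interpretation, not a quoted definition.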

What carries the argument

Normal alignment: the property that the input-output Jacobian is rank-one and aligns perfectly with the training data. This single structure carries both the optimality and the robustness results.
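
Under the assumed formalization above, the property is directly measurable. Here is a minimal PyTorch sketch; the score A(x) is our stand-in metric, and the paper's own diagnostic may differ:

```python
import torch

def normal_alignment_score(model, x):
    """A(x) = ||J x|| / (||J||_F ||x||), J the input-output Jacobian of
    `model` at `x`. Equals 1 exactly when J is rank-one with row space
    spanned by x (our assumed reading of "normal alignment")."""
    x = x.detach()
    J = torch.autograd.functional.jacobian(model, x)  # shape (K, d)
    return ((J @ x).norm() / (J.norm() * x.norm() + 1e-12)).item()

# Toy check: a map whose Jacobian is u x0^T everywhere scores 1 at x0.
x0, u = torch.randn(16), torch.randn(3)
f = lambda z: torch.outer(u, x0) @ z
print(normal_alignment_score(f, x0))  # ~1.0
```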

If this is right

  • Normal-aligned classifiers minimize the training objective under norm constraints.
  • They achieve maximal local robustness whenever the Jacobian is constrained to be non-zero.
  • GrokAlign regularization induces normal alignment and accelerates training dynamics in grokking scenarios (an alignment-inducing objective is sketched after this list).
  • Recursive Feature Alignment Machines exhibit greater adversarial robustness than standard Recursive Feature Machines on tabular data.
  • In continuous piecewise-affine networks, normal alignment appears as centroid alignment inside the induced power diagram partitions.
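
The paper's exact GrokAlign objective is not reproduced in this summary, so the following is only a sketch of the family the GrokAlign bullet points at: a standard loss plus a term that rewards Jacobian-input alignment, reusing the assumed score A(x) from above.

```python
import torch
import torch.nn.functional as F

def grokalign_style_loss(model, x_batch, y_batch, gamma=0.1):
    """Sketch of an alignment-inducing objective (assumed form, not the
    paper's verbatim GrokAlign): cross-entropy plus gamma * mean(1 - A(x)),
    where 1 - A(x) vanishes iff the input-output Jacobian at x is
    rank-one and aligned with x."""
    loss = F.cross_entropy(model(x_batch), y_batch)
    misalignment = 0.0
    for x in x_batch:  # explicit per-example Jacobians; slow but clear
        single = lambda z: model(z.unsqueeze(0)).squeeze(0)
        J = torch.autograd.functional.jacobian(single, x, create_graph=True)
        A = (J @ x).norm() / (J.norm() * x.norm() + 1e-12)
        misalignment = misalignment + (1.0 - A)
    return loss + gamma * misalignment / len(x_batch)
```

The design contrast with the plain Frobenius penalty γ‖J‖²_F mentioned in Figure 6 is the point: a Frobenius term shrinks the Jacobian uniformly, while an alignment term rewards concentrating it on the data direction without driving its norm to zero, which matters because the robustness claim requires a non-zero Jacobian.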

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If normal alignment is the operative structure in sparse regimes, then similar alignment-inducing regularizers could be tested on other architectures that currently struggle with high-dimensional sparse inputs.
  • The link between normal alignment and grokking raises the possibility that the sudden generalization phase coincides with the emergence of rank-one Jacobian alignment during training.
  • The same geometric principle might be used to derive robust variants of other feature-learning methods beyond recursive feature machines (the base recursive-feature-machine loop is sketched below for orientation).
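
For orientation on the last bullet, here is a compact sketch of the base recursive feature machine loop of [12]: kernel ridge regression alternating with the average-gradient-outer-product (AGOP) update. The Laplace kernel, bandwidth, and trace normalization are common choices rather than details taken from this paper, and the alignment modification that would turn this into an RFAM is the paper's contribution and is not shown.

```python
import numpy as np

def laplace_kernel(X, Z, M, L=10.0):
    """K(x, z) = exp(-||x - z||_M / L) with ||v||_M = sqrt(v^T M v)."""
    sq = ((X @ M) * X).sum(1)[:, None] + ((Z @ M) * Z).sum(1)[None, :] \
         - 2 * (X @ M) @ Z.T
    return np.exp(-np.sqrt(np.maximum(sq, 0.0)) / L)

def rfm(X, y, T=5, reg=1e-3, L=10.0):
    """Base RFM: alternate kernel ridge regression with the AGOP update
    M <- mean_i grad f(x_i) grad f(x_i)^T. Scalar labels y assumed."""
    n, d = X.shape
    M = np.eye(d)
    for _ in range(T):
        K = laplace_kernel(X, X, M, L)
        alpha = np.linalg.solve(K + reg * np.eye(n), y)   # fit f
        G = np.zeros((n, d))                              # grad f at each x_i
        for i in range(n):
            diff = X[i] - X                               # rows: x_i - x_j
            dist = np.sqrt(np.maximum(((diff @ M) * diff).sum(1), 1e-12))
            w = -alpha * np.exp(-dist / L) / (L * dist)
            w[i] = 0.0                                    # drop singular j = i term
            G[i] = (w[:, None] * diff).sum(0) @ M
        M = G.T @ G / n                                   # AGOP
        M /= max(np.trace(M), 1e-12)                      # keep the scale bounded
    return M, alpha  # note: alpha was fit under the previous M
```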

Load-bearing premise

That the sparse regime is precisely where the manifold hypothesis fails, and that rank-one Jacobian alignment is both necessary and sufficient for the claimed optimality and robustness.

What would settle it

A classifier trained on sparse data that achieves strictly lower training loss than every normal-aligned model under identical norm constraints, or a normal-aligned model that fails to attain maximal local robustness.

Figures

Figures reproduced from arXiv: 2605.08464 by Ahmed Imtiaz Humayun, Randall Balestriero, Richard Baraniuk, Thomas Walker, T. Mitchell Roddenberry.

Figure 1. Dataset sparsity is a property of the dataset and model. In the left panel, we monitor the normal alignment during deep network training on a subset of MNIST across varying intensities of data augmentation. In the right panel, we train wide residual deep network architectures [19] robustly on different subset sizes of CIFAR10. At the end of training, we monitor the models’ normal alignment. For more experi… view at source ↗
Figure 2. A one-hidden-layer transformer training on modular arithmetic exhibits centroid alignment. Here we train a one-layer transformer on a modular arithmetic task. On the left, we show the model’s accuracy on the training and held-out test sets. On the right, we show the centroid alignment between the map from the embedding and the logits of the last token in the context. For more experimental details, see Sect… view at source ↗
Figure 3. A deep network with one hidden layer has the capacity to learn a normal-aligned solution for any training set. As the density of the dataset size increases, the irregularity of the deep network — measured by weight norm — increases. In the first and second panels (respectively, third and fourth panels), we depict a training set of size 5 (respectively, 10) along with the level sets of the neurons of a one-… view at source ↗
Figure 4. Centroid alignment increases for deep layers of deep networks. Here we obtain robust ResNet18 and ResNet50 models trained on CIFAR10 [18], and consider the centroid alignment of the map from the input space of intermediate layers to the output space. [Plot residue omitted: axis ticks and panel labels (Epochs vs. Test Accuracy, Alignment, and Effective Rank for 6, 8, and 10 classes).] view at source ↗
Figure 5. A Gaussian kernel logistic regression model exhibits normal alignment, validating Theorem 1. Here we train a Gaussian kernel logistic regression model on a ten-dimensional classification problem with either six, eight, or ten classes. In the left panel, we monitor the model’s test accuracy. In the middle panel, we monitor the model’s normal alignment. In the right panel, we monitor the model’s effective ra… view at source ↗
Figure 6. Optimal classifiers learn solutions with input-output Jacobians that are non-zero. Here, we train a fully connected deep network on a subset of MNIST with 1000 examples across 100 epochs. During training, PGD attacks are applied to the batches, a weight decay of 0.0001 is used, and a Frobenius norm penalty is applied to the loss function with weight γ. In the first panel, we report the model’s accuracy on … view at source ↗
Figure 7. GrokAlign is the most effective regularization strategy for inducing normal alignment in deep networks. We compare the regularization strategies of … view at source ↗
read the original abstract

The manifold hypothesis (MH) is often used to explain how machine learning can overcome the curse of dimensionality. However, the MH is only applicable in regimes where the training data provides a sufficiently dense sample of the underlying low-dimensional data manifold, or where such a low-dimensional manifold is conceivably present. We describe the regimes where the MH is not applicable as sparse. In this paper, we demonstrate that models succeed in the sparse regime by exploiting a highly structured local geometry, a property we formalize as normal alignment. We prove that normal-aligned classifiers -- whose input-output Jacobians are rank-one and align perfectly with the training data -- minimize the training objective under norm constraints and achieve maximal local robustness under a non-zero Jacobian constraint. For continuous piecewise-affine deep networks, normal alignment manifests geometrically as centroid alignment within the network's induced power diagram partition and results from the feature-learning regime. Motivated by these theoretical insights, we introduce GrokAlign, a regularization strategy that actively induces normal alignment. We demonstrate that GrokAlign significantly accelerates the training dynamics of deep networks relevant to the grokking phenomenon. Furthermore, we apply the principle of normal alignment to Recursive Feature Machines (RFMs) to introduce Recursive Feature Alignment Machines (RFAMs). We show that RFAMs exhibit greater adversarial robustness compared to RFMs when trained on tabular data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

0 major / 4 minor

Summary. The paper claims that in sparse regimes where the manifold hypothesis fails (due to insufficient data density on any low-dimensional manifold), models succeed via 'normal alignment': classifiers whose input-output Jacobians are rank-one and perfectly aligned with the training data. It proves these minimize the training objective under norm constraints and maximize local robustness under a non-zero Jacobian constraint. For continuous piecewise-affine networks this manifests as centroid alignment in the induced power diagram during feature learning. The authors introduce GrokAlign regularization to enforce normal alignment (accelerating grokking) and Recursive Feature Alignment Machines (RFAMs) that improve adversarial robustness over RFMs on tabular data.

Significance. If the optimality and robustness proofs hold, the work supplies a geometric account of learning outside the manifold hypothesis, with direct implications for grokking dynamics and adversarial robustness. Credit is due for the explicit derivation of normal alignment from power-diagram geometry in piecewise-affine networks and for the reproducible empirical protocols on grokking acceleration and tabular robustness.

minor comments (4)
  1. [§3.2] Definition of normal alignment: the rank-one Jacobian condition is stated in terms of the input-output map, but the precise alignment metric (inner product with data vectors) should be written as an equation to avoid ambiguity with the later centroid-alignment claim.
  2. [§4.1] GrokAlign objective: the regularization term is introduced without an explicit comparison to the baseline loss; a short paragraph contrasting the two would clarify why the added term enforces centroid alignment rather than merely increasing the Jacobian norm (an illustrative contrast follows this list).
  3. [Table 1] Grokking experiments: the reported acceleration is given only for a single architecture; a brief note on whether the effect size is stable across widths or depths would strengthen the claim that normal alignment is the operative mechanism.
  4. [§5.3] RFAM adversarial evaluation: the attack strength (ε) and the number of steps are not stated in the caption or text; this detail is needed to interpret the robustness gain relative to the RFM baselines.
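
For concreteness, the equation and contrast that comments 1 and 2 ask for might look like the following; this is illustrative notation only, not the paper's actual equations.

```latex
% Comment 1: a candidate alignment metric, written out explicitly.
\mathrm{align}(x_i) = \frac{\lVert J_f(x_i)\, x_i \rVert_2}{\lVert J_f(x_i) \rVert_F \, \lVert x_i \rVert_2}.
% Comment 2: the contrast with a plain Frobenius penalty. The term
% \gamma \lVert J_f(x_i) \rVert_F^2 shrinks the Jacobian uniformly, whereas
\mathcal{L} = \mathcal{L}_{\text{train}} + \gamma \sum_i \bigl(1 - \mathrm{align}(x_i)\bigr)
% rewards concentrating the Jacobian on the data direction while leaving
% its norm free, consistent with the non-zero Jacobian requirement.
```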

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, recognition of the paper's significance, and recommendation of minor revision. We appreciate the acknowledgment of the theoretical contributions on normal alignment and the empirical protocols for GrokAlign and RFAMs.

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The core claim establishes that normal-aligned classifiers (rank-one Jacobians aligned to data) minimize the training objective under norm constraints and maximize local robustness under non-zero Jacobian constraint. This is first proven for general classifiers via direct optimization arguments. For continuous piecewise-affine networks, normal alignment is shown to arise geometrically as centroid alignment in the induced power diagram under the feature-learning regime. These steps rely on independent definitions of normal alignment and power diagrams, without reducing any prediction to a fitted parameter or depending on self-citations for the load-bearing theorems. Applications like GrokAlign and RFAMs are downstream uses, not part of the optimality derivation. The chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 3 invented entities

Based on the abstract alone, the claims rest on the distinction between dense and sparse regimes under the manifold hypothesis and on the assumption that continuous piecewise-affine networks capture the relevant geometry.

axioms (2)
  • domain assumption: The manifold hypothesis applies only when the training data densely samples the underlying low-dimensional manifold. Stated as background for defining the sparse regime.
  • domain assumption: Continuous piecewise-affine deep networks induce a power diagram partition whose centroids align with the data under feature learning. Invoked to translate normal alignment into network geometry.
invented entities (3)
  • normal alignment (no independent evidence). Purpose: formal property of rank-one Jacobians that align with training points; newly defined geometric structure claimed to explain sparse-regime success.
  • GrokAlign (no independent evidence). Purpose: regularization strategy to induce normal alignment; new method motivated by the geometric analysis.
  • RFAM (no independent evidence). Purpose: Recursive Feature Alignment Machine extending RFM with the alignment principle; new variant claimed to improve adversarial robustness.

pith-pipeline@v0.9.0 · 5551 in / 1424 out tokens · 41051 ms · 2026-05-12T03:29:37.746846+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · 2 internal anchors

  [1] Joshua B. Tenenbaum, Vin de Silva, and John C. Langford. A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science, 290(5500), 2000.
  [2] Sam T. Roweis and Lawrence K. Saul. Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science, 290(5500), 2000.
  [3] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. Mixup: Beyond Empirical Risk Minimization. In International Conference on Learning Representations, 2018.
  [4] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Young Joon Yoo. CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features. In IEEE/CVF International Conference on Computer Vision, 2019.
  [5] Ryo Takahashi, Takashi Matsubara, and Kuniaki Uehara. Data Augmentation Using Random Image Cropping and Patching for Deep CNNs. IEEE Transactions on Circuits and Systems for Video Technology, 30(9), 2020.
  [6] Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random Erasing Data Augmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 2020.
  [7] Ekin D. Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V. Le. AutoAugment: Learning Augmentation Strategies from Data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
  [8] Dan Hendrycks, Norman Mu, Ekin Dogus Cubuk, Barret Zoph, Justin Gilmer, and Balaji Lakshminarayanan. AugMix: A Simple Method to Improve Robustness and Uncertainty Under Data Shift. In International Conference on Learning Representations, 2020.
  [9] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation Applied to Handwritten Zip Code Recognition. Neural Computation, 1(4), 1989.
  [10] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention Is All You Need. In Advances in Neural Information Processing Systems, 2017.
  [11] Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets. arXiv:2201.02177, 2022.
  [12] Adityanarayanan Radhakrishnan, Daniel Beaglehole, Parthe Pandit, and Mikhail Belkin. Mechanism for Feature Learning in Neural Networks and Backpropagation-free Machine Learning Models. Science, 383, 2024.
  [13] Randall Balestriero and Richard Baraniuk. A Spline Theory of Deep Learning. In Proceedings of the 35th International Conference on Machine Learning. PMLR, 2018.
  [14] Randall Balestriero and Richard G. Baraniuk. Mad Max: Affine Spline Insights Into Deep Learning. Proceedings of the IEEE, 2020.
  [15] Y. Bengio, P. Simard, and P. Frasconi. Learning Long-Term Dependencies with Gradient Descent Is Difficult. IEEE Transactions on Neural Networks, 5(2), 1994.
  [16] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the Difficulty of Training Recurrent Neural Networks. In Proceedings of the 30th International Conference on Machine Learning, 2013.
  [17] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE, 86(11), 1998.
  [18] Alex Krizhevsky and Geoffrey Hinton. Learning Multiple Layers of Features from Tiny Images. Technical report, University of Toronto, 2009.
  [19] Sergey Zagoruyko and Nikos Komodakis. Wide Residual Networks. arXiv:1605.07146, 2017.
  [20] Randall Balestriero, Romain Cosentino, Behnaam Aazhang, and Richard Baraniuk. The Geometry of Deep Networks: Power Diagram Subdivision. In Neural Information Processing Systems, 2019.
  [21] Ahmed Imtiaz Humayun, Randall Balestriero, Guha Balakrishnan, and Richard Baraniuk. SplineCam: Exact Visualization and Characterization of Deep Network Geometry and Decision Boundaries. In IEEE Conference on Computer Vision and Pattern Recognition, 2023.
  [22] C. A. Rogers. Packing and Covering. Cambridge University Press, 1964. ISBN 978-0-521-09034-6.
  [23] Hiroshi Imai, Masao Iri, and Kazuo Murota. Voronoi Diagram in the Laguerre Geometry and Its Applications. SIAM Journal on Computing, 14(1), 1985.
  [24] David Nister and Henrik Stewenius. Scalable Recognition With a Vocabulary Tree. In IEEE Conference on Computer Vision and Pattern Recognition, volume 2. IEEE, 2006.
  [25] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural Tangent Kernel: Convergence and Generalization in Neural Networks. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2018.
  [26] Lénaïc Chizat, Edouard Oyallon, and Francis Bach. On Lazy Training in Differentiable Programming. In Advances in Neural Information Processing Systems, 2019.
  [27] Blake Woodworth, Suriya Gunasekar, Jason D. Lee, Edward Moroshko, Pedro Savarese, Itay Golan, Daniel Soudry, and Nathan Srebro. Kernel and Rich Regimes in Overparametrized Models. In Proceedings of the 33rd Conference on Learning Theory, 2020.
  [28] Edward Moroshko, Blake Woodworth, Suriya Gunasekar, Jason D. Lee, Nathan Srebro, and Daniel Soudry. Implicit Bias in Deep Linear Classification: Initialization Scale vs Training Accuracy. Advances in Neural Information Processing Systems, 2020.
  [29] Kaifeng Lyu, Jikai Jin, Zhiyuan Li, Simon Shaolei Du, Jason D. Lee, and Wei Hu. Dichotomy of Early and Late Phase Implicit Biases Can Provably Induce Grokking. In The 12th International Conference on Learning Representations, 2024.
  [30] Noa Rubin, Inbar Seroussi, and Zohar Ringel. Grokking as a First Order Phase Transition in Two Layer Networks. In The 12th International Conference on Learning Representations, 2024.
  [31] Tanishq Kumar, Blake Bordelon, Samuel J. Gershman, and Cengiz Pehlevan. Grokking as the Transition From Lazy to Rich Training Dynamics. In The 12th International Conference on Learning Representations, 2024.
  [32] Jaerin Lee, Bong Gyun Kang, Kihoon Kim, and Kyoung Mu Lee. Grokfast: Accelerated Grokking by Amplifying Slow Gradients. arXiv:2405.20233, 2024.
  [33] Lucas Prieto, Melih Barsbey, Pedro A. M. Mediano, and Tolga Birdal. Grokking at the Edge of Numerical Stability. In The 13th International Conference on Learning Representations, 2025.
  [34] Zhiwei Xu, Zhiyu Ni, Yixin Wang, and Wei Hu. Let Me Grok For You: Accelerating Grokking via Embedding Transfer From a Weaker Model. In The 13th International Conference on Learning Representations, 2025.
  [35] Ziming Liu, Eric J. Michaud, and Max Tegmark. Omnigrok: Grokking Beyond Algorithmic Data. In The 11th International Conference on Learning Representations, 2022.
  [36] Neil Mallinar, Daniel Beaglehole, Libin Zhu, Adityanarayanan Radhakrishnan, Parthe Pandit, and Mikhail Belkin. Emergence in Non-neural Models: Grokking Modular Arithmetic via Average Gradient Outer Product. arXiv:2407.20199, 2025.
  [37] Manuel Fernández-Delgado, Eva Cernadas, Senén Barro, and Dinani Amorim. Do We Need Hundreds of Classifiers to Solve Real World Classification Problems? Journal of Machine Learning Research, 15(90), 2014.
  [38] Nick Erickson, Lennart Purucker, Andrej Tschalzev, David Holzmüller, Prateek Mutalik Desai, David Salinas, and Frank Hutter. TabArena: A Living Benchmark for Machine Learning on Tabular Data. In Proceedings of the 39th Conference on Neural Information Processing Systems, 2025.
  [39] Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness May Be at Odds with Accuracy. In International Conference on Learning Representations, 2019.
  [40] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards Deep Learning Models Resistant to Adversarial Attacks. In International Conference on Learning Representations, 2018.
  [41] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  [42] Benjamin Recht, Maryam Fazel, and Pablo A. Parrilo. Guaranteed Minimum-rank Solutions of Linear Matrix Equations via Nuclear Norm Minimization. SIAM Review, 52(3), 2010.
  [43] Christopher Scarvelis and Justin Solomon. Nuclear Norm Regularization for Deep Learning. In The 38th Annual Conference on Neural Information Processing Systems, 2024.
  [44] Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization. In International Conference on Learning Representations, 2019.
  [45] Jiequan Cui, Zhuotao Tian, Zhisheng Zhong, Xiaojun Qi, Bei Yu, and Hanwang Zhang. Decoupled Kullback-Leibler Divergence Loss. In The 38th Annual Conference on Neural Information Processing Systems, 2024.
  [46] Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress Measures for Grokking via Mechanistic Interpretability. In The 11th International Conference on Learning Representations, 2022.