Feature Identification via the Empirical NTK
Pith reviewed 2026-05-18 10:57 UTC · model grok-4.3
The pith
Eigenanalysis of the empirical neural tangent kernel surfaces feature directions in trained neural networks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Eigenanalysis of the empirical neural tangent kernel can surface feature directions in trained neural networks. Across a one-layer MLP and a one-layer transformer trained on modular addition, the leading eNTK eigenspaces align with the Fourier features that implement the ground-truth algorithms. The same alignment evolves during training and its rate of change peaks near the onset of grokking. When applied to context windows from TinyStories in Gemma-3-270M, the top eNTK eigendirections match an automatically generated set of grammatical features more closely than a matched-budget PCA decomposition of model activations.
What carries the argument
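The top eigenspaces of the empirical neural tangent kernel (eNTK), the Gram matrix of the network's parameter gradients evaluated on a set of inputs. Its leading eigenvectors pick out the patterns across those inputs to which the trained network's output is most sensitive.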
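As a minimal sketch of the eigenanalysis named above, the block below builds the eNTK Gram matrix K = J J^T for a toy scalar-output network f(x) = w2 . tanh(W1 x) and sorts its eigenpairs. The network, its random weights, and the helper names `entk_jacobian` and `entk_eigh` are assumptions made for illustration; they are not the paper's models or code.

```python
import numpy as np

def entk_jacobian(W1, w2, X):
    """Per-example parameter gradients of f(x) = w2 . tanh(W1 x).

    Returns J with shape (n_examples, n_params); row i is df(x_i)/dtheta,
    with W1 flattened row-major followed by w2.
    """
    H = np.tanh(X @ W1.T)                                   # (n, h) hidden activations
    # df/dW1[j, k] = w2[j] * (1 - tanh^2(z_j)) * x[k]
    dW1 = (w2 * (1.0 - H**2))[:, :, None] * X[:, None, :]   # (n, h, d)
    # df/dw2[j] = tanh(z_j)
    return np.concatenate([dW1.reshape(len(X), -1), H], axis=1)

def entk_eigh(J):
    """eNTK Gram matrix K = J J^T and its eigenpairs, largest eigenvalue first."""
    K = J @ J.T
    evals, evecs = np.linalg.eigh(K)                        # eigh returns ascending order
    order = np.argsort(evals)[::-1]
    return K, evals[order], evecs[:, order]

# Hypothetical toy setup: random weights stand in for a trained network.
rng = np.random.default_rng(0)
n, d, h = 32, 5, 16
X = rng.normal(size=(n, d))
W1 = rng.normal(size=(h, d)) / np.sqrt(d)
w2 = rng.normal(size=h) / np.sqrt(h)

J = entk_jacobian(W1, w2, X)
K, evals, evecs = entk_eigh(J)
top_eigenspace = evecs[:, :4]    # leading eigenvectors, one entry per input example
```

Each eigenvector here has one entry per input example, so comparing it with a feature requires expressing that feature as a function over the same examples.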
If this is right
- In modular-arithmetic networks the top eNTK subspaces align with the Fourier components that implement the algorithm.
- The alignment evolves during training, and its first derivative peaks near the onset of grokking.
- In a pretrained language model the eNTK directions match grammatical features more accurately than an equal-cost PCA baseline.
- The method therefore supplies an input-space probe for mechanistic interpretability that does not require inspecting hidden activations.
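To make the Fourier-alignment item concrete, here is a small sketch that builds an orthonormal Fourier basis over residues mod p and scores how much of one subspace lies inside another. The helper names `fourier_basis` and `subspace_overlap`, the modulus 13, and the overlap score itself are assumptions for illustration; this page does not state the metric the paper uses.

```python
import numpy as np

def fourier_basis(p, freqs):
    """Unit-norm cos/sin columns over residues 0..p-1 for the given frequencies.

    Assumes p is odd and each frequency k satisfies 1 <= k <= (p - 1) / 2,
    so that all columns are nonzero and mutually orthogonal.
    """
    a = np.arange(p)
    cols = []
    for k in freqs:
        cols.append(np.cos(2 * np.pi * k * a / p))
        cols.append(np.sin(2 * np.pi * k * a / p))
    B = np.stack(cols, axis=1)                  # (p, 2 * len(freqs))
    return B / np.linalg.norm(B, axis=0)

def subspace_overlap(U, V):
    """Mean squared cosine of the principal angles between span(U) and span(V).

    Returns 1.0 when the smaller subspace is contained in the larger one and
    0.0 when the two subspaces are orthogonal.
    """
    Qu, _ = np.linalg.qr(U)
    Qv, _ = np.linalg.qr(V)
    return float(np.linalg.norm(Qu.T @ Qv) ** 2 / min(Qu.shape[1], Qv.shape[1]))

p = 13                                          # hypothetical modulus
target = fourier_basis(p, [1, 3])               # frequencies assumed "used" by a model
same = subspace_overlap(target, target)
other = subspace_overlap(target, fourier_basis(p, [2, 5]))
```

In practice `U` would be replaced by a candidate subspace extracted from the model, expressed over the same residues as the Fourier basis.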
Where Pith is reading between the lines
- The same procedure could be run on larger models to discover previously unknown feature directions without hand-crafted probes.
- One could test whether editing model weights to reduce sensitivity along eNTK directions alters behavior more than editing along random directions.
- If the alignment persists across architectures, it would link kernel-based sensitivity analysis to the emergence of structured computation in overparameterized networks.
Load-bearing premise
The alignment between top eNTK eigenspaces and known or interpretable features shows that the kernel is highlighting the directions the model actually uses rather than merely reflecting dataset statistics or kernel construction.
What would settle it
Measure whether ablating or perturbing inputs along the leading eNTK eigendirections produces larger output changes than equivalent perturbations along PCA directions, in a model whose ground-truth features are known.
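A perturbation test of this kind could be sketched as follows. The sketch assumes the candidate directions are already available as vectors in the model's input (or embedding) space; how eNTK eigenvectors would be mapped to such vectors is not specified here. The function names and the linear toy model are hypothetical.

```python
import numpy as np

def mean_output_shift(f, X, direction, eps=1e-2):
    """Average output change when every input is nudged by eps along a unit direction."""
    v = direction / np.linalg.norm(direction)
    return float(np.mean(np.linalg.norm(f(X + eps * v) - f(X), axis=1)))

def compare_directions(f, X, candidates, n_random=100, eps=1e-2, seed=0):
    """Output shift for each named candidate direction, plus a random-direction mean."""
    rng = np.random.default_rng(seed)
    random_shifts = [
        mean_output_shift(f, X, rng.normal(size=X.shape[1]), eps)
        for _ in range(n_random)
    ]
    shifts = {name: mean_output_shift(f, X, v, eps) for name, v in candidates.items()}
    return shifts, float(np.mean(random_shifts))

# Toy model with a known sensitive axis: a linear map with gains 5, 1 and 0.1.
A = np.diag([5.0, 1.0, 0.1])
f = lambda X: X @ A.T
X = np.random.default_rng(1).normal(size=(64, 3))

shifts, random_mean = compare_directions(
    f, X, {"high_gain": np.array([1.0, 0.0, 0.0]),
           "low_gain": np.array([0.0, 0.0, 1.0])}
)
```

For a real comparison the `candidates` dictionary would hold the eNTK-derived and PCA-derived directions, and the random-direction mean serves as the reference level.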
Original abstract
We provide evidence that eigenanalysis of the empirical neural tangent kernel (eNTK) can surface feature directions in trained neural networks. Across three increasingly realistic settings -- a 1-layer MLP trained on modular addition, a 1-layer Transformer trained on modular addition and the pretrained language model Gemma-3-270M -- we show that top eigenspaces of the eNTK align with ground-truth or interpretable features. In the modular arithmetic examples, top eNTK eigenspaces align with the Fourier features used by the MLP and the Fourier features at seed-dependent frequencies used by the Transformer to implement known ground-truth algorithms. Moreover, the alignment of the relevant subspaces evolves over training, with its first derivative peaking near the onset of grokking. For Gemma-3-270M, we compute top eNTK eigendirections on a dataset of TinyStories context windows and check their alignment with an automatically-generated set of parts-of-speech and other grammatical feature directions. We find that the alignment of eNTK eigendirections with grammar features outperforms a same-budget baseline of PCA on model activations. These results suggest that eNTK eigenanalysis may provide a new handle towards identifying features in trained models for mechanistic interpretability.
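For orientation, the PCA baseline mentioned in the abstract can be sketched as a truncated SVD of centered activations. The activation matrix below is synthetic, the name `pca_directions` is hypothetical, and reading "same-budget" as "same number of directions k" is an assumption rather than something the abstract states.

```python
import numpy as np

def pca_directions(activations, k):
    """Top-k principal directions (as columns) and their explained-variance ratios."""
    A = np.asarray(activations, dtype=float)
    centered = A - A.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing singular value.
    _, s, Vt = np.linalg.svd(centered, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    return Vt[:k].T, explained[:k]

# Synthetic activations whose first coordinate carries most of the variance.
rng = np.random.default_rng(0)
scales = np.array([10.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0])
acts = rng.normal(size=(500, 8)) * scales

dirs, explained = pca_directions(acts, k=3)
```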
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that eigenanalysis of the empirical neural tangent kernel (eNTK) can surface feature directions in trained neural networks. Evidence is provided via subspace alignments between top eNTK eigenspaces and ground-truth Fourier modes in a 1-layer MLP and 1-layer Transformer on modular addition (with alignment dynamics peaking near grokking), plus superior alignment with automatically generated grammatical features versus a PCA baseline on activations in Gemma-3-270M.
Significance. If the alignments reflect mechanistically relevant directions rather than data-distribution artifacts, this would supply a gradient-based, largely unsupervised handle for feature identification that complements activation-space methods such as PCA. The cross-model empirical scope (synthetic arithmetic to pretrained LLM) and the temporal link to grokking are concrete strengths that, if statistically grounded, could influence mechanistic interpretability practice.
major comments (2)
- §3.3 (Gemma-3-270M): the reported outperformance of eNTK eigendirections over PCA on grammatical features is stated without the precise alignment metric (e.g., average cosine similarity, principal-angle overlap, or projection norm), the number of top-k eigenspaces or features compared, or any statistical significance tests or multiple-comparison corrections. These omissions make the quantitative superiority claim difficult to assess and are load-bearing for the central empirical argument.
- §3.1–§3.2 (modular-addition experiments): the manuscript interprets alignment between top eNTK eigenspaces and Fourier features as evidence that eNTK surfaces directions the network actually employs for computation. No causal test (e.g., ablating the identified directions and measuring task-performance degradation, or comparing against random subspaces of equal dimension) is reported; the observed overlaps could arise from the eNTK definition (inner product of gradients) preferentially weighting high-sensitivity directions without those directions being mechanistically active in the forward pass.
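The three candidate metrics named in the first comment, and the random-subspace control named in the second, can be sketched as below. All function names are hypothetical, vectors are stored as columns, and nothing here is claimed to be the paper's actual metric.

```python
import numpy as np

def orthonormalize(U):
    """Orthonormal basis for the column span of U."""
    Q, _ = np.linalg.qr(U)
    return Q

def principal_angle_cosines(U, V):
    """Cosines of the principal angles between span(U) and span(V)."""
    s = np.linalg.svd(orthonormalize(U).T @ orthonormalize(V), compute_uv=False)
    return np.clip(s, 0.0, 1.0)

def mean_max_abs_cosine(directions, features):
    """For each feature, the best |cosine| over all directions, then averaged."""
    D = directions / np.linalg.norm(directions, axis=0)
    F = features / np.linalg.norm(features, axis=0)
    return float(np.mean(np.max(np.abs(F.T @ D), axis=1)))

def projection_norms(U, features):
    """Norm of each unit-normalized feature after projection onto span(U)."""
    F = features / np.linalg.norm(features, axis=0)
    return np.linalg.norm(orthonormalize(U).T @ F, axis=0)

def random_subspace_baseline(metric, dim, k, features, n_draws=200, seed=0):
    """Metric values for random k-dimensional subspaces of the same ambient space."""
    rng = np.random.default_rng(seed)
    return np.array([metric(rng.normal(size=(dim, k)), features)
                     for _ in range(n_draws)])

# Toy check: three features that lie inside a 5-dimensional candidate subspace.
dim = 50
features = np.eye(dim)[:, :3]
U = np.eye(dim)[:, :5]
score = mean_max_abs_cosine(U, features)
baseline = random_subspace_baseline(mean_max_abs_cosine, dim, 5, features)
```

Comparing `score` with the spread of `baseline` gives a rough sense of how far an observed alignment sits above chance for subspaces of equal dimension.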
minor comments (3)
- Abstract: quantitative alignment metrics and the exact definition of “alignment” should be stated explicitly rather than left as “align with.”
- Figure captions (temporal-alignment plots): include error bars across random seeds and state the number of independent training runs used to generate the derivative peak near grokking.
- Methods: specify the exact procedure for computing the empirical NTK (full-batch vs. mini-batch, any low-rank approximations, and the precise dataset subset used for Gemma-3-270M).
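The error-bar request on the temporal-alignment plots could be met with something like the sketch below, which averages alignment curves over seeds, reports a standard error, and locates the step where the finite-difference derivative of the mean curve is largest. The curves are synthetic sigmoids invented for illustration, not data from the paper, and `alignment_rate_peak` is a hypothetical helper.

```python
import numpy as np

def alignment_rate_peak(steps, alignment_by_seed):
    """Seed-averaged alignment, its standard error, its rate of change, and the peak step."""
    A = np.asarray(alignment_by_seed, dtype=float)       # (n_seeds, n_steps)
    mean = A.mean(axis=0)
    sem = A.std(axis=0, ddof=1) / np.sqrt(A.shape[0])    # standard error across seeds
    rate = np.gradient(mean, steps)                      # finite-difference derivative
    peak_step = int(steps[int(np.argmax(rate))])
    return mean, sem, rate, peak_step

# Synthetic curves: three "seeds" whose alignment rises around step 3000.
steps = np.arange(0, 6001, 100)
centers = [2900, 3000, 3100]
curves = [1.0 / (1.0 + np.exp(-(steps - c) / 200.0)) for c in centers]

mean, sem, rate, peak_step = alignment_rate_peak(steps, curves)
```

Reporting `sem` alongside `mean`, together with the number of rows in `alignment_by_seed`, covers both parts of the request.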
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. The comments highlight important areas for clarifying our quantitative claims and strengthening the interpretation of the reported alignments. We address each major comment below and indicate the corresponding revisions.
- Referee: §3.3 (Gemma-3-270M): the reported outperformance of eNTK eigendirections over PCA on grammatical features is stated without the precise alignment metric (e.g., average cosine similarity, principal-angle overlap, or projection norm), the number of top-k eigenspaces or features compared, or any statistical significance tests or multiple-comparison corrections. These omissions make the quantitative superiority claim difficult to assess and are load-bearing for the central empirical argument.
  Authors: We agree that the original manuscript omitted key details needed to evaluate the quantitative claim. In the revised version we will specify that alignment is measured by average cosine similarity, that we compare the top-10 eNTK eigendirections against the automatically generated grammatical feature directions, and that statistical significance is assessed via a permutation test (1,000 shuffles) with Bonferroni correction for multiple comparisons. These additions will make the reported outperformance over the PCA baseline directly verifiable.
  Revision: yes
- Referee: §3.1–§3.2 (modular-addition experiments): the manuscript interprets alignment between top eNTK eigenspaces and Fourier features as evidence that eNTK surfaces directions the network actually employs for computation. No causal test (e.g., ablating the identified directions and measuring task-performance degradation, or comparing against random subspaces of equal dimension) is reported; the observed overlaps could arise from the eNTK definition (inner product of gradients) preferentially weighting high-sensitivity directions without those directions being mechanistically active in the forward pass.
  Authors: The referee correctly identifies that our current evidence is correlational rather than causal. In the modular-addition settings the Fourier features are independently known to implement the task (as established by prior grokking literature), and the temporal alignment with the grokking transition supplies additional support. To address the possibility of an eNTK-specific artifact, the revision will include a direct comparison of observed alignments against alignments obtained with random subspaces of identical dimension. Explicit ablation of the identified directions lies outside the scope of the present study but remains a natural direction for future work.
  Revision: partial
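The significance procedure promised in the first response (a permutation test with 1,000 shuffles and a Bonferroni correction) might look roughly like the sketch below. The rebuttal does not say what is shuffled, so the paired sign-flip scheme over per-feature scores is an assumption, as are the function names and the synthetic scores.

```python
import numpy as np

def paired_permutation_pvalue(scores_a, scores_b, n_shuffles=1000, seed=0):
    """One-sided paired sign-flip test that mean(scores_a - scores_b) > 0."""
    rng = np.random.default_rng(seed)
    diff = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    observed = diff.mean()
    # Under the null the label (a vs. b) of each pair is exchangeable,
    # so each paired difference may have its sign flipped at random.
    signs = rng.choice([-1.0, 1.0], size=(n_shuffles, diff.size))
    null = (signs * diff).mean(axis=1)
    # The +1 terms count the observed labelling itself and keep p above zero.
    return float((1 + np.sum(null >= observed)) / (n_shuffles + 1))

def bonferroni(pvalues):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    p = np.asarray(pvalues, dtype=float)
    return np.minimum(p * p.size, 1.0)

# Synthetic per-feature alignment scores (not results from the paper).
rng = np.random.default_rng(1)
pca_scores = rng.uniform(0.2, 0.4, size=20)
entk_scores = pca_scores + 0.1

p = paired_permutation_pvalue(entk_scores, pca_scores)
adjusted = bonferroni([p, 0.04, 0.5])
```

With 1,000 shuffles the smallest attainable p-value is 1/1001, which bounds how many comparisons a Bonferroni correction can absorb at a given significance level.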
Circularity Check
No circularity: empirical alignments measured against external labels
The paper reports direct empirical measurements of subspace overlap between top eNTK eigenvectors and independently supplied ground-truth Fourier modes (modular addition) or auto-generated grammatical directions (Gemma-3-270M). These alignments are computed from the trained model and compared to external references; they are not obtained by fitting parameters inside the paper and then relabeling the fit as a prediction. No derivation chain, self-citation, or ansatz reduces the reported overlap scores to quantities defined by the paper's own inputs. The work is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The empirical NTK approximates the tangent kernel of the trained network at the current parameters.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: eigenanalysis of the empirical neural tangent kernel (eNTK) can surface feature directions in trained neural networks
- IndisputableMonolith/Foundation/BranchSelection.lean, branch_selection (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: top eNTK eigenspaces align with ground-truth Fourier features
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.