arxiv: 2510.00468 · v4 · submitted 2025-10-01 · 💻 cs.LG · cs.AI

Feature Identification via the Empirical NTK

Pith reviewed 2026-05-18 10:57 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords empirical neural tangent kernel · feature identification · mechanistic interpretability · grokking · modular arithmetic · language models · eigenanalysis · Fourier features

The pith

Eigenanalysis of the empirical neural tangent kernel surfaces feature directions in trained neural networks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that the top eigenvectors of the empirical neural tangent kernel computed on a trained network's data point to input directions that match the features the network actually uses. This holds in a one-layer MLP solving modular addition, where the directions recover the known Fourier basis, and in a one-layer transformer that uses seed-dependent frequencies for the same task. In the small pretrained language model Gemma-3-270M, the same eigenspaces align with automatically detected grammatical features such as parts of speech more closely than a same-cost PCA baseline on activations. Alignment strength changes across training and reaches its steepest rise near the grokking transition. These observations indicate that eNTK eigenanalysis offers a practical way to extract the input features driving model behavior.

Core claim

Eigenanalysis of the empirical neural tangent kernel can surface feature directions in trained neural networks. Across a one-layer MLP and a one-layer transformer trained on modular addition, the leading eNTK eigenspaces align with the Fourier features that implement the ground-truth algorithms. The same alignment evolves during training and its rate of change peaks near the onset of grokking. When applied to context windows from TinyStories in Gemma-3-270M, the top eNTK eigendirections match an automatically generated set of grammatical features more closely than a matched-budget PCA decomposition of model activations.
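
The statement that the alignment's rate of change peaks near grokking can be checked with a finite-difference derivative of an alignment-versus-step curve. The sketch below is illustrative only: the logistic curve is synthetic stand-in data, not a result from the paper, and `steepest_rise` is a hypothetical helper name.

```python
import numpy as np

def steepest_rise(steps, alignment):
    """Return the training step where the alignment curve rises fastest,
    together with the estimated rate of change at every step."""
    steps = np.asarray(steps, dtype=float)
    alignment = np.asarray(alignment, dtype=float)
    # Central differences in the interior, one-sided at the two ends.
    rate = np.gradient(alignment, steps)
    return steps[int(np.argmax(rate))], rate

# Synthetic stand-in for an alignment curve (assumption, NOT paper data):
# a logistic rise centred at step 3000.
steps = np.arange(0, 6001, 100)
alignment = 1.0 / (1.0 + np.exp(-(steps - 3000) / 200.0))

peak_step, rate = steepest_rise(steps, alignment)
print(peak_step)
```

In practice the curve would come from recomputing the subspace alignment at saved checkpoints, and the peak location would be compared with the step at which test accuracy begins to rise.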

What carries the argument

The top eigenspaces of the empirical neural tangent kernel (eNTK), which identify directions in input space along which the trained network's output varies most strongly.
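
As a rough illustration of the object involved, the sketch below builds an eNTK for a tiny scalar-output MLP and takes its eigendecomposition. It is a minimal sketch under stated assumptions: the network `f(x) = w2 . tanh(W1 x)`, the random weights standing in for a trained model, and the helper name `entk_scalar_mlp` are all hypothetical, and the paper's multi-output ("flattened") kernel is larger than this scalar case. The kernel here is the Gram matrix of parameter gradients, so its eigenvectors are indexed by data points.

```python
import numpy as np

def entk_scalar_mlp(X, W1, w2):
    """Empirical NTK of f(x) = w2 . tanh(W1 x), evaluated on the rows of X.

    K[i, j] is the inner product of the parameter gradients of f at X[i] and X[j].
    """
    H = np.tanh(X @ W1.T)                                   # (N, hidden) activations
    # df/dW1[j, i] = w2[j] * (1 - h_j^2) * x_i
    dW1 = (w2 * (1.0 - H ** 2))[:, :, None] * X[:, None, :]  # (N, hidden, d_in)
    # df/dw2[j] = h_j, so the Jacobian rows are [vec(dW1), H].
    J = np.concatenate([dW1.reshape(len(X), -1), H], axis=1)  # (N, n_params)
    return J @ J.T

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 5))      # 32 data points, 5 input dimensions
W1 = rng.normal(size=(8, 5))      # random weights stand in for a trained network
w2 = rng.normal(size=8)

K = entk_scalar_mlp(X, W1, w2)
eigvals, eigvecs = np.linalg.eigh(K)                  # ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]    # leading eigenpairs first
```

The leading columns of `eigvecs` are the candidates one would compare against known or interpretable features.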

If this is right

  • In modular-arithmetic networks the top eNTK subspaces recover the exact Fourier components that implement the algorithm.
  • Alignment strength increases during training and its derivative peaks at the grokking transition.
  • In a pretrained language model the eNTK directions match grammatical features more accurately than an equal-cost PCA baseline.
  • The method therefore supplies an input-space probe for mechanistic interpretability that does not require inspecting hidden activations.
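
One way to quantify "recovering Fourier components" is the overlap between a candidate subspace and the span of cosine and sine vectors at chosen frequencies. The following sketch assumes an overlap score based on principal angles and an illustrative modulus `p = 113`; neither the metric nor the modulus is confirmed by the text above, and the candidate subspace is constructed by hand rather than taken from an eNTK.

```python
import numpy as np

def fourier_basis(p, freqs):
    """Columns cos(2*pi*k*a/p) and sin(2*pi*k*a/p) over residues a = 0..p-1."""
    a = np.arange(p)
    cols = []
    for k in freqs:
        cols.append(np.cos(2 * np.pi * k * a / p))
        cols.append(np.sin(2 * np.pi * k * a / p))
    return np.stack(cols, axis=1)            # shape (p, 2 * len(freqs))

def subspace_overlap(A, B):
    """Mean squared cosine of the principal angles between span(A) and span(B).

    1.0 means one span contains the other; 0.0 means they are orthogonal.
    """
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    cosines = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return float(np.mean(cosines ** 2))

p = 113        # illustrative modulus (assumption)
theta = 0.7    # arbitrary rotation inside the frequency-3 plane
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
candidate = fourier_basis(p, [3]) @ R       # hand-built stand-in for a top eigenspace

same_freq = subspace_overlap(candidate, fourier_basis(p, [3]))
other_freq = subspace_overlap(candidate, fourier_basis(p, [5]))
```

Because the score depends only on the spans, it is insensitive to rotations within an eigenspace, which matters when eigenvalues are degenerate.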

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same procedure could be run on larger models to discover previously unknown feature directions without hand-crafted probes.
  • One could test whether editing model weights to reduce sensitivity along eNTK directions alters behavior more than editing along random directions.
  • If the alignment persists across architectures, it would link kernel-based sensitivity analysis to the emergence of structured computation in overparameterized networks.

Load-bearing premise

The alignment between top eNTK eigenspaces and known or interpretable features shows that the kernel is highlighting the directions the model actually uses rather than merely reflecting dataset statistics or kernel construction.

What would settle it

Measure whether ablating or perturbing inputs along the leading eNTK eigendirections produces larger output changes than equivalent perturbations along PCA directions, in a model whose ground-truth features are known.
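
The proposed test can be sketched on a toy model whose ground-truth feature is known. Everything below is a hypothetical setup: the model reads a single direction `u`, the data are given most of their variance along a different axis, and `u` merely stands in for a direction that an eNTK analysis might return. The sketch only illustrates the measurement protocol, not any result.

```python
import numpy as np

def mean_output_change(f, X, direction, eps=1e-2):
    """Average |f(x + eps * d) - f(x)| over the rows of X, with d of unit norm."""
    d = direction / np.linalg.norm(direction)
    return float(np.mean(np.abs(f(X + eps * d) - f(X))))

rng = np.random.default_rng(0)
u = np.array([1.0, 0.0, 0.0])                        # feature the toy model actually reads
X = rng.normal(size=(500, 3)) * np.array([1.0, 5.0, 1.0])  # most variance lies on axis 1

def f(Z):
    """Toy model with a known ground-truth feature direction u."""
    return np.tanh(Z @ u)

# Top PCA direction of the inputs: follows data variance, not model usage.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
pca_dir = Vt[0]

change_feature = mean_output_change(f, X, u)
change_pca = mean_output_change(f, X, pca_dir)
print(change_feature, change_pca)
```

In this toy the perturbation along `u` moves the output far more than the perturbation along the top PCA direction, which is the pattern the test would look for when `u` is replaced by a direction derived from the eNTK.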

Figures

Figures reproduced from arXiv: 2510.00468 by Jennifer Lin.

Figure 1: Eigenvalue spectrum of the flattened eNTK (top row) and column-normalized, importance ...
Figure 2: Full and layerwise eNTK spectrum of the modular arithmetic model after training to ...
Figure 3: Spectral change at the grokking phase transition. Left: Train and test accuracy shows ...
Figure 4: Results from applying the two-stage per-axis graph smoothness algorithm described in ...
Figure 5: Results from applying the two-stage graph smoothness algorithm described in Appendix ...
Original abstract

We provide evidence that eigenanalysis of the empirical neural tangent kernel (eNTK) can surface feature directions in trained neural networks. Across three increasingly realistic settings -- a 1-layer MLP trained on modular addition, a 1-layer Transformer trained on modular addition and the pretrained language model Gemma-3-270M -- we show that top eigenspaces of the eNTK align with ground-truth or interpretable features. In the modular arithmetic examples, top eNTK eigenspaces align with the Fourier features used by the MLP and the Fourier features at seed-dependent frequencies used by the Transformer to implement known ground-truth algorithms. Moreover, the alignment of the relevant subspaces evolves over training, with its first derivative peaking near the onset of grokking. For Gemma-3-270M, we compute top eNTK eigendirections on a dataset of TinyStories context windows and check their alignment with an automatically-generated set of parts-of-speech and other grammatical feature directions. We find that the alignment of eNTK eigendirections with grammar features outperforms a same-budget baseline of PCA on model activations. These results suggest that eNTK eigenanalysis may provide a new handle towards identifying features in trained models for mechanistic interpretability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper claims that eigenanalysis of the empirical neural tangent kernel (eNTK) can surface feature directions in trained neural networks. Evidence is provided via subspace alignments between top eNTK eigenspaces and ground-truth Fourier modes in a 1-layer MLP and 1-layer Transformer on modular addition (with alignment dynamics peaking near grokking), plus superior alignment with automatically generated grammatical features versus a PCA baseline on activations in Gemma-3-270M.

Significance. If the alignments reflect mechanistically relevant directions rather than data-distribution artifacts, this would supply a gradient-based, largely unsupervised handle for feature identification that complements activation-space methods such as PCA. The cross-model empirical scope (synthetic arithmetic to pretrained LLM) and the temporal link to grokking are concrete strengths that, if statistically grounded, could influence mechanistic interpretability practice.

major comments (2)
  1. §3.3 (Gemma-3-270M): the reported outperformance of eNTK eigendirections over PCA on grammatical features is stated without the precise alignment metric (e.g., average cosine similarity, principal-angle overlap, or projection norm), the number of top-k eigenspaces or features compared, or any statistical significance tests or multiple-comparison corrections. These omissions make the quantitative superiority claim difficult to assess and are load-bearing for the central empirical argument.
  2. §3.1–§3.2 (modular-addition experiments): the manuscript interprets alignment between top eNTK eigenspaces and Fourier features as evidence that eNTK surfaces directions the network actually employs for computation. No causal test (e.g., ablating the identified directions and measuring task-performance degradation, or comparing against random subspaces of equal dimension) is reported; the observed overlaps could arise from the eNTK definition (inner product of gradients) preferentially weighting high-sensitivity directions without those directions being mechanistically active in the forward pass.
minor comments (3)
  1. Abstract: quantitative alignment metrics and the exact definition of “alignment” should be stated explicitly rather than left as “align with.”
  2. Figure captions (temporal-alignment plots): include error bars across random seeds and state the number of independent training runs used to generate the derivative peak near grokking.
  3. Methods: specify the exact procedure for computing the empirical NTK (full-batch vs. mini-batch, any low-rank approximations, and the precise dataset subset used for Gemma-3-270M).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comments highlight important areas for clarifying our quantitative claims and strengthening the interpretation of the reported alignments. We address each major comment below and indicate the corresponding revisions.

Point-by-point responses
  1. Referee: §3.3 (Gemma-3-270M): the reported outperformance of eNTK eigendirections over PCA on grammatical features is stated without the precise alignment metric (e.g., average cosine similarity, principal-angle overlap, or projection norm), the number of top-k eigenspaces or features compared, or any statistical significance tests or multiple-comparison corrections. These omissions make the quantitative superiority claim difficult to assess and are load-bearing for the central empirical argument.

    Authors: We agree that the original manuscript omitted key details needed to evaluate the quantitative claim. In the revised version we will specify that alignment is measured by average cosine similarity, that we compare the top-10 eNTK eigendirections against the automatically generated grammatical feature directions, and that statistical significance is assessed via a permutation test (1,000 shuffles) with Bonferroni correction for multiple comparisons. These additions will make the reported outperformance over the PCA baseline directly verifiable.

    Revision: yes
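
The promised procedure (cosine-similarity alignment, a 1,000-shuffle permutation test, Bonferroni correction) could look roughly like the sketch below. Several choices are assumptions, since the rebuttal does not specify them: each feature is scored by its largest absolute cosine with any of the directions, the null distribution comes from shuffling the feature's coordinates, and all data are random stand-ins. The helper names are hypothetical.

```python
import numpy as np

def max_abs_cosine(directions, feature):
    """Largest |cosine| between `feature` and any row of `directions`."""
    D = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    f = feature / np.linalg.norm(feature)
    return float(np.max(np.abs(D @ f)))

def permutation_pvalue(directions, feature, n_shuffles=1000, seed=0):
    """One-sided p-value for the observed score under coordinate shuffling."""
    rng = np.random.default_rng(seed)
    observed = max_abs_cosine(directions, feature)
    null = np.array([max_abs_cosine(directions, rng.permutation(feature))
                     for _ in range(n_shuffles)])
    # Add-one correction keeps the estimate strictly positive.
    return (1 + np.sum(null >= observed)) / (1 + n_shuffles)

def bonferroni(pvalues):
    """Multiply each p-value by the number of tests and cap at 1."""
    p = np.asarray(pvalues, dtype=float)
    return np.minimum(1.0, p * len(p))

rng = np.random.default_rng(1)
directions = rng.normal(size=(10, 64))                 # stand-in for top-10 eigendirections
aligned = directions[0] + 0.05 * rng.normal(size=64)   # feature close to one direction
unrelated = rng.normal(size=64)                        # feature with no built-in alignment

raw = [permutation_pvalue(directions, f) for f in (aligned, unrelated)]
adjusted = bonferroni(raw)
print(adjusted)
```

The same scoring function would be applied to the PCA directions, so that the two methods are compared under an identical null.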

  2. Referee: §3.1–§3.2 (modular-addition experiments): the manuscript interprets alignment between top eNTK eigenspaces and Fourier features as evidence that eNTK surfaces directions the network actually employs for computation. No causal test (e.g., ablating the identified directions and measuring task-performance degradation, or comparing against random subspaces of equal dimension) is reported; the observed overlaps could arise from the eNTK definition (inner product of gradients) preferentially weighting high-sensitivity directions without those directions being mechanistically active in the forward pass.

    Authors: The referee correctly identifies that our current evidence is correlational rather than causal. In the modular-addition settings the Fourier features are independently known to implement the task (as established by prior grokking literature), and the temporal alignment with the grokking transition supplies additional support. To address the possibility of an eNTK-specific artifact, the revision will include a direct comparison of observed alignments against alignments obtained with random subspaces of identical dimension. Explicit ablation of the identified directions lies outside the scope of the present study but remains a natural direction for future work.

    Revision: partial
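
A random-subspace baseline of the kind described can be sketched as follows. The overlap score (normalized squared Frobenius norm of the product of orthonormal bases), the dimensions, and the hand-built "observed" subspace are all assumptions for illustration; no numbers here come from the paper.

```python
import numpy as np

def overlap(A, B):
    """Normalized subspace overlap in [0, 1] between span(A) and span(B)."""
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    return float(np.sum((Qa.T @ Qb) ** 2) / min(Qa.shape[1], Qb.shape[1]))

def random_subspace_baseline(target, dim, n_draws=200, seed=0):
    """Overlaps between `target` and random subspaces of dimension `dim`."""
    rng = np.random.default_rng(seed)
    n = target.shape[0]
    return np.array([overlap(rng.normal(size=(n, dim)), target)
                     for _ in range(n_draws)])

rng = np.random.default_rng(2)
n, dim = 113, 4
target = rng.normal(size=(n, dim))       # stand-in for a ground-truth feature subspace
# Stand-in for an observed eigenspace: the target span, re-mixed and slightly perturbed.
observed_space = target @ rng.normal(size=(dim, dim)) + 0.01 * rng.normal(size=(n, dim))

observed = overlap(observed_space, target)
baseline = random_subspace_baseline(target, dim)
print(observed, baseline.mean(), baseline.max())
```

For equal dimensions the expected overlap of a random subspace is `dim / n`, so an observed value far above the baseline's maximum indicates alignment beyond chance, though it still says nothing about causal use.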

Circularity Check

0 steps flagged

No circularity: empirical alignments measured against external labels

The paper reports direct empirical measurements of subspace overlap between top eNTK eigenvectors and independently supplied ground-truth Fourier modes (modular addition) or auto-generated grammatical directions (Gemma-3-270M). These alignments are computed from the trained model and compared to external references; they are not obtained by fitting parameters inside the paper and then relabeling the fit as a prediction. No derivation chain, self-citation, or ansatz reduces the reported overlap scores to quantities defined by the paper's own inputs. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work rests on the standard assumption that the empirical NTK is a useful local linearization of the network and that top eigenvectors of this matrix can be meaningfully compared to feature directions. No new free parameters or invented entities are introduced in the abstract.

axioms (1)
  • Domain assumption: The empirical NTK approximates the tangent kernel of the trained network at the current parameters.
    Standard background assumption in the NTK literature, invoked to justify eigenanalysis.
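
For reference, the local linearization behind this assumption can be written out. The notation below ($f$ for the network output, $\theta$ for the parameters, $\eta$ for a learning rate, $\mathcal{L}$ for the loss, $x_i$ for training points) is introduced here for illustration and is not taken from the paper; the identities are the standard first-order ones.

```latex
% First-order expansion of the network output around the current parameters
f(x;\theta + \delta\theta) \approx f(x;\theta) + \nabla_\theta f(x;\theta)^{\top} \delta\theta

% Empirical NTK: inner product of parameter gradients at two inputs
K(x, x') = \nabla_\theta f(x;\theta)^{\top} \, \nabla_\theta f(x';\theta)

% One small gradient-descent step changes the output through the kernel
\Delta f(x) \approx -\eta \sum_{i} K(x, x_i) \, \frac{\partial \mathcal{L}}{\partial f(x_i)}

% Eigenanalysis of the kernel matrix on the dataset
K = \sum_{j} \lambda_j \, v_j v_j^{\top}, \qquad \lambda_1 \ge \lambda_2 \ge \dots \ge 0
```

Under this approximation, the leading eigenvectors $v_j$ pick out the combinations of data points along which a parameter update changes the outputs most.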
