
arxiv: 2605.17606 · v2 · pith:KJ54URAYnew · submitted 2026-05-17 · 💻 cs.LG

The Neural Tangent Kernel for Classification

Pith reviewed 2026-05-20 14:21 UTC · model grok-4.3

classification 💻 cs.LG
keywords: neural tangent kernel, lazy training, classification, cross-entropy loss, wide neural networks, regularization, linearized model, model uncertainty

The pith

Wide neural networks trained with cross-entropy loss remain in the lazy training regime when parameter-space regularization is used or when every class has strictly positive target probability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the Neural Tangent Kernel stays approximately constant throughout training for wide networks on classification tasks under either of two conditions. With parameter-space regularization, the NTK stays constant under cross-entropy loss. Without regularization, the same regime holds when targets are non-degenerate, that is, when every class has strictly positive probability. A sympathetic reader would care because these conditions let the full nonlinear training be replaced by an explicit linear model whose solution is written directly in terms of the NTK, extending kernel-style analysis and uncertainty estimates to practical classification settings.

Core claim

We show that parameter-space regularization ensures a constant NTK during training for cross-entropy loss, while in the absence of regularization the regime is recovered when targets are non-degenerate, i.e. when all classes have strictly positive probability. Under these conditions, training is well-approximated by the linearized model, yielding an explicit characterization of the solution in terms of the NTK. We further analyze the distribution of trained predictors induced by random initialization and relate this notion of model uncertainty to Bayesian methods.

What carries the argument

The infinite-width Neural Tangent Kernel, which remains constant under the stated conditions and thereby lets the entire training trajectory be replaced by a linear model whose solution is expressed directly through the kernel.
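For orientation, the objects named above can be written out in standard NTK notation. This is a sketch under assumed notation, not necessarily the paper's: $f_\theta(x)$ denotes the pre-softmax (logit) output, $\theta_0$ the parameters at initialization, $\ell$ the per-example loss, and $(x_i, y_i)$ the training data. The third line is the usual gradient-flow evolution of the outputs, stated up to learning-rate scaling.

```latex
% Assumed notation (illustrative): f_\theta = logits, \theta_0 = initial parameters.
\begin{align}
  \Theta_\theta(x, x') &= \nabla_\theta f_\theta(x)\, \nabla_\theta f_\theta(x')^\top
    && \text{(empirical NTK)} \\
  f^{\mathrm{lin}}_\theta(x) &= f_{\theta_0}(x) + \nabla_\theta f_{\theta_0}(x)\,(\theta - \theta_0)
    && \text{(linearized model)} \\
  \partial_t f_{\theta_t}(x) &= -\sum_{i} \Theta_{\theta_t}(x, x_i)\,
    \nabla_f \ell\bigl(f_{\theta_t}(x_i), y_i\bigr)
    && \text{(gradient flow in function space)}
\end{align}
```

When $\Theta_{\theta_t}$ stays close to $\Theta_{\theta_0}$ for all $t$, the third line becomes an ODE driven by a fixed kernel, which is what allows the trained predictor to be described through the kernel alone.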

If this is right

  • Training dynamics on classification are well approximated by a linear model in parameter space.
  • The final predictor admits an explicit closed-form characterization written in terms of the NTK.
  • The distribution of predictors arising from different random initializations connects directly to Bayesian posterior uncertainty.
  • The same constant-NTK regime that was previously known only for regression losses now applies to cross-entropy classification.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The explicit NTK solution could be used to derive generalization bounds for classification without running full gradient descent.
  • Adding modest parameter regularization may be a practical way to keep wide networks inside the theoretically tractable lazy regime.
  • The same non-degeneracy condition on targets might allow NTK analysis for other nonlinear output transformations beyond cross-entropy.

Load-bearing premise

The networks are wide enough for the infinite-width NTK limit to hold and for random initialization to place the network in the linearization regime that persists for the whole training run.

What would settle it

Train a sufficiently wide network on a multi-class dataset with cross-entropy loss, measure the NTK at initialization and after many steps, and check whether it stays approximately constant when weight decay is present or when every class has positive probability in the targets.
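The check described above can be sketched on a toy problem. The following is a minimal illustration and not the paper's experimental setup: the two-layer tanh network, its width, the learning rate, the four data points, and every function name are assumptions chosen for brevity. It uses two classes with a single logit and soft targets strictly between 0 and 1, which corresponds to the unregularized, non-degenerate case; the regularized case is not shown.

```python
import math
import random


def init_params(width, seed=0):
    """Standard-normal weights for f(x) = (1/sqrt(m)) * sum_j a_j * tanh(w_j * x)."""
    rng = random.Random(seed)
    w = [rng.gauss(0.0, 1.0) for _ in range(width)]
    a = [rng.gauss(0.0, 1.0) for _ in range(width)]
    return w, a


def logit(params, x):
    """Scalar logit of the toy network (hypothetical architecture)."""
    w, a = params
    return sum(aj * math.tanh(wj * x) for wj, aj in zip(w, a)) / math.sqrt(len(w))


def grad_logit(params, x):
    """Gradient of the logit with respect to (w, a), computed analytically."""
    w, a = params
    s = 1.0 / math.sqrt(len(w))
    dw = [s * aj * x * (1.0 - math.tanh(wj * x) ** 2) for wj, aj in zip(w, a)]
    da = [s * math.tanh(wj * x) for wj in w]
    return dw, da


def empirical_ntk(params, x1, x2):
    """Empirical NTK: inner product of the parameter gradients at x1 and x2."""
    dw1, da1 = grad_logit(params, x1)
    dw2, da2 = grad_logit(params, x2)
    return sum(p * q for p, q in zip(dw1, dw2)) + sum(p * q for p, q in zip(da1, da2))


def train(params, xs, ys, lr=0.5, steps=100):
    """Full-batch gradient descent on binary cross-entropy with soft targets ys."""
    w, a = list(params[0]), list(params[1])
    n = len(xs)
    for _ in range(steps):
        gw = [0.0] * len(w)
        ga = [0.0] * len(a)
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-logit((w, a), x)))
            dw, da = grad_logit((w, a), x)
            err = (p - y) / n  # derivative of cross-entropy w.r.t. the logit, batch-averaged
            gw = [g + err * d for g, d in zip(gw, dw)]
            ga = [g + err * d for g, d in zip(ga, da)]
        w = [wj - lr * g for wj, g in zip(w, gw)]
        a = [aj - lr * g for aj, g in zip(a, ga)]
    return w, a


def relative_ntk_change(before, after, x):
    """Relative change of the kernel diagonal entry at probe point x."""
    k0 = empirical_ntk(before, x, x)
    k1 = empirical_ntk(after, x, x)
    return abs(k1 - k0) / abs(k0)


if __name__ == "__main__":
    xs = [-1.0, -0.5, 0.5, 1.0]
    ys = [0.1, 0.1, 0.9, 0.9]  # soft targets: both classes keep positive probability
    theta0 = init_params(width=256, seed=0)
    theta_t = train(theta0, xs, ys)
    print("relative NTK change at x=1:", relative_ntk_change(theta0, theta_t, 1.0))
```

Repeating the run at several widths, and with targets pushed toward 0 and 1, would show whether the relative change shrinks with width and grows as the targets become degenerate, which is the behaviour the claim predicts.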

Figures

Figures reproduced from arXiv: 2605.17606 by Alvaro Cartea, Jonathan Plenk, Kamil Ciosek, Mark van der Wilk, Sergio Calvo-Ordonez, Yarin Gal.

Figure 1: 1d-classification with 3 classes. Left: An ensemble over wide networks, starting from different parameter initializations. Right: The infinite-width limit of the ensemble, using the function-space ODE.
Figure 2: Blue: Pre-softmax NTK (constant). Red: Post-softmax NTK (not constant).
Figure 3: The NTK for MNIST classification does not diverge when using label smoothing or …
Figure 4: The function-space Brier score with a regularizer can have multiple stationary points.
Original abstract

In wide neural networks, the Neural Tangent Kernel (NTK) remains approximately constant during training, providing a powerful theoretical tool for studying training dynamics, generalization, and connections to kernel methods. However, this theory is largely restricted to regression losses. It was previously thought that training on a classification loss, or more generally losses involving nonlinear output transformations, breaks this property, leading to divergent logits and a breakdown of the linearization. In this paper, we extend NTK theory to classification by identifying conditions under which wide neural networks remain in the lazy training regime. We show that parameter-space regularization ensures a constant NTK during training for cross-entropy loss, while in the absence of regularization the regime is recovered when targets are non-degenerate, i.e. when all classes have strictly positive probability. Under these conditions, training is well-approximated by the linearized model, yielding an explicit characterization of the solution in terms of the NTK. We further analyze the distribution of trained predictors induced by random initialization and relate this notion of model uncertainty to Bayesian methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript extends Neural Tangent Kernel (NTK) theory from regression to classification with cross-entropy loss. It shows that parameter-space regularization keeps the NTK constant during training of wide networks. Without regularization, the lazy regime holds when targets are non-degenerate (all classes have strictly positive probability). Under these conditions, dynamics are approximated by the linearized model, yielding an explicit NTK-based solution characterization. The work also analyzes the distribution of predictors induced by random initialization and relates it to Bayesian methods.

Significance. If the derivations and conditions are rigorously established, the result would be a meaningful advance by bringing NTK analysis to classification, a dominant practical setting. The explicit solution form and initialization-induced uncertainty analysis supply concrete, testable predictions that connect to kernel methods and Bayesian views. The non-degenerate-target restriction is a genuine limitation on scope but does not invalidate the conditional claims.

major comments (2)
  1. [§4] §4, Definition 4.1 and surrounding text: the non-degenerate targets condition (strictly positive probability for every class) excludes standard one-hot labels. The abstract and title frame the contribution as addressing classification, yet the unregularized case therefore applies only to softened targets; the manuscript should either prove an approximate constancy result in the limit of vanishing probabilities or supply empirical evidence that the NTK remains nearly constant for near-degenerate targets.
  2. [§3.2] §3.2, Eq. (12): the constancy proof for the regularized cross-entropy case is derived under continuous-time gradient flow. It is unclear whether the same invariance holds for the discrete SGD steps used in all reported experiments; a short perturbation argument or numerical check would strengthen the claim that the NTK remains constant throughout practical training.
minor comments (2)
  1. [Introduction] Introduction, paragraph 3: the phrase 'non-degenerate, i.e., when all classes have strictly positive probability' could be accompanied by a one-sentence remark on how this relates to label smoothing or temperature scaling commonly used in practice.
  2. [Figure 2] Figure 2 caption: the experimental details (network width, depth, dataset, and regularization strength) are missing; adding them would make the NTK-evolution plots easier to interpret.
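
Minor comment 1 asks how the non-degeneracy condition relates to label smoothing. A minimal sketch of that relationship follows, assuming the common uniform-mixture form of label smoothing; the function names and the value eps=0.1 are illustrative and not taken from the paper. Mixing a one-hot label with the uniform distribution gives every class strictly positive probability, so the smoothed target satisfies the condition while the one-hot target does not.

```python
def smooth_labels(label, num_classes, eps=0.1):
    """Mix a one-hot target with the uniform distribution over num_classes."""
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie strictly between 0 and 1")
    uniform = eps / num_classes
    return [
        (1.0 - eps) * (1.0 if k == label else 0.0) + uniform
        for k in range(num_classes)
    ]


def is_non_degenerate(target):
    """True when every class has strictly positive probability."""
    return all(p > 0.0 for p in target)


one_hot = [0.0, 1.0, 0.0]
smoothed = smooth_labels(label=1, num_classes=3, eps=0.1)

print(is_non_degenerate(one_hot))   # one-hot labels contain zeros
print(is_non_degenerate(smoothed))  # smoothed labels are strictly positive
```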

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below and describe the revisions we will incorporate.

Point-by-point responses
  1. Referee: [§4] §4, Definition 4.1 and surrounding text: the non-degenerate targets condition (strictly positive probability for every class) excludes standard one-hot labels. The abstract and title frame the contribution as addressing classification, yet the unregularized case therefore applies only to softened targets; the manuscript should either prove an approximate constancy result in the limit of vanishing probabilities or supply empirical evidence that the NTK remains nearly constant for near-degenerate targets.

    Authors: The non-degenerate target condition is required for the exact invariance result in the unregularized case, as the proof shows that degenerate targets (such as one-hot labels) can cause logits to diverge and break constancy. This is a genuine scope limitation for the unregularized setting, which we already state explicitly. The regularized case holds without this restriction. To strengthen the manuscript, we will add a new subsection with numerical experiments demonstrating that the NTK remains nearly constant (within 1-2% relative change) for targets with small positive probabilities down to 10^{-6}, approximating one-hot labels. We will also clarify in the introduction that practical classification often employs label smoothing, which satisfies the non-degenerate condition.

    Revision: partial

  2. Referee: [§3.2] §3.2, Eq. (12): the constancy proof for the regularized cross-entropy case is derived under continuous-time gradient flow. It is unclear whether the same invariance holds for the discrete SGD steps used in all reported experiments; a short perturbation argument or numerical check would strengthen the claim that the NTK remains constant throughout practical training.

    Authors: We agree that the invariance is proven under continuous-time gradient flow. For discrete SGD, we will add a brief perturbation analysis in §3.2 showing that the per-step change in the NTK is O(η), where η is the learning rate; thus for the small step sizes used in practice the NTK remains approximately constant, with the error accumulating linearly in the number of steps but remaining small over typical training horizons. We will also include a numerical check in the experiments confirming that the NTK variation under SGD matches the continuous prediction within measurement noise.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained under stated assumptions

Full rationale

The paper identifies explicit conditions (parameter-space regularization or non-degenerate targets with strictly positive class probabilities) under which the NTK remains constant for cross-entropy loss, then uses the linearized model to characterize the solution. These conditions are derived from the gradient-flow dynamics and Jacobian evolution equations rather than being fitted to data or defined in terms of the target result. No load-bearing self-citations, self-definitional steps, or renamings of known results are present that would make the central claim equivalent to its inputs by construction. The work extends prior NTK regression results with domain-specific analysis that remains independently falsifiable via the infinite-width limit assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on the standard infinite-width NTK limit and random initialization assumptions from prior NTK theory; no new free parameters, ad-hoc axioms, or invented entities are indicated in the abstract.

axioms (1)
  • Domain assumption: Infinite-width limit in which the NTK remains constant during training.
    Core background assumption of NTK theory invoked to justify the lazy regime under the stated conditions.

pith-pipeline@v0.9.0 · 5727 in / 1327 out tokens · 50933 ms · 2026-05-20T14:21:24.704220+00:00 · methodology
