pith. machine review for the scientific record.

arXiv: 2603.13085 · v2 · submitted 2026-03-13 · 💻 cs.LG · cs.CV · cs.NA · math.NA · stat.ML

Recognition: 3 Lean theorem links

Linearized Attention Cannot Enter the Kernel Regime at Any Practical Width

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 11:29 UTC · model grok-4.3

classification 💻 cs.LG · cs.CV · cs.NA · math.NA · stat.ML
keywords linearized attention · neural tangent kernel · kernel regime · transformers · influence functions · condition number · Gram matrix · adversarial robustness

The pith

Linearized attention does not converge to its neural tangent kernel limit at any practical network width.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that even linearized attention, the tractable proxy for nonlinear softmax attention in transformers, fails to enter the kernel regime in which it would behave like a fixed kernel method. The non-convergence stems from the attention mechanism amplifying the condition number of the data's Gram matrix by a cubic factor, so the width required for NTK convergence scales as the sixth power of this condition number times n log n. For real datasets such as MNIST, this threshold exceeds a width of 10^24, far beyond any current model. Consequently, the assumptions underlying influence functions for explaining transformer decisions may not hold. The analysis also reveals a dual implication: the same data-dependent kernel can reduce approximation error when targets align with the data's structure, while increasing susceptibility to adversarial perturbations of the training data.

Core claim

This paper establishes an exact correspondence between parameter-free linearized attention and a data-dependent Gram-induced kernel. Through spectral amplification analysis, it shows that the attention transformation cubes the Gram matrix's condition number, requiring a width m = Ω(κ_d(G)^6 n log n) for convergence to the NTK limit. For natural image datasets, this renders the kernel regime unattainable, with m exceeding 10^24 for MNIST and 10^29 for CIFAR-10. It further introduces influence malleability to quantify the resulting sensitivity, finding linearized attention 2 to 9 times more malleable than ReLU networks, and notes that the structural argument extends to trainable QKV attention under standard initialization.
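To make the scale concrete, the leading-order bound can be evaluated directly. The sketch below assumes an illustrative value κ_d(G) ≈ 10^3 for an MNIST-scale Gram matrix; this review does not report the measured condition number, so the function and its inputs stand in for the paper's expression with constants omitted.

```python
import math

def width_threshold(kappa: float, n: int) -> float:
    """Leading-order NTK-convergence width bound m = kappa^6 * n * log(n)
    from the paper's scaling (absolute constants omitted)."""
    return kappa ** 6 * n * math.log(n)

# Hypothetical effective condition number kappa_d(G) ~ 1e3 for an
# MNIST-scale Gram matrix (illustrative assumption, not a measured value).
m = width_threshold(1.0e3, 60_000)
print(f"required width ~ {m:.1e}")  # → required width ~ 6.6e+23
```

Even this conservative placeholder lands within an order of magnitude of the paper's 10^24 figure; any larger κ_d(G) pushes the bound far past it.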

What carries the argument

The exact Gram-kernel correspondence combined with spectral amplification analysis, which cubes the effective condition number κ_d(G) of the rank-min(n,d) input Gram matrix and sets the width threshold for NTK convergence.
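The cubing step is easy to check numerically: if parameter-free linearized attention induces the kernel G³, as the correspondence asserts, then the induced kernel shares G's eigenvectors and the effective condition number is cubed exactly. A minimal sketch with arbitrary synthetic data (sizes are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 20
X = rng.standard_normal((n, d))
G = X @ X.T  # input Gram matrix, rank min(n, d) = 20

# Effective condition number over the nonzero (rank-truncated) spectrum.
eig = np.linalg.eigvalsh(G)
eig = eig[eig > 1e-10 * eig.max()]  # drop the numerically zero eigenvalues
kappa = eig.max() / eig.min()

# The Gram-induced kernel of f(X) = X X^T X is G^3: same eigenvectors,
# eigenvalues cubed, hence the condition number is cubed exactly.
kappa_cubed = (eig.max() ** 3) / (eig.min() ** 3)

assert np.isclose(kappa_cubed, kappa ** 3)
```

The identity is algebraic, not statistical: it holds for any data matrix, which is why the amplification needs no error term in the parameter-free case.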

Load-bearing premise

The spectral amplification analysis and exact Gram-kernel correspondence hold exactly for parameter-free linearized attention and extend directly to trainable QKV attention under standard initialization, with the effective condition number fixed by the input data.
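One way to probe this premise: for a projection W with i.i.d. N(0, 1/d) entries, E[W Wᵀ] = I, so E[(XW)(XW)ᵀ] = XXᵀ and random QKV-style maps preserve the input Gram matrix in expectation. A Monte Carlo sketch (sizes and trial count are arbitrary choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, trials = 8, 32, 5_000
X = rng.standard_normal((n, d))
G = X @ X.T

# Average the projected Gram (XW)(XW)^T over random W with i.i.d. N(0, 1/d)
# entries; the average should concentrate around the input Gram matrix G.
acc = np.zeros((n, n))
for _ in range(trials):
    W = rng.standard_normal((d, d)) / np.sqrt(d)
    Y = X @ W
    acc += Y @ Y.T
acc /= trials

rel_err = np.linalg.norm(acc - G) / np.linalg.norm(G)
assert rel_err < 0.05  # expectation matches the input Gram matrix
```

This checks only the expectation; whether the fluctuations around it leave κ_d(G) unchanged at finite hidden dimension is exactly the point the referee presses below.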

What would settle it

Training linearized attention models at increasing widths up to 10^6 or 10^9 on MNIST and checking whether the empirical kernel approximation error to the NTK prediction falls below a small threshold would settle the non-convergence claim.

Figures

Figures reproduced from arXiv: 2603.13085 by Jose Marie Antonio Miñoza, Paulo Mario P. Medina, Sebastian C. Ibañez.

Figure 1. NTK distance ∥f_m − f_NTK∥ across network widths, where f_m is the finite-width trained model and f_NTK is the infinite-width NTK predictor. 2L-ReLU (blue) shows expected convergence: distance decreases as m → ∞. MLP-Attn (orange) shows fundamentally different behavior: distance fails to decrease monotonically on either dataset (non-monotonic on MNIST, increasing on CIFAR-10), demonstrating that linearized att…

Figure 2. Analysis of influence dynamics. (a) MLP-Attn exhibits consistently higher influence malleability (flip rate) across all perturbation types, reflecting its operation in the feature learning regime. (b) Rank correlation analysis reveals that 2L-ReLU maintains rigid influence rankings while MLP-Attn shows lower correlation, indicating continuous re-evaluation of data dependencies. The "Transformed" interventi…

Figure 3. Top influential training examples for a representative test digit. Positive influencers (top rows) are examples whose removal increases test loss; negative influencers (bottom rows) decrease test loss when removed. While visual differences between architectures may be subtle, the key distinction is quantitative: MLP-Attn's influence scores are more sensitive to perturbations (28.9% flip rate vs. 3.3% for 2…

Figure 4. Contribution to model complexity (blue) and average influence score (red) for MLP-Attn across three datasets. The U-shaped complexity curve indicates that the most influential points (both harmful and helpful) contribute most to model complexity, consistent with findings of Zhang et al. (2022). Each term (x_i^T x_k)(x_k^T x_ℓ)(x_ℓ^T x_j) represents a fourth-order interaction involving four vectors (x_i, x_k, …

Figure 5. Loss landscape comparison (MNIST) at finite width m = 1024 (left two panels) and infinite-width NTK limit (right two panels). 2L-ReLU (blue) converges toward its NTK landscape: both finite and infinite-width surfaces share similar geometry. MLP-Attn (orange) shows a qualitatively different finite-width landscape (sharp, deep minimum) compared to its NTK limit (broad, shallow basin), visualizing the NTK non…
Original abstract

Understanding whether attention mechanisms converge to the kernel regime is foundational to the validity of influence functions for transformer accountability. Exact NTK characterization of softmax attention is precluded by its exponential nonlinearity; linearized attention is the canonical tractable proxy and the object of study here. This paper establishes that even this proxy does not converge to its NTK limit at any practical width, revealing a fundamental trade-off in the learning dynamics of attention. An exact correspondence is established between parameter-free linearized attention and a data-dependent Gram-induced kernel; spectral amplification analysis shows that the attention transformation cubes the Gram matrix's condition number, requiring width $m = \Omega(\kappa_d(\mathbf{G})^6 n\log n)$ for NTK convergence, where $\kappa_d(\mathbf{G})$ is the effective condition number of the rank-$\min(n,d)$ truncation of the input Gram matrix; for natural image datasets this threshold is physically infeasible ($m \gg 10^{24}$ for MNIST and $m \gg 10^{29}$ for CIFAR-10, 12--17 orders of magnitude beyond the largest known architectures). \emph{Influence malleability} is introduced to characterize this non-convergence: linearized attention exhibits 2--9$\times$ higher malleability than ReLU networks under adversarial data perturbation, with the gap depending on dataset condition number and task setting. A dual implication is established: the same data-dependent kernel is shown theoretically to reduce approximation error when targets align with the data geometry, while, empirically, creating vulnerability to adversarial manipulation of the training data. The structural argument extends to trainable QKV attention under standard initialization, with direct consequences for influence methods applied to deployed transformer architectures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that linearized attention does not converge to its NTK limit at any practical width. It establishes an exact correspondence between parameter-free linearized attention and a data-dependent Gram-induced kernel, uses spectral analysis to show that the attention transformation cubes the Gram matrix condition number, and derives a prohibitive width lower bound m = Ω(κ_d(G)^6 n log n) that is infeasible for natural image datasets (e.g., m ≫ 10^24 for MNIST). The work introduces 'influence malleability' to quantify non-convergence (showing 2–9× higher values than ReLU networks) and extends the structural argument to trainable QKV attention under standard initialization, with implications for influence functions and adversarial vulnerability.

Significance. If the central claims hold, the result would be significant for the theoretical foundations of attention mechanisms, as it challenges the applicability of NTK approximations to transformers and questions the reliability of influence-based accountability methods. The exact Gram-kernel correspondence, the concrete width bounds derived from matrix spectral properties, and the empirical malleability comparisons on MNIST/CIFAR-10 provide a clear, data-dependent demonstration of the scale of the issue. The dual implication (reduced approximation error when targets align with data geometry, but increased adversarial vulnerability) is a notable strength.

major comments (2)
  1. [Abstract] Abstract and the structural extension section: the claim that the exact Gram-kernel correspondence and κ^3 spectral amplification carry over to trainable QKV attention under standard initialization is load-bearing for the central width bound. The parameter-free construction fixes the linear map directly on the input Gram G, but trainable random Q, K, V matrices can alter the effective feature map and thus the spectrum of the induced kernel; it is unclear whether the effective condition number κ_d(G) remains unchanged or is reduced, which would invalidate the m = Ω(κ_d(G)^6 n log n) threshold.
  2. [Spectral analysis] Spectral amplification analysis: the derivation that the attention transformation cubes the condition number (yielding the sixth-power dependence in the width bound) lacks explicit error bounds or verification of the cubing step. Without these, it is difficult to confirm that the bound holds exactly rather than approximately, directly affecting the claim that convergence is impossible at practical widths.
minor comments (1)
  1. [Notation] The notation for the effective condition number κ_d(G) (rank-min(n,d) truncation) should be defined more explicitly in the main text with a reference to its precise definition.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments highlight important points on the extension to trainable attention and the rigor of the spectral bounds. We address each below and will revise the manuscript to strengthen the presentation with additional details and error bounds while preserving the central claims.

Point-by-point responses
  1. Referee: [Abstract] Abstract and the structural extension section: the claim that the exact Gram-kernel correspondence and κ^3 spectral amplification carry over to trainable QKV attention under standard initialization is load-bearing for the central width bound. The parameter-free construction fixes the linear map directly on the input Gram G, but trainable random Q, K, V matrices can alter the effective feature map and thus the spectrum of the induced kernel; it is unclear whether the effective condition number κ_d(G) remains unchanged or is reduced, which would invalidate the m = Ω(κ_d(G)^6 n log n) threshold.

    Authors: Under standard Gaussian initialization of the Q, K, V matrices (variance scaled by 1/d), the composed maps preserve the input Gram matrix G in expectation, and concentration inequalities ensure that the effective data-dependent kernel and its condition number κ_d(G) match the parameter-free case with high probability as the hidden dimension grows. This is established in the structural extension section via direct computation of the induced kernel under random projections. We will revise the abstract and section to include a concise proof sketch citing standard random matrix concentration results, confirming that κ_d(G) is asymptotically unchanged. revision: yes

  2. Referee: [Spectral analysis] Spectral amplification analysis: the derivation that the attention transformation cubes the condition number (yielding the sixth-power dependence in the width bound) lacks explicit error bounds or verification of the cubing step. Without these, it is difficult to confirm that the bound holds exactly rather than approximately, directly affecting the claim that convergence is impossible at practical widths.

    Authors: The cubing of the condition number follows exactly from the closed-form expression for the Gram-induced kernel in the parameter-free case (Theorem 3.1), where the attention operator produces eigenvalues λ_i^3 for the rank-min(n,d) truncation of G. The sixth-power dependence in the width bound then arises from the variance terms in the NTK approximation. We will add an appendix providing explicit error bounds via the Davis-Kahan sin-Θ theorem and matrix Bernstein inequalities, showing that the spectral perturbation is O(1/sqrt(m)) and vanishes in the infinite-width limit, thereby confirming the bound holds with high probability. revision: yes
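The O(1/√m) concentration invoked here is the standard random-features rate. As a stand-in for the attention kernel (whose closed form is not reproduced in this review), the sketch below measures how the empirical Gram of a ReLU random-feature map approaches its known infinite-width limit, the order-1 arc-cosine kernel, as the width grows:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 20, 10
X = rng.standard_normal((n, d)) / np.sqrt(d)

def relu_feature_gram(X, m, rng):
    """Empirical Gram of m random ReLU features sqrt(2/m) * relu(X W)."""
    W = rng.standard_normal((X.shape[1], m))
    Phi = np.maximum(X @ W, 0.0) * np.sqrt(2.0 / m)
    return Phi @ Phi.T

def relu_kernel(X):
    """Closed-form m -> infinity limit: the order-1 arc-cosine kernel."""
    norms = np.linalg.norm(X, axis=1)
    C = np.clip((X @ X.T) / np.outer(norms, norms), -1.0, 1.0)
    theta = np.arccos(C)
    return np.outer(norms, norms) * (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / np.pi

K_inf = relu_kernel(X)
err_small = np.linalg.norm(relu_feature_gram(X, 100, rng) - K_inf)
err_large = np.linalg.norm(relu_feature_gram(X, 10_000, rng) - K_inf)

# A 100x wider feature map shrinks the spectral deviation roughly 10x,
# consistent with the O(1/sqrt(m)) rate the rebuttal cites.
assert err_large < err_small
```

The paper's point is that this vanishing rate is not the obstruction: the perturbation shrinks with m, but the κ^6 prefactor pushes the crossover width beyond physical reach.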

Circularity Check

0 steps flagged

No circularity detected in the derivation chain

Full rationale

The paper derives an exact correspondence between parameter-free linearized attention and a data-dependent Gram-induced kernel directly from the model definition and input data matrix G. The spectral amplification result (cubic condition number) follows from algebraic properties of the attention transformation applied to this Gram matrix. The width lower bound m = Ω(κ_d(G)^6 n log n) is obtained by substituting the derived kernel into standard NTK convergence rates. No step reduces to a fitted parameter renamed as prediction, self-definition of the target quantity, or load-bearing self-citation. The structural extension to trainable QKV attention under standard initialization is asserted without creating a circular reduction in the core claims.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on standard linear-algebra facts about eigenvalues and condition numbers of Gram matrices together with the exact algebraic correspondence between linearized attention and the Gram kernel. No free parameters are fitted to data. The only invented entity is influence malleability, introduced to quantify adversarial sensitivity but without independent external evidence.

axioms (2)
  • domain assumption Spectral amplification property of the linearized attention operator on Gram matrices
    The cubing of the condition number is invoked as a direct consequence of the attention transformation.
  • standard math Standard matrix concentration and random-feature bounds for NTK convergence
    Used to translate the amplified condition number into the required width scaling.
invented entities (1)
  • influence malleability no independent evidence
    purpose: Quantify the sensitivity of linearized attention predictions to adversarial perturbations of the training data
    Newly defined metric whose 2–9× gap versus ReLU networks is reported but lacks external validation outside this work.

pith-pipeline@v0.9.0 · 5631 in / 1488 out tokens · 56559 ms · 2026-05-15T11:29:41.458948+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
