Recognition: 3 theorem links
Lean Theorem · Linearized Attention Cannot Enter the Kernel Regime at Any Practical Width
Pith reviewed 2026-05-15 11:29 UTC · model grok-4.3
The pith
Linearized attention does not converge to its neural tangent kernel limit at any practical network width.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
This paper establishes an exact correspondence between parameter-free linearized attention and a data-dependent Gram-induced kernel. Through spectral amplification analysis, it shows that the attention transformation cubes the Gram matrix's condition number, requiring a width m = Ω(κ_d(G)^6 n log n) for convergence to the NTK limit. For natural image datasets this renders the kernel regime unattainable, with m exceeding 10^24 for MNIST and 10^29 for CIFAR-10. It further introduces influence malleability to quantify the resulting sensitivity, finding linearized attention 2 to 9 times more malleable than ReLU networks, and notes that the structural argument extends to trainable QKV attention under standard initialization.
What carries the argument
The exact Gram-kernel correspondence combined with spectral amplification analysis, which cubes the effective condition number κ_d(G) of the rank-min(n,d) input Gram matrix and sets the width threshold for NTK convergence.
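A minimal numeric sketch of the two quantities this argument turns on, assuming only the definitions above: the Gram matrix G = XXᵀ, its effective condition number κ_d(G) over the top min(n, d) eigenvalues, and the scaling term κ_d(G)^6 · n log n from the width bound. The constant hidden in the Ω(·) is not reported, so only the scaling term is computed, and the data here are synthetic, not MNIST or CIFAR-10.

```python
import numpy as np

def effective_condition_number(X: np.ndarray) -> float:
    """kappa_d(G): ratio of the largest to the smallest retained eigenvalue of the
    rank-min(n, d) truncation of the Gram matrix G = X X^T."""
    n, d = X.shape
    G = X @ X.T
    eigvals = np.linalg.eigvalsh(G)[::-1]        # eigenvalues, descending
    top = eigvals[: min(n, d)]                   # rank-min(n, d) truncation
    top = top[top > 1e-12 * top[0]]              # guard against numerical zeros
    return float(top[0] / top[-1])

def ntk_width_scale(X: np.ndarray) -> float:
    """Scaling term kappa_d(G)^6 * n * log n from the stated width bound
    m = Omega(kappa_d(G)^6 n log n); the hidden constant is not reported."""
    n = X.shape[0]
    return effective_condition_number(X) ** 6 * n * np.log(n)

# Toy illustration on synthetic data (not MNIST or CIFAR-10):
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
print(f"kappa_d(G)  = {effective_condition_number(X):.2e}")
print(f"width scale = {ntk_width_scale(X):.2e}")
```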
Load-bearing premise
The spectral amplification analysis and exact Gram-kernel correspondence hold exactly for parameter-free linearized attention and extend directly to trainable QKV attention under standard initialization, with the effective condition number fixed by the input data.
What would settle it
Training linearized attention models at increasing widths (up to 10^6 or even 10^9) on MNIST and checking whether the empirical kernel approximation error relative to the NTK prediction falls below a small threshold would settle the non-convergence claim.
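A sketch of that experiment's skeleton, under stated assumptions: `kernel_at_width` is a caller-supplied placeholder (the paper does not prescribe an implementation; the empirical NTK would typically be computed by a Jacobian contraction over the model parameters), and the infinite-width limit is taken to be the Gram-induced kernel K = G³ quoted in the Lean-theorem links below.

```python
import numpy as np

def relative_kernel_error(K_m: np.ndarray, K_inf: np.ndarray) -> float:
    """Frobenius-norm relative error between a finite-width empirical kernel
    and the claimed infinite-width limit."""
    return float(np.linalg.norm(K_m - K_inf) / np.linalg.norm(K_inf))

def convergence_sweep(kernel_at_width, X, widths=(2**10, 2**14, 2**18), tol=1e-2):
    """kernel_at_width(m, X) must return the empirical NTK Gram matrix of a
    linearized-attention model of width m; it is a hypothetical, caller-supplied
    helper, not something defined in the paper."""
    G = X @ X.T
    K_inf = np.linalg.matrix_power(G, 3)   # Gram-induced limit kernel, K = G^3
    for m in widths:
        err = relative_kernel_error(kernel_at_width(m, X), K_inf)
        converged = "yes" if err < tol else "no"
        print(f"width {m:>9d}: relative kernel error {err:.3e}  below tol? {converged}")
        # The non-convergence claim predicts err stays well above tol at these widths.
```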
Original abstract
Understanding whether attention mechanisms converge to the kernel regime is foundational to the validity of influence functions for transformer accountability. Exact NTK characterization of softmax attention is precluded by its exponential nonlinearity; linearized attention is the canonical tractable proxy and the object of study here. This paper establishes that even this proxy does not converge to its NTK limit at any practical width, revealing a fundamental trade-off in the learning dynamics of attention. An exact correspondence is established between parameter-free linearized attention and a data-dependent Gram-induced kernel; spectral amplification analysis shows that the attention transformation cubes the Gram matrix's condition number, requiring width $m = \Omega(\kappa_d(\mathbf{G})^6 n\log n)$ for NTK convergence, where $\kappa_d(\mathbf{G})$ is the effective condition number of the rank-$\min(n,d)$ truncation of the input Gram matrix; for natural image datasets this threshold is physically infeasible ($m \gg 10^{24}$ for MNIST and $m \gg 10^{29}$ for CIFAR-10, 12--17 orders of magnitude beyond the largest known architectures). \emph{Influence malleability} is introduced to characterize this non-convergence: linearized attention exhibits 2--9$\times$ higher malleability than ReLU networks under adversarial data perturbation, with the gap depending on dataset condition number and task setting. A dual implication is established: the same data-dependent kernel is shown theoretically to reduce approximation error when targets align with the data geometry, while, empirically, creating vulnerability to adversarial manipulation of the training data. The structural argument extends to trainable QKV attention under standard initialization, with direct consequences for influence methods applied to deployed transformer architectures.
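A back-of-envelope inversion of the stated bound (not from the paper), assuming the hidden constant in Ω(·) is of order one and using n ≈ 6 × 10⁴ MNIST training points, shows how modest an effective condition number already suffices for the quoted infeasibility.

```latex
% Back-of-envelope only: constant in \Omega(\cdot) taken as 1, n \approx 6 \times 10^4 (MNIST).
\[
  m \,\gtrsim\, \kappa_d(\mathbf{G})^{6}\, n \log n
  \quad\Longrightarrow\quad
  \kappa_d(\mathbf{G}) \,\gtrsim\, \Bigl(\tfrac{m}{n\log n}\Bigr)^{1/6}
  \,\approx\, \Bigl(\tfrac{10^{24}}{6\times 10^{4}\cdot 11}\Bigr)^{1/6}
  \,\approx\, 10^{3},
\]
% so an effective condition number of roughly a thousand already pushes the required
% width past the quoted 10^{24} at MNIST scale.
```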
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that linearized attention does not converge to its NTK limit at any practical width. It establishes an exact correspondence between parameter-free linearized attention and a data-dependent Gram-induced kernel, uses spectral analysis to show that the attention transformation cubes the Gram matrix condition number, and derives a prohibitive width lower bound m = Ω(κ_d(G)^6 n log n) that is infeasible for natural image datasets (e.g., m ≫ 10^24 for MNIST). The work introduces 'influence malleability' to quantify non-convergence (showing 2–9× higher values than ReLU networks) and extends the structural argument to trainable QKV attention under standard initialization, with implications for influence functions and adversarial vulnerability.
Significance. If the central claims hold, the result would be significant for the theoretical foundations of attention mechanisms, as it challenges the applicability of NTK approximations to transformers and questions the reliability of influence-based accountability methods. The exact Gram-kernel correspondence, the concrete width bounds derived from matrix spectral properties, and the empirical malleability comparisons on MNIST/CIFAR-10 provide a clear, data-dependent demonstration of the scale of the issue. The dual implication (reduced approximation error when targets align with data geometry, but increased adversarial vulnerability) is a notable strength.
major comments (2)
- [Abstract] Abstract and the structural extension section: the claim that the exact Gram-kernel correspondence and κ^3 spectral amplification carry over to trainable QKV attention under standard initialization is load-bearing for the central width bound. The parameter-free construction fixes the linear map directly on the input Gram G, but trainable random Q, K, V matrices can alter the effective feature map and thus the spectrum of the induced kernel; it is unclear whether the effective condition number κ_d(G) remains unchanged or is reduced, which would invalidate the m = Ω(κ_d(G)^6 n log n) threshold.
- [Spectral analysis] Spectral amplification analysis: the derivation that the attention transformation cubes the condition number (yielding the sixth-power dependence in the width bound) lacks explicit error bounds or verification of the cubing step. Without these, it is difficult to confirm that the bound holds exactly rather than approximately, directly affecting the claim that convergence is impossible at practical widths.
minor comments (1)
- [Notation] The notation for the effective condition number κ_d(G) (rank-min(n,d) truncation) should be defined more explicitly in the main text with a reference to its precise definition.
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review. The comments highlight important points on the extension to trainable attention and the rigor of the spectral bounds. We address each below and will revise the manuscript to strengthen the presentation with additional details and error bounds while preserving the central claims.
Point-by-point responses
-
Referee: [Abstract] Abstract and the structural extension section: the claim that the exact Gram-kernel correspondence and κ^3 spectral amplification carry over to trainable QKV attention under standard initialization is load-bearing for the central width bound. The parameter-free construction fixes the linear map directly on the input Gram G, but trainable random Q, K, V matrices can alter the effective feature map and thus the spectrum of the induced kernel; it is unclear whether the effective condition number κ_d(G) remains unchanged or is reduced, which would invalidate the m = Ω(κ_d(G)^6 n log n) threshold.
Authors: Under standard Gaussian initialization of the Q, K, V matrices (variance scaled by 1/d), the composed maps preserve the input Gram matrix G in expectation, and concentration inequalities ensure that the effective data-dependent kernel and its condition number κ_d(G) match the parameter-free case with high probability as the hidden dimension grows. This is established in the structural extension section via direct computation of the induced kernel under random projections. We will revise the abstract and section to include a concise proof sketch citing standard random matrix concentration results, confirming that κ_d(G) is asymptotically unchanged (a small numeric illustration of this concentration appears after these responses). revision: yes
-
Referee: [Spectral analysis] Spectral amplification analysis: the derivation that the attention transformation cubes the condition number (yielding the sixth-power dependence in the width bound) lacks explicit error bounds or verification of the cubing step. Without these, it is difficult to confirm that the bound holds exactly rather than approximately, directly affecting the claim that convergence is impossible at practical widths.
Authors: The cubing of the condition number follows exactly from the closed-form expression for the Gram-induced kernel in the parameter-free case (Theorem 3.1), where the attention operator produces eigenvalues λ_i^3 for the rank-min(n,d) truncation of G. The sixth-power dependence in the width bound then arises from the variance terms in the NTK approximation. We will add an appendix providing explicit error bounds via the Davis-Kahan sin-Θ theorem and matrix Bernstein inequalities, showing that the spectral perturbation is O(1/sqrt(m)) and vanishes in the infinite-width limit, thereby confirming the bound holds with high probability. revision: yes
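A small numeric illustration of both responses, as a sketch rather than the paper's proof: (i) a random projection with i.i.d. N(0, 1/m) entries preserves the Gram matrix on average, E[(XW)(XW)ᵀ] = XXᵀ, with fluctuations shrinking as the projection width m grows (the response above scales variance by 1/d; the 1/m convention is used here only so that E[WWᵀ] = I_d makes the statement immediate); and (ii) cubing a symmetric PSD matrix cubes its eigenvalues, so κ(G³) = κ(G)³ on the retained spectrum.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 100                      # n < d so G = X X^T has full rank n
X = rng.normal(size=(n, d))
G = X @ X.T

# (i) Random projections preserve the Gram matrix on average:
#     E[(XW)(XW)^T] = X X^T for W with i.i.d. N(0, 1/m) entries.
for m in (64, 1024, 16384):
    W = rng.normal(scale=1.0 / np.sqrt(m), size=(d, m))
    G_proj = (X @ W) @ (X @ W).T
    rel_err = np.linalg.norm(G_proj - G) / np.linalg.norm(G)
    print(f"m={m:6d}  ||G_proj - G|| / ||G|| = {rel_err:.3f}")

# (ii) Cubing cubes the spectrum: eigenvalues of G^3 are the cubes of those of G,
#      so the condition number is cubed exactly.
eig = np.linalg.eigvalsh(G)                     # ascending
kappa = eig[-1] / eig[0]
kappa_cubed = np.linalg.cond(np.linalg.matrix_power(G, 3))
print(f"kappa(G)^3 = {kappa**3:.3e}   kappa(G^3) = {kappa_cubed:.3e}")
```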
Circularity Check
No circularity detected in the derivation chain
Full rationale
The paper derives an exact correspondence between parameter-free linearized attention and a data-dependent Gram-induced kernel directly from the model definition and input data matrix G. The spectral amplification result (cubic condition number) follows from algebraic properties of the attention transformation applied to this Gram matrix. The width lower bound m = Ω(κ_d(G)^6 n log n) is obtained by substituting the derived kernel into standard NTK convergence rates. No step reduces to a fitted parameter renamed as prediction, self-definition of the target quantity, or load-bearing self-citation. The structural extension to trainable QKV attention under standard initialization is asserted without creating a circular reduction in the core claims.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: spectral amplification property of the linearized attention operator on Gram matrices
- Standard math: standard matrix concentration and random-feature bounds for NTK convergence
invented entities (1)
- influence malleability: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
MATCHES: this paper passage directly uses, restates, or depends on the cited Recognition theorem or module.
Theorem 4.1: $K_{\mathrm{LinAttn}}(x_i, x_j) = \sum_{k,\ell} (x_i^\top x_k)(x_k^\top x_\ell)(x_\ell^\top x_j)$ … in matrix form, $\mathbf{K}_{\mathrm{LinAttn}} = \mathbf{G}^3$ where $\mathbf{G} = \mathbf{X}\mathbf{X}^\top$ (see the algebraic note following this list).
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative
MATCHES: this paper passage directly uses, restates, or depends on the cited Recognition theorem or module.
Theorem 4.7: the attention-transformed Gram matrix satisfies $\tilde{\mathbf{G}} = \mathbf{G}^3$, amplifying the condition number: $\kappa(\tilde{\mathbf{G}}) = \kappa(\mathbf{G})^3$ … $m = \Omega(\kappa(\mathbf{G})^6\, n \log n)$
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_strictMono_of_one_lt
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
Proposition 4.5: $|K_{\mathrm{LinAttn}}(x_i + \delta, x_j) - K_{\mathrm{LinAttn}}(x_i, x_j)| \le \|\mathbf{G} x_j\|_1 \cdot \epsilon$ … sensitivity depends on the correlation structure of the entire dataset through $\mathbf{G}$
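The two MATCHES passages above reduce to one algebraic fact. A compact restatement (a sketch, assuming the Theorem 4.1 closed form with the rows of X as the data points, so that $\mathbf{G}_{ik} = x_i^\top x_k$):

```latex
% Entrywise, the Theorem 4.1 sum is exactly the (i, j) entry of G^3:
\[
  K_{\mathrm{LinAttn}}(x_i, x_j)
  = \sum_{k,\ell} (x_i^{\top} x_k)(x_k^{\top} x_\ell)(x_\ell^{\top} x_j)
  = \sum_{k,\ell} \mathbf{G}_{ik}\,\mathbf{G}_{k\ell}\,\mathbf{G}_{\ell j}
  = \bigl(\mathbf{G}^{3}\bigr)_{ij}.
\]
% Writing G through its eigendecomposition shows why Theorem 4.7's condition number is cubed:
\[
  \mathbf{G} = \mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^{\top}
  \;\Longrightarrow\;
  \mathbf{G}^{3} = \mathbf{U}\boldsymbol{\Lambda}^{3}\mathbf{U}^{\top},
  \qquad
  \kappa(\mathbf{G}^{3}) = \frac{\lambda_{\max}^{3}}{\lambda_{\min}^{3}} = \kappa(\mathbf{G})^{3}.
\]
```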
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Baseline (2L-ReLU): Two-layer ReLU network with direct input processing
- [2] Attention-Enhanced (MLP-Attn): Sequential architecture with linearized attention preprocessing. Both architectures employ identical MLP components with the following specifications: hidden layer width m = 1024 neurons; first-layer initialization $w_r \sim \mathcal{N}(0, \kappa^2 I_d)$ with $\kappa = 0.01$; second-layer weights $a_r \in \{-1, 1\}$; ReLU activation; output scaling 1/ ...
- [3] Deterministic Initialization: Fixed random seeds (42) for PyTorch, NumPy, and Python random modules with deterministic CUDA operations
- [4] Consistent Data Processing: All datasets undergo identical preprocessing pipelines with fixed normalization parameters and feature scaling
- [5] Hardware: Experiments conducted on NVIDIA A100 (for 5 and 10 classes) and NVIDIA T4 GPUs (for binary classification) using PyTorch with CUDA acceleration and mixed-precision training. A.7. Model Complexity Analysis: Figure 4 presents the relationship between influence scores ...
- [6] Positive definiteness: assert $\lambda_{\min}(K) > -\tau$ with $\tau = 10^{-12}$. 2. Conditioning: assert $\kappa(K + \lambda I) < 10^{12}$
- [7] Inversion accuracy: verify $\|(K + \lambda I)^{-1}(K + \lambda I) - I\|_F < \tau$. 4. Symmetry: assert $\|K - K^\top\|_F < \tau$. All experimental configurations passed these checks, confirming that the NTK-based influence computations are numerically reliable. A.9. Loss Landscape Analysis: Figure 5 presents the 3D loss landscape for both architectures at width m = 1024 on MNIST, visualized along ...
- [8] Memory efficiency: $O(nd)$ storage vs. $O(d^4)$ for general fourth-order kernels
- [9] Deterministic computation: Fourth-order interactions computed directly from data geometry
- [10] Transitive interactions: Paths $(x_i \to x_k \to x_\ell \to x_j)$ enable multi-step similarity propagation
- [11] Theoretical tractability: Fixed structure enables exact mathematical analysis. B.6. Fourth-Order Representational Power Analysis, Theorem B.8 (Fourth-Order Function Space Characterization): the parameter-free linearized attention kernel spans the RKHS $\mathcal{H}_{\mathrm{LinAttn}} = \{\, f : f(x) = \sum_{k,\ell=1}^{n} \alpha_{k\ell} \sum_{m=1}^{n} (x^\top x_k)(x_k^\top x_m)(x_m^\top x_\ell),\ \|\alpha\|_{K^{-1}} < \infty \,\}$. Fourth-Order Capabilities: ...
- [12] Linearized attention: $S_{\mathrm{att}} = O\!\left(n \|\mathbf{G}\|_2 \|y\|^2 / \lambda^2\right)$, where $\mathbf{G} = \mathbf{X}\mathbf{X}^\top$. 2. Two-layer ReLU: $S_{\mathrm{relu}} = O\!\left(\|y\|^2 / \lambda^2\right)$. Consequently, the sensitivity gap scales as $S_{\mathrm{att}} / S_{\mathrm{relu}} = O(n\,\lambda_1(\mathbf{G}))$, where $\lambda_1(\mathbf{G})$ is the leading eigenvalue of the Gram matrix. For natural image datasets with $\lambda_1(\mathbf{G}) \gg 1$, this gap grows with both dataset size and spectral concentration, consistent with the e...
- [13] Nyström Approximation: Low-rank kernel approximation reducing complexity to $O(r^3 + nr^2)$ where $r \ll n$
- [14] Iterative Solvers: Conjugate gradient methods achieving $O(n^2 \log(1/\epsilon))$ complexity
- [15] Stochastic Estimation: Random sampling approaches for large-scale approximate influence computation. D.2. Scope of Linearization: The analysis relies on linearized attention $f_{\mathrm{att}}(X) = \mathbf{X}\mathbf{X}^\top\mathbf{X}$, which approximates softmax attention via a first-order Taylor expansion. This removes the competitive normalization dynamics of softmax. Whether the observed NTK non-co...
- [16] Parameterized Attention Analysis: Joint training dynamics of query, key, value matrices
- [17] Multi-Head Attention: Compositional kernel analysis for multiple attention heads
- [18] Layer Normalization: Impact of normalization on kernel structure and influence patterns. D.4. Theoretical Gaps: several theoretical questions remain open:
- [19] Finite-Width Bounds: Precise characterization of approximation quality for practical network sizes
- [20] Training Dynamics: Evolution of influence patterns throughout the training process
- [21] Generalization Bounds: Connection between influence malleability and generalization performance. E. Extended Learning Dynamics Analysis: Table 6 presents a comprehensive analysis of attention learning dynamics throughout training, revealing how influence malleability develops ...