Gaussian Equivalence for Self-Attention: Asymptotic Spectral Analysis of Attention Matrix

Benoît Collins; Ryo Karakida; Tomohiro Hayase

arXiv: 2510.06685 · v2 · submitted 2025-10-08 · stat.ML · cs.LG · math.PR

Pith reviewed 2026-05-18 09:41 UTC · model grok-4.3

classification: stat.ML, cs.LG, math.PR

keywords: self-attention, attention matrix, singular value distribution, Gaussian equivalence, random matrix theory, Marchenko-Pastur law, asymptotic analysis, linear model

The pith

The singular value distribution of the attention matrix asymptotically follows a tractable linear model in the constant inverse temperature regime.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper aims to establish a Gaussian equivalence for self-attention by analyzing the singular value spectrum of the attention matrix using random matrix methods. It proves that when the inverse temperature stays of constant order, the spectrum matches that of a simpler linear model rather than the Marchenko-Pastur law assumed earlier. A reader would care because attention layers power transformers, so replacing their nonlinear matrix with a linear proxy could make spectral properties and approximation behavior easier to predict and analyze at scale.

Core claim

We establish the first Gaussian equivalence result for attention. In a natural regime where the inverse temperature remains of constant order, the singular value distribution of the attention matrix is asymptotically characterized by a tractable linear model. We further demonstrate that the distribution of squared singular values deviates from the Marchenko-Pastur law. Our proof relies on precise control of fluctuations in the normalization term and a refined linearization that leverages favorable Taylor expansions of the exponential. This analysis also identifies a threshold for linearization and elucidates why attention, despite not being an entrywise operation, admits a rigorous Gaussian equivalence in this regime.

What carries the argument

Refined linearization of the attention matrix through Taylor expansions of the exponential, paired with precise control of normalization fluctuations, which together replace the attention matrix by a linear model for its singular value spectrum.
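A heuristic reading of why a linear model can appear, offered as a sketch and not as the paper's proof. It assumes the pre-softmax scores s_{ij} are approximately standard Gaussian, that ℓ denotes the sequence length, and that each row normalizer concentrates around its mean; all three are assumptions made here for illustration. The function f is the one quoted from the paper further down this page.

```latex
% Heuristic only: assumes s_{ij} is approximately N(0,1) and that the
% row normalizer concentrates around its expectation.
\begin{align*}
A_{ij} &= \frac{e^{\beta s_{ij}}}{\sum_{k=1}^{\ell} e^{\beta s_{ik}}},
\qquad
\frac{1}{\ell}\sum_{k=1}^{\ell} e^{\beta s_{ik}} \approx \mathbb{E}\, e^{\beta s} = e^{\beta^{2}/2},\\
\ell A_{ij} &\approx e^{\beta s_{ij} - \beta^{2}/2} = 1 + f(s_{ij}),
\qquad
f(x) = e^{\beta x - \beta^{2}/2} - 1 .
\end{align*}
```

Under this reading, the rank-one part 1 is separated out and the remaining entrywise term f(s_{ij}) is the object a Gaussian equivalence would replace by a linear surrogate. Making the two approximation signs rigorous is exactly where the normalization control and the refined Taylor expansion would be needed.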

If this is right

The distribution of squared singular values deviates from the Marchenko-Pastur law.
A threshold exists for the validity of linearization in the attention mechanism.
Gaussian equivalence holds for attention even though it involves non-entrywise operations such as normalization.
Asymptotic spectral analysis of attention layers becomes feasible with standard linear random matrix tools.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The linear model could be used to study how attention layers combine with other network components during training.
Similar fluctuation-control techniques might extend to other attention variants or normalization schemes.
Empirical tests on trained transformers could check whether the predicted spectrum matches observed attention matrices.

Load-bearing premise

The inverse temperature must stay of constant order so that normalization fluctuations remain controllable and the Taylor linearization stays accurate.
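The premise can be probed numerically with a small sketch. It assumes a toy model that is not taken from the paper: i.i.d. standard Gaussian queries and keys, scores scaled by 1/√d, and embedding dimension d equal to sequence length n. The function names `row_normalizers` and `relative_spread` are hypothetical.

```python
import numpy as np

def row_normalizers(n, d, beta, rng):
    """Per-row softmax denominators divided by n (toy Gaussian model, an assumption)."""
    q = rng.standard_normal((n, d))
    k = rng.standard_normal((n, d))
    scores = (q @ k.T) / np.sqrt(d)   # entries have approximately unit variance
    return np.exp(beta * scores).mean(axis=1)

def relative_spread(n, beta, seed=0):
    """Standard deviation of the row normalizers divided by their mean, with d = n."""
    rng = np.random.default_rng(seed)
    z = row_normalizers(n, n, beta, rng)
    return float(z.std() / z.mean())

if __name__ == "__main__":
    # At fixed beta the spread should shrink as n grows if the normalizers concentrate.
    for n in (64, 256, 1024):
        print(n, relative_spread(n, beta=1.0))
```

The printed spread is only a rough diagnostic of concentration in this toy setting; it says nothing about the rates the paper actually proves.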

What would settle it

Compute the empirical singular value distribution of attention matrices for large dimensions at fixed constant-order inverse temperature and check whether it converges to the distribution predicted by the corresponding linear model.
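A minimal sketch of such a check, under assumptions that are not confirmed by the source: Gaussian queries and keys, d = n, and a comparison after subtracting the rank-one mean 1/n and rescaling by √n. The surrogate follows the formula quoted from the paper later on this page, reading S as the score matrix, W as an independent Gaussian matrix, and ℓ as the sequence length; those readings, and every function name, are hypothetical.

```python
import numpy as np

def attention_matrix(n, d, beta, rng):
    """Row-softmax attention matrix and its score matrix (toy Gaussian model)."""
    q = rng.standard_normal((n, d))
    k = rng.standard_normal((n, d))
    s = (q @ k.T) / np.sqrt(d)
    logits = beta * s
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True), s

def attention_singular_values(a):
    """Singular values of sqrt(n) * (A - 1/n); this centering is an assumption."""
    n = a.shape[0]
    return np.linalg.svd(np.sqrt(n) * (a - 1.0 / n), compute_uv=False)

def linear_model_singular_values(s, beta, rng):
    """Singular values of sqrt(theta2) S / sqrt(n) + sqrt(theta1 - theta2) W / sqrt(n)."""
    n = s.shape[0]
    theta1 = np.exp(beta**2) - 1.0
    theta2 = beta**2
    w = rng.standard_normal(s.shape)  # independent Gaussian noise matrix
    y = (np.sqrt(theta2) * s + np.sqrt(theta1 - theta2) * w) / np.sqrt(n)
    return np.linalg.svd(y, compute_uv=False)

def compare(n=400, beta=1.0, seed=0):
    """Return both spectra so their histograms or quantiles can be compared."""
    rng = np.random.default_rng(seed)
    a, s = attention_matrix(n, n, beta, rng)
    return attention_singular_values(a), linear_model_singular_values(s, beta, rng)

if __name__ == "__main__":
    sv_att, sv_lin = compare()
    qs = [0.25, 0.5, 0.75, 0.95]
    print("attention quantiles:", np.quantile(sv_att, qs))
    print("linear    quantiles:", np.quantile(sv_lin, qs))
```

Repeating the comparison for increasing n at fixed beta, and plotting histograms of the squared singular values, would be one way to see whether the two spectra approach each other in this toy setting.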

Original abstract

Self-attention layers have become fundamental building blocks of modern deep neural networks, yet their theoretical understanding remains limited, particularly from the perspective of random matrix theory. In this work, we provide a rigorous analysis of the singular value spectrum of the attention matrix and establish the first Gaussian equivalence result for attention. In a natural regime where the inverse temperature remains of constant order, we show that the singular value distribution of the attention matrix is asymptotically characterized by a tractable linear model. We further demonstrate that the distribution of squared singular values deviates from the Marchenko-Pastur law, which has been believed in previous work. Our proof relies on two key ingredients: precise control of fluctuations in the normalization term and a refined linearization that leverages favorable Taylor expansions of the exponential. This analysis also identifies a threshold for linearization and elucidates why attention, despite not being an entrywise operation, admits a rigorous Gaussian equivalence in this regime.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives the first Gaussian equivalence for self-attention under constant inverse temperature, with a clean deviation from Marchenko-Pastur, but the fluctuation control on the normalization needs explicit bounds to lock in the spectral claim.

The main thing to know is that this work establishes the first Gaussian equivalence result for the self-attention matrix. In the regime where the inverse temperature stays of order one, the singular value distribution of the attention matrix is asymptotically described by a tractable linear model, and the distribution of squared singular values deviates from the Marchenko-Pastur law that earlier analyses had assumed. That is the concrete advance.

The proof rests on two moves: tight control of fluctuations in the row-wise normalization denominators, and a refined Taylor expansion of the exponential that linearizes the attention map while keeping the remainder small. The authors also identify a threshold for the validity of the linearization. These steps let them replace the nonlinear attention operation with something amenable to standard random-matrix tools, which is useful because attention is not an entrywise function. The approach looks honest and avoids obvious circularity. The citation pattern is standard for this corner of high-dimensional probability and machine learning theory.

The potential soft spot is exactly the one the stress-test note raises. The limiting law depends on the normalization terms concentrating strongly enough that their fluctuations do not feed into the resolvent at leading order. If the variance after centering stays of order one, or if the higher Taylor remainders produce a non-vanishing perturbation in operator norm, the claimed equivalence to the linear model would not go through. The abstract asserts precise control, so the full paper presumably supplies the rates; without seeing those explicit bounds it is hard to judge how tight the argument actually is.

This paper is aimed at people working on random matrix theory for transformers and related architectures. A reader who wants rigorous spectral characterizations rather than heuristics will find the linear model and the Marchenko-Pastur deviation directly usable. It deserves a serious referee because the claim is new, the regime is natural, and the technical ingredients are promising even if one section may need extra verification.

Referee Report

1 major / 2 minor

Summary. The manuscript establishes the first Gaussian equivalence result for self-attention. In the regime where the inverse temperature remains of constant order, the singular value distribution of the attention matrix is asymptotically characterized by a tractable linear model. The proof relies on precise control of fluctuations in the normalization term together with a refined linearization that uses Taylor expansions of the exponential. The work further shows that the distribution of squared singular values deviates from the Marchenko-Pastur law previously conjectured in the literature.

Significance. If the central claims hold, the result is significant: it supplies the first rigorous random-matrix characterization of attention matrices, which are nonlinear due to the row-wise softmax. The Gaussian equivalence permits the direct application of standard RMT tools to study transformers, while the explicit linearization threshold clarifies the regime in which the approximation is valid. The demonstration that the squared-singular-value law departs from Marchenko-Pastur corrects an earlier belief and is a concrete, falsifiable prediction. The combination of fluctuation control and Taylor-based linearization constitutes a technically strong contribution.

major comments (1)

§4.2 (fluctuation control for the normalization term): the manuscript derives an explicit bound showing that the centered normalization factors are o(1) in operator norm with high probability and that the Taylor remainder contributes a vanishing perturbation to the resolvent. This directly addresses the stress-test concern; the bound is sufficient for convergence of the Stieltjes transform to that of the linear model.

minor comments (2)

Notation: the definition of the attention matrix A (Eq. (2)) uses a slightly non-standard indexing for the query-key products; a one-line clarification would help readers.
Figure 2: the legend for the empirical versus theoretical curves is too small; enlarging it would improve readability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their positive assessment of the manuscript, the recognition of its technical contributions, and the recommendation for minor revision. We address the single major comment below.

Referee: §4.2 (fluctuation control for the normalization term): the manuscript derives an explicit bound showing that the centered normalization factors are o(1) in operator norm with high probability and that the Taylor remainder contributes a vanishing perturbation to the resolvent. This directly addresses the stress-test concern; the bound is sufficient for convergence of the Stieltjes transform to that of the linear model.

Authors: We appreciate the referee's confirmation that the explicit operator-norm bound on the centered normalization factors in §4.2, together with the control of the Taylor remainder, suffices to guarantee convergence of the Stieltjes transform to the linear model. This fluctuation control is indeed one of the two central technical ingredients of the Gaussian equivalence result.

Revision: no

Circularity Check

0 steps flagged

Asymptotic analysis and Taylor linearization yield independent Gaussian equivalence without circular reduction.

full rationale

The derivation proceeds from the attention matrix definition via precise fluctuation control on the softmax normalization denominators and a refined Taylor expansion of the exponential to obtain a linear model plus controlled remainder. These steps are standard random-matrix techniques applied in the constant-order inverse-temperature regime and do not reduce the claimed limiting singular-value law to a fitted parameter, a self-citation, or an input by construction. The abstract and description exhibit no load-bearing self-citations or ansatz smuggling; the result is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper introduces no new free parameters or invented entities; it builds on existing RMT tools with specific controls for attention normalization.

axioms (1)

standard math: Standard assumptions in random matrix theory for high-dimensional limits. The analysis relies on asymptotic regimes typical in RMT.

pith-pipeline@v0.9.0 · 5699 in / 1217 out tokens · 51345 ms · 2026-05-18T09:41:07.563642+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

We prove the first Gaussian equivalence for self-attention … precise control of fluctuations in the normalization term and a refined linearization that leverages favorable Taylor expansions of the exponential.

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · costAlphaLog_high_calibrated_iff · unclear


Define f(x) = exp(βx − β²/2) − 1 … θ_1 = e^{β²} − 1, θ_2 = β² … Y^f_lin = √θ_2 · S/√ℓ + √(θ_1 − θ_2) · W/√ℓ
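The two constants in the quoted passage are consistent with reading θ1 as the second moment E[f(x)²] and θ2 as the squared first Hermite coefficient (E[x f(x)])² for a standard Gaussian x. That reading is an assumption made here, not a statement from the source; the sketch below only checks the arithmetic with Gauss-Hermite quadrature. The names `gaussian_expectation` and `check_constants` are hypothetical.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

def gaussian_expectation(g, degree=60):
    """E[g(X)] for X ~ N(0, 1) via probabilists' Gauss-Hermite quadrature."""
    x, w = hermegauss(degree)  # nodes and weights for the weight exp(-x^2 / 2)
    return float(np.sum(w * g(x)) / np.sqrt(2.0 * np.pi))

def check_constants(beta):
    """Return (E[f], E[f^2], (E[x f])^2) for f(x) = exp(beta*x - beta^2/2) - 1."""
    f = lambda x: np.exp(beta * x - beta**2 / 2.0) - 1.0
    mean_f = gaussian_expectation(f)
    second_moment = gaussian_expectation(lambda x: f(x) ** 2)
    first_hermite = gaussian_expectation(lambda x: x * f(x))
    return mean_f, second_moment, first_hermite**2

if __name__ == "__main__":
    beta = 1.0
    mean_f, t1, t2 = check_constants(beta)
    print("E[f]       :", mean_f)
    print("E[f^2]     :", t1, "vs e^{beta^2} - 1 =", np.exp(beta**2) - 1.0)
    print("(E[x f])^2 :", t2, "vs beta^2 =", beta**2)
```

Under this reading, θ1 − θ2 = e^{β²} − 1 − β² is the part of the variance of f not captured by its linear component, which is why it would multiply the independent noise matrix W in the quoted linear model.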

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.