Attention-based PCA
Pith reviewed 2026-05-20 09:03 UTC · model grok-4.3
The pith
Trained attention layers align parameters with principal eigenvectors of Gaussian data covariances.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When trained on Gaussian data, both softmax and linear attention layers learn parameters that align with the principal eigenvectors of the covariance matrix, thereby establishing a direct and explicit connection with PCA. Our analysis covers both finite and infinite prompt regimes. In the infinite-prompt limit, we prove convergence to globally optimal solutions aligned with the leading spectral direction, while in the finite-prompt setting we show that the same behavior emerges up to sampling effects. We further extend the analysis to an in-context setting with spiked Wishart covariances, where attention successfully recovers the underlying signal direction.
What carries the argument
Convergence of attention parameters to the leading eigenvectors of the data covariance under unsupervised training.
If this is right
- Attention performs PCA-style dimensionality reduction without any supervised signal.
- The same spectral alignment appears in both finite-data and infinite-data prompt regimes.
- In spiked covariance settings the mechanism recovers the planted signal direction.
- The result supplies a theoretical basis for attention's unsupervised representation learning.
Where Pith is reading between the lines
- The alignment might persist under mild deviations from Gaussianity if the loss still emphasizes variance maximization.
- Explicitly inserting a PCA-like regularizer into attention training could accelerate or stabilize the observed convergence.
- The finding invites direct comparison of attention layers against classical PCA on real non-Gaussian datasets to test robustness.
Load-bearing premise
The inputs are drawn from a Gaussian distribution or spiked Wishart model and the training objective is unsupervised with a loss that rewards alignment to spectral directions.
What would settle it
An experiment that trains the attention layers on Gaussian data under the stated unsupervised objective and finds that the learned parameters remain misaligned with the top eigenvectors of the covariance matrix would falsify the claimed convergence.
Figures
read the original abstract
We study attention mechanisms through the lens of a canonical unsupervised problem: principal component analysis (PCA). We show that, when trained on Gaussian data, both softmax and linear attention layers learn parameters that align with the principal eigenvectors of the covariance matrix, thereby establishing a direct and explicit connection with PCA. Our analysis covers both finite and infinite prompt regimes. In the infinite-prompt limit, we prove convergence to globally optimal solutions aligned with the leading spectral direction, while in the finiteprompt setting we show that the same behavior emerges up to sampling effects. We further extend the analysis to an in-context setting with spiked Wishart covariances, where attention successfully recovers the underlying signal direction. These results demonstrate that attention inherently performs PCA-like computations under unsupervised objectives, providing a theoretical foundation for its representation-learning capabilities.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that attention mechanisms can be understood through the lens of PCA. When trained unsupervised on Gaussian data, both softmax and linear attention layers learn parameters aligned with the principal eigenvectors of the covariance matrix. Proofs are given for global convergence to the leading eigenvector in the infinite-prompt limit; in the finite-prompt regime the same alignment is asserted to emerge up to sampling effects. The analysis is extended to an in-context spiked-Wishart setting in which attention recovers the underlying signal direction.
Significance. If the derivations are correct, the work supplies an explicit, parameter-free link between attention training and PCA, which is a notable contribution to the theoretical understanding of representation learning in transformers. The infinite-prompt proofs and the spiked-Wishart extension are concrete strengths. The result is of clear interest to the math.OC community studying implicit biases of attention objectives.
major comments (1)
- [Finite-prompt analysis] Finite-prompt analysis (abstract and § on finite-prompt behavior): the claim that alignment 'emerges up to sampling effects' is load-bearing for the central assertion of a 'direct and explicit connection' in practical regimes, yet no explicit deviation bounds, variance estimates, concentration inequalities, or high-probability guarantees are supplied on the distance between learned weights and eigenvectors when the number of prompts is small relative to dimension. Without these quantitative controls the finite-sample statement remains unverified and weakens the overall claim.
minor comments (2)
- [Methods / Loss definition] The loss function and its explicit form should be stated in the main text rather than deferred to an appendix, to allow readers to verify the unsupervised objective directly.
- [Notation] Notation for the attention parameters (e.g., the distinction between query/key/value matrices and the learned weight vector) is occasionally ambiguous; a single consolidated table of symbols would improve readability.
Simulated Author's Rebuttal
We thank the referee for their careful reading, positive assessment of the infinite-prompt proofs and spiked-Wishart extension, and for identifying an important point regarding the finite-prompt regime. We address the major comment below and describe the revisions we will undertake.
read point-by-point responses
-
Referee: [Finite-prompt analysis] Finite-prompt analysis (abstract and § on finite-prompt behavior): the claim that alignment 'emerges up to sampling effects' is load-bearing for the central assertion of a 'direct and explicit connection' in practical regimes, yet no explicit deviation bounds, variance estimates, concentration inequalities, or high-probability guarantees are supplied on the distance between learned weights and eigenvectors when the number of prompts is small relative to dimension. Without these quantitative controls the finite-sample statement remains unverified and weakens the overall claim.
Authors: We agree that the finite-prompt claims would be strengthened by explicit quantitative controls. The current manuscript shows alignment in the finite-prompt regime by analyzing the population objective (which favors the leading eigenvector) together with numerical experiments that illustrate close agreement for moderate prompt counts, with observed deviations attributed to sampling. However, we did not supply rigorous deviation bounds, variance estimates, or high-probability guarantees. In the revised version we will add a new subsection that derives such controls via matrix concentration inequalities applied to the empirical covariance matrix formed from the prompts. Specifically, we will prove that the distance between the learned weights and the true principal eigenvector is O(sqrt(d/N)) with high probability (via a matrix Bernstein bound), where d is dimension and N the number of prompts, together with explicit variance estimates. We will also update the abstract and the finite-prompt section to state the result with these rates. These additions will make the finite-sample connection fully rigorous while preserving the existing analysis. revision: yes
Circularity Check
Derivation from explicit unsupervised loss on Gaussian data to eigenvector alignment is self-contained
full rationale
The paper starts from a defined unsupervised training objective on Gaussian (or spiked Wishart) inputs and derives that attention parameters converge to principal eigenvectors in the infinite-prompt limit, with finite-prompt behavior following up to sampling. This is a direct mathematical analysis of the loss landscape and optimization dynamics rather than any reduction of a claimed prediction back to a fitted parameter or self-citation chain. No load-bearing step equates the target alignment to the input by construction; the result is obtained by analyzing the given objective. The finite-prompt caveat is a limitation on quantification but does not create circularity.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Input data is drawn from a multivariate Gaussian distribution (or spiked Wishart covariance).
- domain assumption The attention layer is trained with an unsupervised objective whose stationary points coincide with principal directions.
Reference graph
Works this paper leans on
-
[1]
Shivam Garg, Dimitris Tsipras, Percy Liang, and Gregory Valiant
URL https://arxiv.org/abs/2408.01367. Shivam Garg, Dimitris Tsipras, Percy Liang, and Gregory Valiant. What can transformers learn in-context? a case study of simple function classes, 2023. URLhttps://arxiv.org/abs/2208.01066. Gene H. Golub and Charles F. Van Loan.Matrix Computations. Johns Hopkins University Press, Baltimore, MD, USA, 3rd edition, 1996. ...
-
[2]
E [ (X⊤ 1 µ)2] =µT Σµ
-
[3]
E [⟨X1,X 2⟩⟨X1,µ⟩⟨X2,µ⟩] =µT Σ 2µ
-
[4]
E [ ⟨X1,X 2⟩⟨X1,µ⟩3⟨X2,µ⟩ ] = 3(µT Σµ)(µT Σ 2µ)
-
[5]
E [ ⟨X1,µ⟩2∥X1∥2 ] = tr(Σ)(µT Σµ) + 2µT Σ 2µ
-
[6]
E [ ⟨X1,µ⟩4∥X1∥2 ] = 3tr(Σ)(µT Σµ)2 + 12(µT Σµ)(µT Σ 2µ). Proof. All identities follow from linearity of expectation, independence ofX1 and X2, and Isserlis’ theorem Isserlis [1918] for centered Gaussian vectors
work page 1918
-
[7]
Since∥X1∥2 = ∑d i=1X2 1,i and E[X2 1,i] = Σii, E[∥X1∥2] = d∑ i=1 Σii = tr(Σ)
-
[8]
Expanding (X⊤ 1 µ)2 and using E[X1,iX1,j] = Σij gives E[(X⊤ 1 µ)2] = ∑ i,j µiµjΣij =µ⊤Σµ
-
[9]
Therefore, E[⟨X1,X 2⟩⟨X1,µ⟩⟨X2,µ⟩] = E[(X⊤ 1 µ)(X⊤ 1 Σµ)] =µ⊤Σ 2µ
Conditioning onX1 and using independence, E[⟨X1,X 2⟩⟨X2,µ⟩|X1] =X⊤ 1 E[X2X⊤ 2 ]µ=X⊤ 1 Σµ. Therefore, E[⟨X1,X 2⟩⟨X1,µ⟩⟨X2,µ⟩] = E[(X⊤ 1 µ)(X⊤ 1 Σµ)] =µ⊤Σ 2µ
-
[10]
Conditioning again onX1, E[⟨X1,X 2⟩⟨X1,µ⟩3⟨X2,µ⟩] = E[(X⊤ 1 µ)3(µ⊤ΣX1)]. For a centered Gaussian vectorX and vectorsµ0,µ1, Isserlis’ theorem yields E[(µ⊤ 0X)3(µ⊤ 1X)] = 3(µ⊤ 0 Σµ0)(µ⊤ 0 Σµ1). Applying this witha =µ0 and µ1 = Σµgives the claim. 24
-
[11]
Expanding ⟨X1,µ⟩2∥X1∥2 = ∑ i,j,k µiµjX1,iX1,jX2 1,k and applying Isserlis’ theorem to the fourth-order moment yields two types of pairings, leading to E[⟨X1,µ⟩2∥X1∥2] = tr(Σ)(µ⊤Σµ) + 2µ⊤Σ 2µ
-
[12]
Similarly, ⟨X1,µ⟩4∥X1∥2 = ∑ i,j,ℓ,m,k µiµjµℓµmX1,iX1,jX1,ℓX1,mX2 1,k. Applying Isserlis’ theorem to the sixth-order moment, pairings where the two copies ofk are paired together contribute 3tr(Σ)(µ⊤Σµ)2, while the remaining12 pairings contribute 12(µ⊤Σµ)(µ⊤Σ 2µ). Summing both terms gives the result. C.2 Optimization preliminaries In this subsection, we ga...
-
[13]
fn has a unique critical pointx⋆ n in B(x⋆,r ),
-
[14]
x⋆ n is nondegenerate and moreoverx⋆ n→x⋆as n→∞,
-
[15]
Letρ>0 such that Λ⋃ k=1 B(x(k),rk)⊂B(0,ρ)
If the set of critical points off is finite, i.e., crit(f) ={x(1),...,x (Λ)}, and eachx(k), k∈{1,..., Λ}, is nondegenerate, letrk > 0 be such thatB(x(k),rk) are pairwise disjoint. Letρ>0 such that Λ⋃ k=1 B(x(k),rk)⊂B(0,ρ). Then forn large enough, crit(fn)∩B(0,ρ) ={x(1) n ,...,x (Λ) n }. Moreover, if for somek∈{1,..., Λ}, x(k) is a strict saddle (resp. a s...
work page 2013
-
[16]
Since∇2fn(x⋆ n)→∇2f(x⋆) and the limit is invertible,x⋆ n is nondegenerate forn large. Let Γ := 1 ∥[∇2f(x⋆)]−1∥, by continuity of∇2f, shrinkingr if necessary, we have for everyx∈B(x⋆,r ), ∥∇2f(x)−∇2f(x⋆)∥≤Γ 2, 28 and since∥∇2f(x⋆)v∥≥Γ∥v∥, using Taylor we get ∇f(x) =∇2f(x⋆)(x−x⋆) + ∫ 1 0 (∇2f(x⋆+t(x−x⋆))−∇2f(x⋆))(x−x⋆)dt. We take norms and after bounding we...
-
[17]
Since the set is finite, define δ:= 1 2 min k̸=ℓ ∥x(k)−x(ℓ)∥> 0
Assume now that crit(f) ={x(1),...,x (Λ)}, ∇2f(x(k)) is invertible for allk. Since the set is finite, define δ:= 1 2 min k̸=ℓ ∥x(k)−x(ℓ)∥> 0. Applying items (1)–(2) to eachx(k), there existrk∈(0,δ) and nk∈N such that for alln≥nk: (a) fn has a unique critical pointx(k) n in B(x(k),rk), (b) x(k) n →x(k), (c) x(k) n is nondegenerate. Let n0 := max k∈{1,...,Λ...
-
[18]
E[µ⊤Σ 2µ] =n(n + 1)µ⊤V 2µ+n tr(V )µ⊤Vµ
-
[19]
E[(µ⊤Σ 2µ)(µ⊤Σ 2µ)] =n(n + 2)[(n + 3)(µ⊤Vµ)(µ⊤V 2µ) + (µ⊤Vµ)2 tr(V )]. Proof. We use the standard Gaussian representation of the Wishart distribution: Σ = n∑ r=1 xrx⊤ r, x r iid ∼N(0,V )
-
[20]
Since tr(xrx⊤ r ) =∥xr∥2 and E[∥xr∥2] = trV, linearity of expectation gives E[tr Σ] = n∑ r=1 E[∥xr∥2] =n trV
-
[21]
Write µ⊤Σ 2µ= n∑ i,j=1 (µ⊤αi)(α⊤ i xj)(µ⊤xj). Splitting the sum into the casesi =j and i̸=j: (i) Diagonal terms.Fori =j, E [ (µ⊤x)2(x⊤x) ] =µ⊤VµtrV + 2µ⊤V 2µ, by Isserlis’ formula. (ii) Off-diagonal terms.Fori̸=j, independence yields E[(µ⊤αi)(α⊤ i xj)(µ⊤xj)] =µ⊤V 2µ. Counting terms, E[µ⊤Σ 2µ] =n ( µ⊤VµtrV + 2µ⊤V 2µ ) +n(n−1)µ⊤V 2µ, which simplifies to E[µ...
-
[22]
Then µ⊤Σµ= ∑ k s2 k, µ ⊤Σ 2µ= ∑ i,j si(α⊤ i xj)sj
Let sr =µ⊤xr. Then µ⊤Σµ= ∑ k s2 k, µ ⊤Σ 2µ= ∑ i,j si(α⊤ i xj)sj. Hence E[(µ⊤Σµ)(µ⊤Σ 2µ)] = ∑ i,j,k E [ s2 ksi(α⊤ i xj)sj ] . The expectation depends on coincidences among the indices(i,j,k ). Using Isserlis’ theorem and independence, one obtains: - i,j,k all distinct: contribution(µ⊤Vµ)(µ⊤V 2µ). - i =j̸=k: contribution (µ⊤Vµ) ( µ⊤VµtrV + 2µ⊤V 2µ ) . - i =...
-
[23]
Orthogonal: α= 0, w̸= 0, withA(r,0) = 0
-
[24]
Aligned: w = 0, α̸= 0, so thatµ=αv, with A(r,α) +B(r,α) = 0
-
[25]
Off-axis: α̸= 0, w̸= 0, with A(r,α) = 0 and B(r,α) = 0. Proof. A stationary point satisfies A(r,α)µ+B(r,α)αv= 0. Writingµ=αv+w with w⊥v and projecting ontospan(v) and its orthogonal complement yields A(r,α)w = 0, α ( A(r,α) +B(r,α) ) = 0. The conclusion follows by considering whetherα= 0 or not and whetherw = 0 or not. Proposition 19(Characterization of f...
-
[26]
µ= 0 is a local maximum: the Hessian has only negative eigenvalues along all nonzero directions
-
[27]
Any admissible orthogonal point is a strict saddle: the Hessian has exactly one negative eigenvalue along v, one positive eigenvalue along the vector’s own direction andd−2 zero eigenvalues
-
[28]
Moreover, the only admissible aligned points are µ⋆=±α⋆v, forα⋆> 0 defined in(22)
Any admissible aligned point is a strict local minimum. Moreover, the only admissible aligned points are µ⋆=±α⋆v, forα⋆> 0 defined in(22)
-
[29]
SinceRICL ∞is coercive, these two points are the global minimizers of the function
There are no admissible off-axis solutions. SinceRICL ∞is coercive, these two points are the global minimizers of the function. Finally, for almost every initializationµ0∈Rd, the gradient flow ofRICL ∞converges to one of the two global minimizers±α⋆v. 40 Proof. We recall by Lemma 13 that∇RICL ∞(µ) = 2(A(r,α)µ+B(r,α)αv), where A(r,α) =a1r2 +a2α2−a3, B (r,α...
-
[30]
This point is a strict local maximum
The pointµ⋆ L,0 satisfiesµ⋆ L,0→0 as L→∞. This point is a strict local maximum
-
[31]
These points are local minimizers
The points±µ⋆ L,∥satisfy±µ⋆ L,∥→±α⋆v as L→∞, withα⋆according to(22). These points are local minimizers
-
[32]
For each orthogonal critical pointµ(j) ⊥, j = 1,..., 2(d−1) ofRICL ∞, there exists a neighborhoodUj such that every critical point ofRICL L in Uj is a strict saddle. Proof. The first two items 1 and 2 follow directly from Proposition 5. It remains to prove 3. Let crit(RICL ∞) ={0,±α⋆v,µ(1) ⊥,...,µ(2(d−1)) ⊥ }, where µ(j) ⊥denote the orthogonal critical po...
work page 2018
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.