pith. sign in

arxiv: 2605.18315 · v1 · pith:Q4MWDGYYnew · submitted 2026-05-18 · 🧮 math.OC · stat.ML

Attention-based PCA

Pith reviewed 2026-05-20 09:03 UTC · model grok-4.3

classification 🧮 math.OC stat.ML
keywords attention mechanismsprincipal component analysisunsupervised learningGaussian dataeigenvectorssoftmax attentionlinear attentionspiked Wishart
0
0 comments X

The pith

Trained attention layers align parameters with principal eigenvectors of Gaussian data covariances.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that both softmax and linear attention, when trained unsupervised on Gaussian data, converge to parameters matching the principal eigenvectors of the input covariance matrix. This alignment holds in finite prompt settings up to sampling variation and is proved to reach global optima aligned with the leading direction in the infinite-prompt limit. The analysis further covers in-context learning with spiked Wishart models, where attention recovers the planted signal direction. A sympathetic reader would care because the result supplies an explicit unsupervised mechanism by which attention acquires PCA-style representations without external labels.

Core claim

When trained on Gaussian data, both softmax and linear attention layers learn parameters that align with the principal eigenvectors of the covariance matrix, thereby establishing a direct and explicit connection with PCA. Our analysis covers both finite and infinite prompt regimes. In the infinite-prompt limit, we prove convergence to globally optimal solutions aligned with the leading spectral direction, while in the finite-prompt setting we show that the same behavior emerges up to sampling effects. We further extend the analysis to an in-context setting with spiked Wishart covariances, where attention successfully recovers the underlying signal direction.

What carries the argument

Convergence of attention parameters to the leading eigenvectors of the data covariance under unsupervised training.

If this is right

  • Attention performs PCA-style dimensionality reduction without any supervised signal.
  • The same spectral alignment appears in both finite-data and infinite-data prompt regimes.
  • In spiked covariance settings the mechanism recovers the planted signal direction.
  • The result supplies a theoretical basis for attention's unsupervised representation learning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The alignment might persist under mild deviations from Gaussianity if the loss still emphasizes variance maximization.
  • Explicitly inserting a PCA-like regularizer into attention training could accelerate or stabilize the observed convergence.
  • The finding invites direct comparison of attention layers against classical PCA on real non-Gaussian datasets to test robustness.

Load-bearing premise

The inputs are drawn from a Gaussian distribution or spiked Wishart model and the training objective is unsupervised with a loss that rewards alignment to spectral directions.

What would settle it

An experiment that trains the attention layers on Gaussian data under the stated unsupervised objective and finds that the learned parameters remain misaligned with the top eigenvectors of the covariance matrix would falsify the claimed convergence.

Figures

Figures reproduced from arXiv: 2605.18315 by Claire Boyer (IUF), Rodrigo Maulen-Soto (LPSM, SU).

Figure 1
Figure 1. Figure 1: Alignment toward the principal eigenvector [PITH_FULL_IMAGE:figures/full_fig_p006_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Alignment toward the signal direction v over iterations: SGD on RICL L (L = 100) vs GD on RICL ∞ Following the argument of Proposition 8, one proves that for L large enough, the critical points of RICL L in a compact set correspond to perturbations of those of RICL ∞ . In particular, in this compact set there is one local maximum near 0, two local minima near ±α ⋆v, denoted ±µ ⋆ L,∥ , and possible saddle p… view at source ↗
Figure 3
Figure 3. Figure 3: Softmax attention: finite vs. infinite prompt regimes. [PITH_FULL_IMAGE:figures/full_fig_p044_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Convergence of linear attention toward the principal eigenvector under empirical and analytic risks. [PITH_FULL_IMAGE:figures/full_fig_p045_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Final alignment as a function of the prompt length [PITH_FULL_IMAGE:figures/full_fig_p046_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Final alignment as a function of the dimension [PITH_FULL_IMAGE:figures/full_fig_p046_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: In-context learning: finite vs. infinite prompt regimes. [PITH_FULL_IMAGE:figures/full_fig_p047_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: ICL performance as a function of prompt length and dimension. [PITH_FULL_IMAGE:figures/full_fig_p047_8.png] view at source ↗
read the original abstract

We study attention mechanisms through the lens of a canonical unsupervised problem: principal component analysis (PCA). We show that, when trained on Gaussian data, both softmax and linear attention layers learn parameters that align with the principal eigenvectors of the covariance matrix, thereby establishing a direct and explicit connection with PCA. Our analysis covers both finite and infinite prompt regimes. In the infinite-prompt limit, we prove convergence to globally optimal solutions aligned with the leading spectral direction, while in the finiteprompt setting we show that the same behavior emerges up to sampling effects. We further extend the analysis to an in-context setting with spiked Wishart covariances, where attention successfully recovers the underlying signal direction. These results demonstrate that attention inherently performs PCA-like computations under unsupervised objectives, providing a theoretical foundation for its representation-learning capabilities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript claims that attention mechanisms can be understood through the lens of PCA. When trained unsupervised on Gaussian data, both softmax and linear attention layers learn parameters aligned with the principal eigenvectors of the covariance matrix. Proofs are given for global convergence to the leading eigenvector in the infinite-prompt limit; in the finite-prompt regime the same alignment is asserted to emerge up to sampling effects. The analysis is extended to an in-context spiked-Wishart setting in which attention recovers the underlying signal direction.

Significance. If the derivations are correct, the work supplies an explicit, parameter-free link between attention training and PCA, which is a notable contribution to the theoretical understanding of representation learning in transformers. The infinite-prompt proofs and the spiked-Wishart extension are concrete strengths. The result is of clear interest to the math.OC community studying implicit biases of attention objectives.

major comments (1)
  1. [Finite-prompt analysis] Finite-prompt analysis (abstract and § on finite-prompt behavior): the claim that alignment 'emerges up to sampling effects' is load-bearing for the central assertion of a 'direct and explicit connection' in practical regimes, yet no explicit deviation bounds, variance estimates, concentration inequalities, or high-probability guarantees are supplied on the distance between learned weights and eigenvectors when the number of prompts is small relative to dimension. Without these quantitative controls the finite-sample statement remains unverified and weakens the overall claim.
minor comments (2)
  1. [Methods / Loss definition] The loss function and its explicit form should be stated in the main text rather than deferred to an appendix, to allow readers to verify the unsupervised objective directly.
  2. [Notation] Notation for the attention parameters (e.g., the distinction between query/key/value matrices and the learned weight vector) is occasionally ambiguous; a single consolidated table of symbols would improve readability.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their careful reading, positive assessment of the infinite-prompt proofs and spiked-Wishart extension, and for identifying an important point regarding the finite-prompt regime. We address the major comment below and describe the revisions we will undertake.

read point-by-point responses
  1. Referee: [Finite-prompt analysis] Finite-prompt analysis (abstract and § on finite-prompt behavior): the claim that alignment 'emerges up to sampling effects' is load-bearing for the central assertion of a 'direct and explicit connection' in practical regimes, yet no explicit deviation bounds, variance estimates, concentration inequalities, or high-probability guarantees are supplied on the distance between learned weights and eigenvectors when the number of prompts is small relative to dimension. Without these quantitative controls the finite-sample statement remains unverified and weakens the overall claim.

    Authors: We agree that the finite-prompt claims would be strengthened by explicit quantitative controls. The current manuscript shows alignment in the finite-prompt regime by analyzing the population objective (which favors the leading eigenvector) together with numerical experiments that illustrate close agreement for moderate prompt counts, with observed deviations attributed to sampling. However, we did not supply rigorous deviation bounds, variance estimates, or high-probability guarantees. In the revised version we will add a new subsection that derives such controls via matrix concentration inequalities applied to the empirical covariance matrix formed from the prompts. Specifically, we will prove that the distance between the learned weights and the true principal eigenvector is O(sqrt(d/N)) with high probability (via a matrix Bernstein bound), where d is dimension and N the number of prompts, together with explicit variance estimates. We will also update the abstract and the finite-prompt section to state the result with these rates. These additions will make the finite-sample connection fully rigorous while preserving the existing analysis. revision: yes

Circularity Check

0 steps flagged

Derivation from explicit unsupervised loss on Gaussian data to eigenvector alignment is self-contained

full rationale

The paper starts from a defined unsupervised training objective on Gaussian (or spiked Wishart) inputs and derives that attention parameters converge to principal eigenvectors in the infinite-prompt limit, with finite-prompt behavior following up to sampling. This is a direct mathematical analysis of the loss landscape and optimization dynamics rather than any reduction of a claimed prediction back to a fitted parameter or self-citation chain. No load-bearing step equates the target alignment to the input by construction; the result is obtained by analyzing the given objective. The finite-prompt caveat is a limitation on quantification but does not create circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the modeling choice that inputs are i.i.d. Gaussian (or spiked Wishart) and on the specific unsupervised training objective for the attention layer; no free parameters are fitted to data in the reported results, and no new entities are postulated.

axioms (2)
  • domain assumption Input data is drawn from a multivariate Gaussian distribution (or spiked Wishart covariance).
    Invoked throughout the analysis to obtain closed-form covariance and to prove eigenvector alignment.
  • domain assumption The attention layer is trained with an unsupervised objective whose stationary points coincide with principal directions.
    This modeling choice is required for the claimed alignment to emerge from gradient descent.

pith-pipeline@v0.9.0 · 5656 in / 1369 out tokens · 40780 ms · 2026-05-20T09:03:08.442160+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages

  1. [1]

    Shivam Garg, Dimitris Tsipras, Percy Liang, and Gregory Valiant

    URL https://arxiv.org/abs/2408.01367. Shivam Garg, Dimitris Tsipras, Percy Liang, and Gregory Valiant. What can transformers learn in-context? a case study of simple function classes, 2023. URLhttps://arxiv.org/abs/2208.01066. Gene H. Golub and Charles F. Van Loan.Matrix Computations. Johns Hopkins University Press, Baltimore, MD, USA, 3rd edition, 1996. ...

  2. [2]

    E [ (X⊤ 1 µ)2] =µT Σµ

  3. [3]

    E [⟨X1,X 2⟩⟨X1,µ⟩⟨X2,µ⟩] =µT Σ 2µ

  4. [4]

    E [ ⟨X1,X 2⟩⟨X1,µ⟩3⟨X2,µ⟩ ] = 3(µT Σµ)(µT Σ 2µ)

  5. [5]

    E [ ⟨X1,µ⟩2∥X1∥2 ] = tr(Σ)(µT Σµ) + 2µT Σ 2µ

  6. [6]

    E [ ⟨X1,µ⟩4∥X1∥2 ] = 3tr(Σ)(µT Σµ)2 + 12(µT Σµ)(µT Σ 2µ). Proof. All identities follow from linearity of expectation, independence ofX1 and X2, and Isserlis’ theorem Isserlis [1918] for centered Gaussian vectors

  7. [7]

    Since∥X1∥2 = ∑d i=1X2 1,i and E[X2 1,i] = Σii, E[∥X1∥2] = d∑ i=1 Σii = tr(Σ)

  8. [8]

    Expanding (X⊤ 1 µ)2 and using E[X1,iX1,j] = Σij gives E[(X⊤ 1 µ)2] = ∑ i,j µiµjΣij =µ⊤Σµ

  9. [9]

    Therefore, E[⟨X1,X 2⟩⟨X1,µ⟩⟨X2,µ⟩] = E[(X⊤ 1 µ)(X⊤ 1 Σµ)] =µ⊤Σ 2µ

    Conditioning onX1 and using independence, E[⟨X1,X 2⟩⟨X2,µ⟩|X1] =X⊤ 1 E[X2X⊤ 2 ]µ=X⊤ 1 Σµ. Therefore, E[⟨X1,X 2⟩⟨X1,µ⟩⟨X2,µ⟩] = E[(X⊤ 1 µ)(X⊤ 1 Σµ)] =µ⊤Σ 2µ

  10. [10]

    For a centered Gaussian vectorX and vectorsµ0,µ1, Isserlis’ theorem yields E[(µ⊤ 0X)3(µ⊤ 1X)] = 3(µ⊤ 0 Σµ0)(µ⊤ 0 Σµ1)

    Conditioning again onX1, E[⟨X1,X 2⟩⟨X1,µ⟩3⟨X2,µ⟩] = E[(X⊤ 1 µ)3(µ⊤ΣX1)]. For a centered Gaussian vectorX and vectorsµ0,µ1, Isserlis’ theorem yields E[(µ⊤ 0X)3(µ⊤ 1X)] = 3(µ⊤ 0 Σµ0)(µ⊤ 0 Σµ1). Applying this witha =µ0 and µ1 = Σµgives the claim. 24

  11. [11]

    Expanding ⟨X1,µ⟩2∥X1∥2 = ∑ i,j,k µiµjX1,iX1,jX2 1,k and applying Isserlis’ theorem to the fourth-order moment yields two types of pairings, leading to E[⟨X1,µ⟩2∥X1∥2] = tr(Σ)(µ⊤Σµ) + 2µ⊤Σ 2µ

  12. [12]

    Similarly, ⟨X1,µ⟩4∥X1∥2 = ∑ i,j,ℓ,m,k µiµjµℓµmX1,iX1,jX1,ℓX1,mX2 1,k. Applying Isserlis’ theorem to the sixth-order moment, pairings where the two copies ofk are paired together contribute 3tr(Σ)(µ⊤Σµ)2, while the remaining12 pairings contribute 12(µ⊤Σµ)(µ⊤Σ 2µ). Summing both terms gives the result. C.2 Optimization preliminaries In this subsection, we ga...

  13. [13]

    fn has a unique critical pointx⋆ n in B(x⋆,r ),

  14. [14]

    x⋆ n is nondegenerate and moreoverx⋆ n→x⋆as n→∞,

  15. [15]

    Letρ>0 such that Λ⋃ k=1 B(x(k),rk)⊂B(0,ρ)

    If the set of critical points off is finite, i.e., crit(f) ={x(1),...,x (Λ)}, and eachx(k), k∈{1,..., Λ}, is nondegenerate, letrk > 0 be such thatB(x(k),rk) are pairwise disjoint. Letρ>0 such that Λ⋃ k=1 B(x(k),rk)⊂B(0,ρ). Then forn large enough, crit(fn)∩B(0,ρ) ={x(1) n ,...,x (Λ) n }. Moreover, if for somek∈{1,..., Λ}, x(k) is a strict saddle (resp. a s...

  16. [16]

    Since∇2fn(x⋆ n)→∇2f(x⋆) and the limit is invertible,x⋆ n is nondegenerate forn large. Let Γ := 1 ∥[∇2f(x⋆)]−1∥, by continuity of∇2f, shrinkingr if necessary, we have for everyx∈B(x⋆,r ), ∥∇2f(x)−∇2f(x⋆)∥≤Γ 2, 28 and since∥∇2f(x⋆)v∥≥Γ∥v∥, using Taylor we get ∇f(x) =∇2f(x⋆)(x−x⋆) + ∫ 1 0 (∇2f(x⋆+t(x−x⋆))−∇2f(x⋆))(x−x⋆)dt. We take norms and after bounding we...

  17. [17]

    Since the set is finite, define δ:= 1 2 min k̸=ℓ ∥x(k)−x(ℓ)∥> 0

    Assume now that crit(f) ={x(1),...,x (Λ)}, ∇2f(x(k)) is invertible for allk. Since the set is finite, define δ:= 1 2 min k̸=ℓ ∥x(k)−x(ℓ)∥> 0. Applying items (1)–(2) to eachx(k), there existrk∈(0,δ) and nk∈N such that for alln≥nk: (a) fn has a unique critical pointx(k) n in B(x(k),rk), (b) x(k) n →x(k), (c) x(k) n is nondegenerate. Let n0 := max k∈{1,...,Λ...

  18. [18]

    E[µ⊤Σ 2µ] =n(n + 1)µ⊤V 2µ+n tr(V )µ⊤Vµ

  19. [19]

    E[(µ⊤Σ 2µ)(µ⊤Σ 2µ)] =n(n + 2)[(n + 3)(µ⊤Vµ)(µ⊤V 2µ) + (µ⊤Vµ)2 tr(V )]. Proof. We use the standard Gaussian representation of the Wishart distribution: Σ = n∑ r=1 xrx⊤ r, x r iid ∼N(0,V )

  20. [20]

    Since tr(xrx⊤ r ) =∥xr∥2 and E[∥xr∥2] = trV, linearity of expectation gives E[tr Σ] = n∑ r=1 E[∥xr∥2] =n trV

  21. [21]

    Splitting the sum into the casesi =j and i̸=j: (i) Diagonal terms.Fori =j, E [ (µ⊤x)2(x⊤x) ] =µ⊤VµtrV + 2µ⊤V 2µ, by Isserlis’ formula

    Write µ⊤Σ 2µ= n∑ i,j=1 (µ⊤αi)(α⊤ i xj)(µ⊤xj). Splitting the sum into the casesi =j and i̸=j: (i) Diagonal terms.Fori =j, E [ (µ⊤x)2(x⊤x) ] =µ⊤VµtrV + 2µ⊤V 2µ, by Isserlis’ formula. (ii) Off-diagonal terms.Fori̸=j, independence yields E[(µ⊤αi)(α⊤ i xj)(µ⊤xj)] =µ⊤V 2µ. Counting terms, E[µ⊤Σ 2µ] =n ( µ⊤VµtrV + 2µ⊤V 2µ ) +n(n−1)µ⊤V 2µ, which simplifies to E[µ...

  22. [22]

    Then µ⊤Σµ= ∑ k s2 k, µ ⊤Σ 2µ= ∑ i,j si(α⊤ i xj)sj

    Let sr =µ⊤xr. Then µ⊤Σµ= ∑ k s2 k, µ ⊤Σ 2µ= ∑ i,j si(α⊤ i xj)sj. Hence E[(µ⊤Σµ)(µ⊤Σ 2µ)] = ∑ i,j,k E [ s2 ksi(α⊤ i xj)sj ] . The expectation depends on coincidences among the indices(i,j,k ). Using Isserlis’ theorem and independence, one obtains: - i,j,k all distinct: contribution(µ⊤Vµ)(µ⊤V 2µ). - i =j̸=k: contribution (µ⊤Vµ) ( µ⊤VµtrV + 2µ⊤V 2µ ) . - i =...

  23. [23]

    Orthogonal: α= 0, w̸= 0, withA(r,0) = 0

  24. [24]

    Aligned: w = 0, α̸= 0, so thatµ=αv, with A(r,α) +B(r,α) = 0

  25. [25]

    Off-axis: α̸= 0, w̸= 0, with A(r,α) = 0 and B(r,α) = 0. Proof. A stationary point satisfies A(r,α)µ+B(r,α)αv= 0. Writingµ=αv+w with w⊥v and projecting ontospan(v) and its orthogonal complement yields A(r,α)w = 0, α ( A(r,α) +B(r,α) ) = 0. The conclusion follows by considering whetherα= 0 or not and whetherw = 0 or not. Proposition 19(Characterization of f...

  26. [26]

    µ= 0 is a local maximum: the Hessian has only negative eigenvalues along all nonzero directions

  27. [27]

    Any admissible orthogonal point is a strict saddle: the Hessian has exactly one negative eigenvalue along v, one positive eigenvalue along the vector’s own direction andd−2 zero eigenvalues

  28. [28]

    Moreover, the only admissible aligned points are µ⋆=±α⋆v, forα⋆> 0 defined in(22)

    Any admissible aligned point is a strict local minimum. Moreover, the only admissible aligned points are µ⋆=±α⋆v, forα⋆> 0 defined in(22)

  29. [29]

    SinceRICL ∞is coercive, these two points are the global minimizers of the function

    There are no admissible off-axis solutions. SinceRICL ∞is coercive, these two points are the global minimizers of the function. Finally, for almost every initializationµ0∈Rd, the gradient flow ofRICL ∞converges to one of the two global minimizers±α⋆v. 40 Proof. We recall by Lemma 13 that∇RICL ∞(µ) = 2(A(r,α)µ+B(r,α)αv), where A(r,α) =a1r2 +a2α2−a3, B (r,α...

  30. [30]

    This point is a strict local maximum

    The pointµ⋆ L,0 satisfiesµ⋆ L,0→0 as L→∞. This point is a strict local maximum

  31. [31]

    These points are local minimizers

    The points±µ⋆ L,∥satisfy±µ⋆ L,∥→±α⋆v as L→∞, withα⋆according to(22). These points are local minimizers

  32. [32]

    For each orthogonal critical pointµ(j) ⊥, j = 1,..., 2(d−1) ofRICL ∞, there exists a neighborhoodUj such that every critical point ofRICL L in Uj is a strict saddle. Proof. The first two items 1 and 2 follow directly from Proposition 5. It remains to prove 3. Let crit(RICL ∞) ={0,±α⋆v,µ(1) ⊥,...,µ(2(d−1)) ⊥ }, where µ(j) ⊥denote the orthogonal critical po...