Single-Head Attention in High Dimensions: A Theory of Generalization, Weights Spectra, and Scaling Laws

Emanuele Troiani; Fabrizio Boncoraglio; Florent Krzakala; Lenka Zdeborová; Vittorio Erba; Yizhou Xu

arxiv: 2509.24914 · v2 · pith:XZDXCK6Wnew · submitted 2025-09-29 · 📊 stat.ML · cond-mat.dis-nn · cs.IT · cs.LG · math.IT

Fabrizio Boncoraglio, Vittorio Erba, Emanuele Troiani, Yizhou Xu, Florent Krzakala, Lenka Zdeborová

keywords: spectral, theory, trained, attention, generalization, high-dimensional, including, isolated

Trained attention layers exhibit striking and reproducible spectral structure of the weights, including low-rank collapse, bulk deformation, and isolated spectral outliers, yet the origin of these phenomena and their implications for generalization remain poorly understood. We study empirical risk minimization in a single-head tied-attention layer trained on synthetic high-dimensional sequence tasks generated from the attention-indexed model. Using tools from random matrix theory, spin-glass theory, and approximate message passing, we obtain an exact high-dimensional characterization of training and test error, interpolation and recovery thresholds, and the spectrum of the key and query matrices. Our theory predicts the full singular-value distribution of the trained query-key map, including low-rank structure and isolated spectral outliers, in qualitative agreement with observations in more realistic transformers. Finally, for targets with power-law spectra, we show that learning proceeds through sequential spectral recovery, leading to the emergence of power-law scaling laws.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Deep Learning as Neural Low-Degree Filtering: A Spectral Theory of Hierarchical Feature Learning
cs.LG · 2026-05 · unverdicted · novelty 8.0

Neural LoFi models deep learning as layer-wise spectral filtering that selects maximal low-degree correlations, yielding a tractable surrogate for hierarchical representation learning beyond the lazy regime.

How Does Attention Help? Insights from Random Matrices on Signal Recovery from Sequence Models
stat.ML · 2026-05 · conditional · novelty 7.0

Attention pooling produces a free-multiplicative-convolution bulk spectrum and two phase transitions for signal recovery; optimal weights are the top eigenvector of the positional correlation matrix R.