Specialization of softmax attention heads: insights from the high-dimensional single-location model

L. Zdeborová; M. Sagitova; O. Duranthon

arXiv: 2603.03993 · v2 · pith:MAAZ4PBAnew · submitted 2026-03-04 · cs.LG · cond-mat.dis-nn

classification: cs.LG · cond-mat.dis-nn

keywords: attention · heads · specialization · model · multi-head · part · performance · phase

Multi-head attention enables transformer models to represent multiple attention patterns simultaneously. Empirically, head specialization emerges in distinct stages during training, while many heads remain redundant and learn similar representations. We propose a theoretical model capturing this phenomenon, based on the multi-index and single-location regression frameworks. In the first part, we analyze the training dynamics of multi-head softmax attention under SGD, revealing an initial unspecialized phase followed by a multi-stage specialization phase in which different heads sequentially align with latent signal directions. In the second part, we study the impact of attention activation functions on performance. We introduce the Bayes-softmax attention, which achieves optimal prediction performance in this setting.
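As a rough illustration of the objects the abstract refers to, the sketch below implements a generic multi-head softmax attention readout over a sequence of tokens, together with a cosine-overlap matrix between heads that can be used to see whether heads are redundant (overlap near 1) or specialized (overlap near 0). This is not the paper's model: the function names (`multi_head_attention`, `head_overlap`), the one-vector-per-head parametrization `K`, `V`, the `1/sqrt(d)` scaling, and the random data are all assumptions made for illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, K, V):
    """Hypothetical multi-head softmax attention with a scalar output.

    X: (L, d) sequence of L tokens in dimension d.
    K: (H, d) one key direction per head (assumed parametrization).
    V: (H, d) one value direction per head (assumed parametrization).
    Returns the scalar prediction and the (H, L) attention weights.
    """
    d = X.shape[1]
    scores = K @ X.T / np.sqrt(d)    # (H, L) per-head score of each token
    attn = softmax(scores, axis=1)   # each head's weights sum to 1 over tokens
    values = V @ X.T / np.sqrt(d)    # (H, L) per-head scalar value of each token
    y_hat = float((attn * values).sum())  # sum over tokens and heads
    return y_hat, attn

def head_overlap(K):
    """Cosine similarity between the heads' key directions, shape (H, H)."""
    Kn = K / np.linalg.norm(K, axis=1, keepdims=True)
    return Kn @ Kn.T

# Illustrative random instance (sizes are arbitrary assumptions).
rng = np.random.default_rng(0)
L_tokens, d, H = 8, 64, 3
X = rng.standard_normal((L_tokens, d))
K = rng.standard_normal((H, d))
V = rng.standard_normal((H, d))

y_hat, attn = multi_head_attention(X, K, V)
overlap = head_overlap(K)
```

Tracking a matrix like `overlap`, or the overlap of each head with a set of latent signal directions, over the course of training is one simple way to visualize the unspecialized and specialization phases the abstract describes.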

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Weight Decay Regimes in Grokking Transformers: Cheap Online Diagnostics
cs.LG · 2026-05 · conditional · novelty 6.0

Weight decay controls distinct learning regimes in grokking transformers on modular arithmetic, tracked by new cheap attention-based diagnostics with empirical critical value and exponent fits.