A Mean-Field Analysis of Multi-Head Self-Attention under Cross-Entropy Training

· 2026 · math.OC · arXiv 2606.10469

1 Pith paper cite this work. Polarity classification is still indexing.

1 Pith paper citing it

open full Pith review browse 1 citing papers arXiv PDF

abstract

This paper develops a mean-field theory for a simplified single-layer causal multi-head self-attention model trained by cross-entropy minimization. Each attention head is treated as a particle in parameter space, and the empirical law of the heads is used as the large-head state variable. In the infinite-head limit, the averaged attention logits define a risk functional on probability measures, whose first variation generates a nonlinear Wasserstein gradient-flow equation. Unlike classical mean-field analyses of shallow networks that often focus on square-loss regression, the present model contains the softmax residual from the cross-entropy objective and the query-key-value structure of masked self-attention. We prove a static finite-head approximation bound for the optimal risk, characterize global minimizers through a variational support condition, and establish a quantitative finite-time propagation-of-chaos estimate comparing finite-head stochastic gradient descent with the limiting PDE. We then study the long-time behavior of the PDE: energy dissipation, convergence to the stationary set under compactness, convergence to a single stationary measure under topological or Kurdyka--{\L}ojasiewicz assumptions, and explicit convergence rates under gradient-domination conditions. Finally, we prove local exponential stability under a Wasserstein strong-monotonicity condition and give verifiable stability and instability criteria for Dirac stationary measures. The results provide a rigorous baseline mean-field framework for attention-head training and clarify the additional compactness, landscape, and curvature assumptions needed to pass from stationarity to convergence and stability.

representative citing papers

A First-Order Mean Field Control Analysis of Transformer Layers under Cross-Entropy Training

math.OC · 2026-06-22 · unverdicted · novelty 7.0

Transformer residual layers are approximated as an explicit Euler scheme for a controlled hidden-state flow whose mean-field limit is a first-order transport control problem with Pontryagin terminal condition given by the softmax residual.

citing papers explorer

Showing 1 of 1 citing paper.

A First-Order Mean Field Control Analysis of Transformer Layers under Cross-Entropy Training math.OC · 2026-06-22 · unverdicted · none · ref 10 · internal anchor
Transformer residual layers are approximated as an explicit Euler scheme for a controlled hidden-state flow whose mean-field limit is a first-order transport control problem with Pontryagin terminal condition given by the softmax residual.

A Mean-Field Analysis of Multi-Head Self-Attention under Cross-Entropy Training

fields

years

verdicts

representative citing papers

citing papers explorer