Quantitative Clustering in Mean-Field Transformer Models

· 2025 · cs.LG · arXiv 2504.14697

7 Pith papers cite this work. Polarity classification is still indexing.

7 Pith papers citing it

open full Pith review browse 7 citing papers arXiv PDF

abstract

The evolution of tokens through deep transformer models can be modeled as an interacting particle system that has been shown to exhibit an asymptotic clustering behavior akin to the synchronization phenomenon in Kuramoto models. In this work, we investigate the long-time clustering of mean-field transformer models. More precisely, under suitable assumptions on the transformer model parameters, we establish that any suitably regular mean-field initialization synchronizes exponentially fast to a Dirac point mass, with explicit quantitative convergence rates.

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

Kinetic theory for Transformers and the lost-in-the-middle phenomenon

math.AP · 2026-05-09 · conditional · novelty 8.0

A mean-field kinetic theory derivation produces a closed-form U-shaped token retrieval profile that explains the lost-in-the-middle phenomenon in Transformers.

Stochastic Scaling Limits and Synchronization by Noise in Deep Transformer Models

math.PR · 2026-04-29 · unverdicted · novelty 7.0

Transformers converge pathwise to a stochastic particle system and SPDE in the scaling limit, exhibiting synchronization by noise and exponential energy dissipation when common noise is coercive relative to self-attention drift.

Spectral Selection in Symmetric Self-Attention Dynamics

math.DS · 2026-04-28 · unverdicted · novelty 7.0

Symmetric self-attention dynamics select the dominant eigendirection of V, producing homogeneous alignment when one positive eigenvalue dominates or sign-split polarization when V is negative definite.

Propagation of Chaos in Contextual Flow Maps

cs.LG · 2026-05-16 · unverdicted · novelty 6.0

Derives forward and backward propagation-of-chaos bounds for finite vs. infinite-context transformers modeled as contextual flow maps, achieving Wasserstein rate n^{-1/d} generally and n^{-1/2} for transformer-like cases.

Multi-Headed Transformer Architectures as Time-dependent Wasserstein Gradient Flows

cs.LG · 2026-05-15 · unverdicted · novelty 6.0

Models multi-head transformer data flow as time-dependent Wasserstein gradient flows of an attention-capturing interaction energy, with proofs on omega-limit stationary points and stability under weight and input perturbations.

Quantifying Concentration Phenomena of Mean-Field Transformers in the Low-Temperature Regime

math.AP · 2026-05-11 · unverdicted · novelty 6.0

In the low-temperature regime, the token distribution in mean-field transformers concentrates onto the push-forward under a key-query-value projection with Wasserstein distance scaling as √(log(β+1)/β) exp(Ct) + exp(-ct).

Gradient Flow Structure and Quantitative Dynamics of Multi-Head Self-Attention

cs.LG · 2026-05-05 · unverdicted · novelty 6.0 · 2 refs

Multi-head self-attention dynamics admit a non-decreasing energy functional under suitable score-matrix conditions, with closed-form clustering thresholds and monotonic entropy production in simplified regimes.

citing papers explorer

Showing 7 of 7 citing papers.

Kinetic theory for Transformers and the lost-in-the-middle phenomenon math.AP · 2026-05-09 · conditional · none · ref 9 · internal anchor
A mean-field kinetic theory derivation produces a closed-form U-shaped token retrieval profile that explains the lost-in-the-middle phenomenon in Transformers.
Stochastic Scaling Limits and Synchronization by Noise in Deep Transformer Models math.PR · 2026-04-29 · unverdicted · none · ref 12 · internal anchor
Transformers converge pathwise to a stochastic particle system and SPDE in the scaling limit, exhibiting synchronization by noise and exponential energy dissipation when common noise is coercive relative to self-attention drift.
Spectral Selection in Symmetric Self-Attention Dynamics math.DS · 2026-04-28 · unverdicted · none · ref 8 · internal anchor
Symmetric self-attention dynamics select the dominant eigendirection of V, producing homogeneous alignment when one positive eigenvalue dominates or sign-split polarization when V is negative definite.
Propagation of Chaos in Contextual Flow Maps cs.LG · 2026-05-16 · unverdicted · none · ref 10 · internal anchor
Derives forward and backward propagation-of-chaos bounds for finite vs. infinite-context transformers modeled as contextual flow maps, achieving Wasserstein rate n^{-1/d} generally and n^{-1/2} for transformer-like cases.
Multi-Headed Transformer Architectures as Time-dependent Wasserstein Gradient Flows cs.LG · 2026-05-15 · unverdicted · none · ref 10 · internal anchor
Models multi-head transformer data flow as time-dependent Wasserstein gradient flows of an attention-capturing interaction energy, with proofs on omega-limit stationary points and stability under weight and input perturbations.
Quantifying Concentration Phenomena of Mean-Field Transformers in the Low-Temperature Regime math.AP · 2026-05-11 · unverdicted · none · ref 19 · internal anchor
In the low-temperature regime, the token distribution in mean-field transformers concentrates onto the push-forward under a key-query-value projection with Wasserstein distance scaling as √(log(β+1)/β) exp(Ct) + exp(-ct).
Gradient Flow Structure and Quantitative Dynamics of Multi-Head Self-Attention cs.LG · 2026-05-05 · unverdicted · none · ref 4 · 2 links · internal anchor
Multi-head self-attention dynamics admit a non-decreasing energy functional under suitable score-matrix conditions, with closed-form clustering thresholds and monotonic entropy production in simplified regimes.

Quantitative Clustering in Mean-Field Transformer Models

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer