arxiv: 2407.01602 · v2 · pith:ZR2PAAXVnew · submitted 2024-06-26 · cs.CL · cs.LG · math.DS · stat.ML

Clustering in pure-attention hardmax transformers and its role in sentiment analysis

Pith reviewed 2026-05-23 23:38 UTC · model grok-4.3

classification: cs.CL · cs.LG · math.DS · stat.ML
keywords: clustering · hardmax self-attention · transformers · dynamical systems · leader points · sentiment analysis · hyperplane separation

The pith

In the infinite-layer limit, hardmax self-attention transformers drive their inputs into clusters around leader points.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Transformers built from hardmax self-attention and normalization sublayers act as a discrete dynamical system on token embeddings in Euclidean space. The attention step separates points by hyperplanes, driving each token toward a small set of special leader points over repeated layers. The result is an asymptotic equilibrium in which inputs form clusters whose centers are exactly those leaders. The authors apply the same mechanism to sentiment analysis by letting leader words collect clusters of contextually related but semantically weaker tokens. This supplies a fully interpretable model whose clustering behavior directly explains how context is aggregated.

Core claim

Viewing transformers with hardmax self-attention and normalization sublayers as discrete-time dynamical systems, and invoking the geometric hyperplane-separation property of hardmax attention, the authors show that the transformer inputs asymptotically converge to a clustered equilibrium determined by special points called leaders.

What carries the argument

The hyperplane-separation property of hardmax self-attention: it determines which tokens each token attends to and thereby steers the discrete dynamical system toward leader-determined clusters.
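The geometry can be written out from the quantities named in the Figure 1 caption. The block below is a sketch rather than the paper's own statement: the set-builder form of the attention set is an assumption inferred from the caption's description of the "largest orthogonal projection", and the index symbol \ell is introduced here for illustration.

```latex
% Attention set of token i and the associated closed half-space
% (set-builder form of C_i assumed from the Figure 1 caption).
C_i(Z) = \Bigl\{ j : \langle A z_i, z_j \rangle
          = \max_{\ell} \langle A z_i, z_\ell \rangle \Bigr\},
\qquad
H_i = \bigl\{ z : \langle A z_i, z - z_i \rangle \ge 0 \bigr\}.

% Token i is itself a candidate in the maximum, so for every j in C_i(Z):
\langle A z_i, z_j \rangle \ge \langle A z_i, z_i \rangle
\;\Longrightarrow\;
\langle A z_i, z_j - z_i \rangle \ge 0
\;\Longrightarrow\;
z_j \in H_i .
```

Read this way, the hyperplane through z_i with normal A z_i separates the tokens that can attract z_i from those that cannot, which matches the blue half-space described in the Figure 1 caption.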

If this is right

  • Inputs converge to a clustered equilibrium whose centers are the leader points.
  • Context in language tasks is captured by routing semantically weaker tokens into clusters around leader tokens.
  • The same dynamics yields a fully interpretable transformer for sentiment-analysis problems; its parameters are trained (the Figure 8 caption reports a training loss), while the clustering mechanism itself stays explicit.
  • Remaining mathematical challenges must still be resolved before the clustering picture applies to trained, multi-head, softmax-based transformers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Leader points may correspond to the tokens that carry the strongest semantic signal in a given sequence.
  • The same geometric mechanism could be used to design new, explicitly clustered attention layers that remain interpretable at arbitrary depth.
  • If real transformers approximate hardmax behavior in deep layers, their attention maps should exhibit similar leader-driven clustering on natural-language data.

Load-bearing premise

The analysis assumes a pure-attention hardmax self-attention mechanism with normalization sublayers whose geometric hyperplane-separation property governs the infinite-layer limit.

What would settle it

A concrete numerical iteration of the hardmax-plus-normalization map on a finite point set that fails to produce clusters whose centers coincide with the leaders predicted by the hyperplane geometry.
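A numerical check of this kind could be sketched as follows. The update rule is an assumption, since equation (1.1) is not reproduced on this page: each token is averaged with the mean of the tokens in its attention set, z_i ← (z_i + α · mean of followed tokens) / (1 + α). The values α = 0.5 and A = I come from the Figure 2 caption; the function names, the tie tolerance `tol`, and the random initial tokens are hypothetical choices for illustration.

```python
import numpy as np


def attention_sets(Z, A, tol=1e-9):
    """mask[i, j] is True when token j attains max_l <A z_i, z_l> (hardmax)."""
    scores = Z @ A.T @ Z.T  # scores[i, j] = <A z_i, z_j>; rows of Z are tokens
    return scores >= scores.max(axis=1, keepdims=True) - tol


def step(Z, A, alpha):
    """One layer of the ASSUMED update: mix each token with the mean it follows."""
    mask = attention_sets(Z, A).astype(float)
    followed_mean = (mask @ Z) / mask.sum(axis=1, keepdims=True)
    return (Z + alpha * followed_mean) / (1.0 + alpha)


def leaders(Z, A):
    """Indices i whose attention set is exactly {i} (the leader condition)."""
    mask = attention_sets(Z, A)
    return [i for i in range(len(Z)) if mask[i, i] and mask[i].sum() == 1]


def simulate(Z0, A, alpha, n_layers):
    """Iterate the map and record every token that ever met the leader condition."""
    Z, seen = Z0.copy(), set()
    for _ in range(n_layers):
        seen.update(leaders(Z, A))
        Z = step(Z, A, alpha)
    return Z, sorted(seen)


rng = np.random.default_rng(0)
Z0 = rng.normal(size=(8, 2))   # 8 hypothetical tokens in the plane
A, alpha = np.eye(2), 0.5      # values quoted in the Figure 2 caption
ZK, ever_leaders = simulate(Z0, A, alpha, n_layers=200)

# Distance from each final token to the nearest token that was ever a leader.
gaps = np.linalg.norm(ZK[:, None, :] - ZK[ever_leaders][None, :, :], axis=2)
print("tokens that met the leader condition:", ever_leaders)
print("largest distance to a leader:", gaps.min(axis=1).max())
```

Leaders are recorded during the run rather than only at the end because, in floating point, followers eventually coincide with their leader and the strict condition C_i = {i} stops holding. A large final distance would not by itself refute the claim under this sketch; it would only flag a configuration worth comparing against the paper's exact equations.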

Figures

Figures reproduced from arXiv: 2407.01602 by Albert Alcalde, Enrique Zuazua, Giovanni Fantuzzi.

Figure 1: Geometric interpretation of (1.1b) for i = 1 with (a) A = I and (b) $A = \begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix}$. In (a), tokens z_2 and z_3 have the largest orthogonal projection on the direction of A z_1 = z_1, so C_1(Z) = {2, 3}. In (b), token z_4 has the largest projection on the direction of A z_1, so C_1(Z) = {4}. In both cases, tokens attracting z_1 can only lie in the closed half-space H_1 = {z : ⟨A z_1, z − z_1⟩ ≥ 0} (blue shading).
Figure 2: Simulations of (1.1) with α = 0.5, A = I, and four different initial token values. In each panel, stars denote tokens z_i satisfying C_i(Z^k) = {i} at layer k ∈ ℕ, while circles denote all other tokens. Colors indicate which tokens are being followed. Tokens painted in two halves follow two tokens. Tokens whose interior and edge colors are different, instead, follow tokens of their interior color and are fo…
Figure 3: Schematic illustrations of a deep neural network with normalization and feed-forward sublayers (top), a full transformer with self-attention, normalization, and feed-forward sublayers (middle), and a pure-attention transformer with only self-attention and normalization sublayers (bottom). Each model takes a matrix Z^0 ∈ ℝ^{n×d} as its input and outputs a matrix Z^K ∈ ℝ^{n×d} after being processed by K transfor…
Figure 4: Sketch of the δ-neighborhood of the attracting set S. Combining the last two inequalities, we obtain (4.17)
Figure 5: Illustration of the geometric intuition behind Lemma 4.5 for S′ = {s_1, s_2}. If δ > 0 is small enough, then for all x ∈ B_δ(s_0) the hyperplane with normal direction x passing through x (in green) separates s_0 from the neighborhoods B_δ(s_1) and B_δ(s_2). Proof. Since S is finite by Lemma 4.1, there exists δ_0 > 0 such that (i) is satisfied for all δ ≤ δ_0. We now find δ ≤ δ_0 such that (ii) holds following a const…
Figure 6: Illustration of the constructive argument in the proof of Proposition 4.6. Token z_i remains in a neighborhood of s_1 ∈ S_1. Token z_a falls in case (a) of the analysis, so it remains in a neighborhood of s_2 ∈ S_2, while token z_b falls in case (b), jumping from a neighborhood of s′_2 ∈ S_2 to the neighborhood of s_1, where it remains for all future times. for all k ≥ k_2. This allows only two possibilities: (a) z…
Figure 7: Token values in Example 5.2 for (a) initial time k = 0 and (b) time k = 1. After one iteration, the token z_2 has moved enough to be below the hyperplane separating the tokens influencing token z_1 (in green), which causes token z_1 to satisfy the leader condition C_1(Z^1) = {1} at time k = 1. Additionally, z_i^k satisfies by definition ⟨z_i^k, z_r^k⟩ < ⟨z_i^k, z_i^k⟩ for all r ≠ i (5.2). We have just prov…
Figure 8: Loss on the training set for encoder dimensions d ∈ {2, 4}, calculated at every epoch with the regularized softmax model used for training, and every 10 epochs with our hardmax model. We then use the gradient-based algorithm Adam [20] to minimize the average binary cross-entropy loss $\frac{1}{N}\sum_{i=1}^{N} -\left(y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)\right)$ (6.1), calculated using a training set of N = 35 000 reviews. The remaini…
Figure 9: Evolution of the words of a positive review (top) and a negative review (bottom), as they are processed by the transformer layers. Color coded as
Figure 10: Histogram of the 15 most frequent leaders z_i of correctly classified test reviews and satisfying |H(z_i)| ≥ 2, meaning that they are situated far from the separating hyperplane H
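The average binary cross-entropy loss (6.1) quoted alongside the Figure 8 caption can be made concrete with a short sketch. The function name `binary_cross_entropy`, the clipping constant `eps` (added here only to avoid taking the logarithm of zero), and the toy labels are assumptions for illustration; this is not the authors' training code.

```python
import numpy as np


def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Average of -(y log(yhat) + (1 - y) log(1 - yhat)) over the samples."""
    y_true = np.asarray(y_true, dtype=float)
    # Clipping is an illustrative safeguard, not part of formula (6.1).
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0 - eps)
    per_sample = -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    return float(per_sample.mean())


# Two maximally uncertain predictions: each term equals log 2.
print(binary_cross_entropy([1, 0], [0.5, 0.5]))  # ≈ 0.6931 (= log 2)
```

Here `y_true` plays the role of the sentiment labels y_i and `y_pred` the predicted probabilities ŷ_i.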
Original abstract

Transformers are extremely successful machine learning models whose mathematical properties remain poorly understood. Here, we rigorously characterize the behavior of transformers with hardmax self-attention and normalization sublayers as the number of layers tends to infinity. By viewing such transformers as discrete-time dynamical systems describing the evolution of points in a Euclidean space, and thanks to a geometric interpretation of the self-attention mechanism based on hyperplane separation, we show that the transformer inputs asymptotically converge to a clustered equilibrium determined by special points called leaders. We then leverage this theoretical understanding to solve sentiment analysis problems from language processing using a fully interpretable transformer model, which effectively captures 'context' by clustering meaningless words around leader words carrying the most meaning. Finally, we outline remaining challenges to bridge the gap between the mathematical analysis of transformers and their real-life implementation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript rigorously characterizes the infinite-layer limit of pure-attention hardmax transformers equipped with normalization sublayers by modeling them as discrete-time dynamical systems on Euclidean space. A geometric argument based on hyperplane separation is used to prove that token representations asymptotically converge to a clustered equilibrium whose attractors are special points termed leaders. The derived clustering property is then applied to construct a fully interpretable transformer for sentiment-analysis tasks, in which semantically meaningless tokens cluster around leader tokens that carry the primary meaning; remaining challenges for closing the gap to practical implementations are outlined.

Significance. If the convergence theorem holds, the work supplies a parameter-free dynamical-systems explanation for clustering phenomena in a precisely defined subclass of attention models and demonstrates how the resulting leaders can be exploited for interpretable NLP. The geometric hyperplane-separation technique and the explicit scoping to hardmax-plus-normalization dynamics constitute clear strengths; the sentiment-analysis application supplies a concrete, falsifiable use case.

minor comments (3)
  1. The definition and selection rule for the 'leaders' should be stated explicitly in the introduction or in a dedicated preliminary section rather than introduced only in the convergence theorem statement.
  2. Notation for the normalization sublayers and the precise form of the hardmax operator should be unified across the dynamical-system formulation and the sentiment-analysis experiments.
  3. The discussion of the gap between the infinite-layer analysis and finite practical transformers would benefit from a short paragraph quantifying how many layers are typically required for the clustering to become observable in the reported experiments.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive and constructive review. The recommendation for minor revision is appreciated, and we will make the necessary adjustments in the revised version.

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The central result is a mathematical characterization of the infinite-layer limit for the specific dynamical system of hardmax self-attention plus normalization, obtained via geometric hyperplane separation. No load-bearing step reduces by construction to a fitted parameter, self-definition, or self-citation chain; the clustering equilibrium follows from the stated model equations and geometric property without circular reduction. The sentiment-analysis application is scoped as a downstream illustration, not part of the convergence derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

Analysis rests on viewing the transformer as a dynamical system and on the geometric property of hardmax attention; 'leaders' are introduced as the attractors. Full details unavailable from abstract.

axioms (2)
  • domain assumption: Transformers with hardmax self-attention and normalization can be modeled as discrete-time dynamical systems on Euclidean space.
    Stated in the abstract as the basis for the asymptotic analysis.
  • domain assumption: Self-attention admits a geometric interpretation based on hyperplane separation.
    Invoked to prove convergence to clustered equilibria.
invented entities (1)
  • leaders (no independent evidence)
    purpose: Special points that determine the clustered equilibrium.
    New concept introduced to describe the attractors of the dynamical system.

pith-pipeline@v0.9.0 · 5678 in / 1296 out tokens · 21976 ms · 2026-05-23T23:38:06.569452+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Control, Optimal Transport and Neural Differential Equations in Supervised Learning

    math.NA 2025-03 unverdicted novelty 6.0

    A novel framework approximates unbalanced optimal transport using Neural ODEs via a generalized discrete problem, a Sinkhorn-inspired scheme with proven convergence and error estimates, and derived transport dynamics.

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · cited by 1 Pith paper · 3 internal anchors

  1. [1]

    F. A. Acheampong, H. Nunoo-Mensah, and W. Chen. Transformer models for text-based emotion detection: a review of BERT-based approaches. Artificial Intelligence Review, 54(8):5789–5829, 2021

  2. [2]

    S. Alberti, N. Dern, L. Thesing, and G. Kutyniok. Sumformer: Universal approximation for efficient transformers. In Topological, Algebraic and Geometric Learning Workshops 2023, pages 72–86. PMLR, 2023

  3. [3]

    V. Borovikov. On the intersection of a sequence of simplexes. Uspekhi Matematicheskikh Nauk, 7:179–180, 1952

  4. [4]

    T. Brown et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901, 2020

  5. [5]

    F. Charton, A. Hayat, and G. Lample. Learning advanced mathematical computations from examples. In 9th International Conference on Learning Representations (ICLR 2021), 2021

  6. [6]

    F. Charton, A. Hayat, S. T. McQuade, N. J. Merrill, and B. Piccoli. A deep language model to predict metabolic network equilibria. arXiv:2112.03588 [cs.LG], 2021

  7. [7]

    R. T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, volume 31, 2018

  8. [8]

    A. Dosovitskiy et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021

  9. [9]

    M. Dwarampudi and N. Reddy. Effects of padding on lstms and cnns. arXiv:1903.07288 [cs.LG], 2019

  10. [10]

    W. E. A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics, 5(1):1–11, 2017

  11. [11]

    I. M. Elfadel and J. L. Wyatt Jr. The ‘softmax’ nonlinearity: Derivation using statistical mechanics and useful properties as a multiterminal analog circuit element. Advances in Neural Information Processing Systems, 6, 1993

  12. [12]

    B. Geshkovski and E. Zuazua. Turnpike in optimal control of pdes, resnets, and beyond. Acta Numerica, 31:135–263, 2022

  13. [13]

    B. Geshkovski, C. Letrouit, Y. Polyanskiy, and P. Rigollet. The emergence of clusters in self-attention dynamics. arXiv:2305.05465 [cs.LG], 2023

  14. [14]

    B. Geshkovski, C. Letrouit, Y. Polyanskiy, and P. Rigollet. A mathematical perspective on Transformers. arXiv:2312.10794 [cs.LG], 2023

  15. [15]

    F. Gloeckle, B. Rozière, A. Hayat, and G. Synnaeve. Temperature-scaled large language models for lean proofstep prediction. In 37th Conference on Neural Information Processing Systems (NeurIPS 2023), 2023

  16. [16]

    S. Hayou. On the infinite-depth limit of finite-width neural networks. arXiv:2210.00688 [stat.ML], 2023

  17. [17]

    S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, 11 1997. ISSN 0899-7667

  18. [18]

    S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, volume 37, pages 448–456, 2015

  19. [19]

    J. M. Jumper et al. Highly accurate protein structure prediction with AlphaFold. Nature, 596:583–589, 2021

  20. [20]

    D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv:1412.6980 [cs.LG], 2014

  21. [21]

    Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut. Albert: A lite bert for self-supervised learning of language representations. In International Conference on Learning Representations, 2020

  22. [22]

    Q. Li, T. Lin, and Z. Shen. Deep learning via dynamical systems: An approximation perspective. Journal of the European Mathematical Society, 25(5):1671–1709, 2022

  23. [23]

    Y. Lu et al. Understanding and improving transformer from a multi-particle dynamic system point of view. In ICLR 2020 Workshop on Integration of Deep Neural Models and Differential Equations, 2019

  24. [24]

    A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, 2011

  25. [25]

    R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pages 1310–1318, 2013

  26. [26]

    A. Paszke et al. Pytorch: An imperative style, high-performance deep learning library. arXiv:1912.01703 [cs.LG], 2019

  27. [27]

    S. Peluchetti and S. Favaro. Infinitely deep neural networks as diffusion processes. In International Conference on Artificial Intelligence and Statistics, pages 1126–1136. PMLR, 2020

  28. [28]

    S. Peluchetti and S. Favaro. Doubly infinite residual neural networks: a diffusion process approach. Journal of Machine Learning Research, 22:175/1–48, 2021

  29. [29]

    A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever. Improving language understanding by generative pre-training. Technical report, OpenAI, 2018. Available from: https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf

  30. [30]

    A. Radford, J. W. Kim, T. Xu, G. Brockman, C. Mcleavey, and I. Sutskever. Robust speech recognition via large-scale weak supervision. In Proceedings of the 40th International Conference on Machine Learning, volume 202, pages 28492–28518, 2023

  31. [31]

    D. Ruiz-Balet and E. Zuazua. Neural ode control for classification, approximation, and transport. SIAM Review, 65(3):735–773, 2023

  32. [32]

    M. E. Sander, P. Ablin, M. Blondel, and G. Peyré. Sinkformers: Transformers with Doubly Stochastic Attention. In Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, pages 3515–3530, 2022

  33. [33]

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is All you Need. In Advances in Neural Information Processing Systems, volume 30, 2017

  34. [34]

    C. Yun, S. Bhojanapalli, A. S. Rawat, S. J. Reddi, and S. Kumar. Are transformers universal approximators of sequence-to-sequence functions? In International Conference on Learning Representations, 2020

Email addresses: albert.alcalde@fau.de, giovanni.fantuzzi@fau.de, enrique.zuazua@fau.de