Clustering in pure-attention hardmax transformers and its role in sentiment analysis

Albert Alcalde; Enrique Zuazua; Giovanni Fantuzzi

arxiv: 2407.01602 · v2 · pith:ZR2PAAXVnew · submitted 2024-06-26 · 💻 cs.CL · cs.LG · math.DS · stat.ML

Pith reviewed 2026-05-23 23:38 UTC · model grok-4.3

classification 💻 cs.CL · cs.LG · math.DS · stat.ML

keywords clustering · hardmax self-attention · transformers · dynamical systems · leader points · sentiment analysis · hyperplane separation

The pith

In the infinite-layer limit, the inputs to hardmax self-attention transformers converge to clusters around leader points.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Transformers built from hardmax self-attention and normalization sublayers act as a discrete dynamical system on token embeddings in Euclidean space. The attention step separates points by hyperplanes, driving each token toward a small set of special leader points over repeated layers. The result is an asymptotic equilibrium in which inputs form clusters whose centers are exactly those leaders. The authors apply the same mechanism to sentiment analysis by letting leader words collect clusters of contextually related but semantically weaker tokens. This supplies a fully interpretable model whose clustering behavior directly explains how context is aggregated.
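As a reading aid, the dynamics described above can be written out explicitly. The display below is a sketch under assumptions: this page does not quote equation (1.1), so the form of the update, the matrix $A$, and the step parameter $\alpha > 0$ are inferred from the Figure 1 and Figure 2 captions and may differ in detail from the paper's statement.

```latex
\begin{align}
  z_i^{k+1} &= z_i^k + \frac{\alpha}{1+\alpha}\,\frac{1}{|C_i(Z^k)|}
               \sum_{j \in C_i(Z^k)} \bigl(z_j^k - z_i^k\bigr), \\
  C_i(Z^k) &= \Bigl\{\, j : \langle A z_i^k, z_j^k \rangle
               = \max_{\ell}\, \langle A z_i^k, z_\ell^k \rangle \Bigr\}.
\end{align}
```

Under this assumed form, a token with $C_i(Z^k) = \{i\}$ does not move at layer $k$, which matches the leader condition mentioned in the Figure 2 and Figure 7 captions.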

Core claim

Viewed as discrete-time dynamical systems, and using the geometric hyperplane-separation property of hardmax attention, such transformers drive their inputs asymptotically to a clustered equilibrium determined by special points called leaders.

What carries the argument

The hyperplane-separation property of hardmax self-attention, which determines the attention weights and thereby steers the discrete dynamical system toward leader-determined clusters.

If this is right

Inputs converge to a clustered equilibrium whose centers are the leader points.
Context in language tasks is captured by routing semantically weaker tokens into clusters around leader tokens.
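The same dynamics yield a fully interpretable transformer for sentiment analysis. Its parameters are still trained (the Figure 8 caption reports Adam applied to a regularized softmax model), and the interpretation comes from reading off the leaders and the clusters they attract.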
Remaining mathematical challenges must still be resolved before the clustering picture applies to trained, multi-head, softmax-based transformers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Leader points may correspond to the tokens that carry the strongest semantic signal in a given sequence.
The same geometric mechanism could be used to design new, explicitly clustered attention layers that remain interpretable at arbitrary depth.
If real transformers approximate hardmax behavior in deep layers, their attention maps should exhibit similar leader-driven clustering on natural-language data.

Load-bearing premise

The analysis assumes a pure-attention hardmax self-attention mechanism with normalization sublayers whose geometric hyperplane-separation property governs the infinite-layer limit.

What would settle it

A concrete numerical iteration of the hardmax-plus-normalization map on a finite point set that fails to produce clusters whose centers coincide with the leaders predicted by the hyperplane geometry.

Figures

Figures reproduced from arXiv: 2407.01602 by Albert Alcalde, Enrique Zuazua, Giovanni Fantuzzi.

**Figure 1.** Geometric interpretation of (1.1b) for $i = 1$ with (a) $A = I$ and (b) $A = \begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix}$. In (a), tokens $z_2$ and $z_3$ have the largest orthogonal projection on the direction of $Az_1 = z_1$, so $C_1(Z) = \{2, 3\}$. In (b), token $z_4$ has the largest projection on the direction of $Az_1$, so $C_1(Z) = \{4\}$. In both cases, tokens attracting $z_1$ can only lie on the closed half-space $H_1 = \{z : \langle Az_1, z - z_1 \rangle \ge 0\}$ (blue shading). discuss…
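The selection rule in the Figure 1 caption can be illustrated with a small computation. The sketch below assumes that $C_i(Z)$ collects the indices $j$ maximizing $\langle Az_i, z_j \rangle$, as the caption suggests. The four token coordinates are hypothetical values chosen to reproduce the two cases described, not the points plotted in the paper, and the helper names are illustrative. Indices are 0-based, so the caption's $C_1(Z) = \{2, 3\}$ appears as `[1, 2]`.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


def matvec(A: Sequence[Sequence[float]], v: Sequence[float]) -> List[float]:
    """Matrix-vector product A v."""
    return [dot(row, v) for row in A]


def attention_set(i: int, Z, A, tol: float = 1e-12) -> List[int]:
    """Indices j that maximize <A z_i, z_j>; ties are kept (assumed rule)."""
    direction = matvec(A, Z[i])
    scores = [dot(direction, z) for z in Z]
    best = max(scores)
    return [j for j, s in enumerate(scores) if best - s <= tol]


def in_half_space(i: int, j: int, Z, A) -> bool:
    """Check <A z_i, z_j - z_i> >= 0, the half-space H_i from the caption."""
    direction = matvec(A, Z[i])
    diff = [a - b for a, b in zip(Z[j], Z[i])]
    return dot(direction, diff) >= 0.0


# Hypothetical tokens z_1..z_4 (0-based indices 0..3).
Z = [[1.0, 0.0], [2.0, 1.0], [2.0, -1.0], [1.5, 3.0]]
IDENTITY = [[1.0, 0.0], [0.0, 1.0]]
A_CASE_B = [[2.0, 1.0], [1.0, 1.0]]

set_a = attention_set(0, Z, IDENTITY)   # case (a): A = I
set_b = attention_set(0, Z, A_CASE_B)   # case (b): A = [[2, 1], [1, 1]]
print(set_a, set_b)
```

Changing $A$ changes which tokens attract $z_1$ without moving any point, which is the geometric role the caption assigns to the matrix.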

**Figure 2.** Simulations of (1.1) with $\alpha = 0.5$, $A = I$, and four different initial token values. In each panel, stars denote tokens $z_i$ satisfying $C_i(Z^k) = \{i\}$ at layer $k \in \mathbb{N}$, while circles denote all other tokens. Colors indicate which tokens are being followed. Tokens painted in two halves follow two tokens. Tokens whose interior and edge colors are different, instead, follow tokens of their interior color and are fo…
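A simulation in the spirit of Figure 2 can be sketched in a few lines. The code below is not the authors' implementation. It assumes each layer moves token $z_i$ a fraction $\alpha/(1+\alpha)$ of the way toward the mean of the tokens in $C_i(Z^k)$, with $A = I$ and $\alpha = 0.5$ as in the caption. The initial tokens are hypothetical, and the function names are illustrative.

```python
import math
from typing import List


def dot(u, v) -> float:
    """Euclidean inner product."""
    return sum(a * b for a, b in zip(u, v))


def attention_set(i: int, Z, tol: float = 1e-12) -> List[int]:
    """Indices j maximizing <z_i, z_j> (A = I); ties are kept."""
    scores = [dot(Z[i], z) for z in Z]
    best = max(scores)
    return [j for j, s in enumerate(scores) if best - s <= tol]


def layer(Z, alpha: float):
    """One assumed hardmax-attention layer with step alpha / (1 + alpha)."""
    step = alpha / (1.0 + alpha)
    new_Z = []
    for i, z_i in enumerate(Z):
        followed = attention_set(i, Z)
        dim = len(z_i)
        mean = [sum(Z[j][c] for j in followed) / len(followed) for c in range(dim)]
        new_Z.append([z_i[c] + step * (mean[c] - z_i[c]) for c in range(dim)])
    return new_Z


def leaders(Z) -> List[int]:
    """Tokens satisfying C_i(Z) = {i}, drawn as stars in Figure 2."""
    return [i for i in range(len(Z)) if attention_set(i, Z) == [i]]


def simulate(Z0, alpha: float, num_layers: int):
    """Apply the layer map num_layers times to a copy of Z0."""
    Z = [list(z) for z in Z0]
    for _ in range(num_layers):
        Z = layer(Z, alpha)
    return Z


# Hypothetical initial tokens: two outer points and two inner followers.
Z0 = [[2.0, 0.0], [1.0, 0.5], [-2.0, 0.0], [-1.0, -0.5]]
initial_leaders = leaders(Z0)
Z_final = simulate(Z0, alpha=0.5, num_layers=60)
gap_right = math.dist(Z_final[1], Z_final[0])
gap_left = math.dist(Z_final[3], Z_final[2])
print(initial_leaders, gap_right, gap_left)
```

In this toy configuration the two outer tokens are leaders from the first layer and stay fixed, while each inner token approaches the leader it follows at a geometric rate. Other initial conditions can behave differently, for example when a token follows two tokens at once.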

**Figure 3.** Schematic illustrations of a deep neural network with normalization and feedforward sublayers (top), a full transformer with self-attention, normalization, and feedforward sublayers (middle), and a pure-attention transformer with only self-attention and normalization sublayers (bottom). Each model takes a matrix $Z^0 \in \mathbb{R}^{n \times d}$ as its input and outputs a matrix $Z^K \in \mathbb{R}^{n \times d}$ after being processed by $K$ transfor…

**Figure 4.** Sketch of the $\delta$-neighborhood of the attracting set $S$. Combining the last two inequalities, we obtain (4.17) …

**Figure 5.** Illustration of the geometric intuition behind Lemma 4.5 for $S' = \{s_1, s_2\}$. If $\delta > 0$ is small enough, then for all $x \in B_\delta(s_0)$ the hyperplane with normal direction $x$ passing through $x$ (in green) separates $s_0$ from the neighborhoods $B_\delta(s_1)$ and $B_\delta(s_2)$. Proof. Since $S$ is finite by Lemma 4.1, there exists $\delta_0 > 0$ such that (i) is satisfied for all $\delta \le \delta_0$. We now find $\delta \le \delta_0$ such that (ii) holds following a const…

**Figure 6.** Illustration of the constructive argument in the proof of Proposition 4.6. Token $z_i$ remains in a neighborhood of $s_1 \in S_1$. Token $z_a$ falls in case (a) of the analysis, so it remains in a neighborhood of $s_2 \in S_2$, while token $z_b$ falls in case (b), jumping from a neighborhood of $s_2' \in S_2$ to the neighborhood of $s_1$, where it remains for all future times. for all $k \ge k_2$. This allows only two possibilities: (a) z…

**Figure 7.** Token values in Example 5.2 for (a) initial time $k = 0$ and (b) time $k = 1$. After one iteration, the token $z_2$ has moved enough to be below the hyperplane separating the tokens influencing token $z_1$ (in green), which causes token $z_1$ to satisfy the leader condition $C_1(Z^1) = \{1\}$ at time $k = 1$. Additionally, $z_i^k$ satisfies by definition (5.2) $\langle z_i^k, z_r^k \rangle < \langle z_i^k, z_i^k \rangle$ for all $r \ne i$. We have just prov…

**Figure 8.** Loss on the training set for encoder dimensions $d \in \{2, 4\}$, calculated at every epoch with the regularized softmax model used for training, and every 10 epochs with our hardmax model. We then use the gradient-based algorithm Adam [20] to minimize the average binary cross-entropy loss (6.1) $\frac{1}{N} \sum_{i=1}^{N} -\bigl(y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i)\bigr)$ calculated using a training set of $N = 35\,000$ reviews. The remaini…
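The loss in (6.1) is the standard average binary cross-entropy, and a direct transcription may help when reading the training curves. The function below is a minimal sketch: the name `average_bce` and the clipping constant `eps` are assumptions added for numerical safety and do not come from the paper.

```python
import math
from typing import Sequence


def average_bce(labels: Sequence[float], predictions: Sequence[float],
                eps: float = 1e-12) -> float:
    """Average binary cross-entropy over N samples, following (6.1).

    labels are y_i in {0, 1}; predictions are y_hat_i in [0, 1].
    Predictions are clipped to [eps, 1 - eps] to avoid log(0); the
    clipping is an added safeguard, not part of the formula.
    """
    if len(labels) != len(predictions) or len(labels) == 0:
        raise ValueError("labels and predictions must be non-empty and equal in length")
    total = 0.0
    for y, y_hat in zip(labels, predictions):
        p = min(max(y_hat, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(labels)


# One confident positive and one confident negative prediction.
loss = average_bce([1.0, 0.0], [0.9, 0.1])
print(loss)
```

Both samples in the usage example contribute $-\log(0.9)$, so the average equals that value up to rounding.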

**Figure 9.** Evolution of the words of a positive review (top) and a negative review (bottom), as they are processed by the transformer layers. Color coded as …

**Figure 10.** Histogram of the 15 most frequent leaders $z_i$ of correctly classified test reviews and satisfying $|H(z_i)| \ge 2$, meaning that they are situated far from the separating hyperplane $H$

Original abstract

Transformers are extremely successful machine learning models whose mathematical properties remain poorly understood. Here, we rigorously characterize the behavior of transformers with hardmax self-attention and normalization sublayers as the number of layers tends to infinity. By viewing such transformers as discrete-time dynamical systems describing the evolution of points in a Euclidean space, and thanks to a geometric interpretation of the self-attention mechanism based on hyperplane separation, we show that the transformer inputs asymptotically converge to a clustered equilibrium determined by special points called leaders. We then leverage this theoretical understanding to solve sentiment analysis problems from language processing using a fully interpretable transformer model, which effectively captures 'context' by clustering meaningless words around leader words carrying the most meaning. Finally, we outline remaining challenges to bridge the gap between the mathematical analysis of transformers and their real-life implementation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper proves that tokens cluster around 'leaders' in the infinite-layer limit of hardmax transformers, using dynamical systems and hyperplane geometry, then applies the result to an interpretable sentiment model.

This paper shows that inputs to pure hardmax self-attention transformers with normalization converge asymptotically to clusters around special leader points as depth goes to infinity. The argument treats the stack as a discrete dynamical system and uses the geometric property that hardmax attention separates points by hyperplanes to drive the clustering. That characterization is the main new piece, and the authors then use the leaders directly to build a sentiment classifier that groups uninformative tokens around the meaningful ones. The math is scoped tightly to this model class and they state the gap to real implementations, which keeps the claim honest. The sentiment application follows as a straightforward demonstration rather than a performance claim. The central limitation is the hardmax restriction itself; standard transformers use softmax and additional components, so the clustering behavior may not transfer. The result is narrow but cleanly derived within its bounds, with no obvious circularity or unstated assumptions that break the limit argument. This is for readers working on theoretical accounts of attention and dynamical views of deep networks. It deserves peer review because the core convergence result is worked out with explicit geometry and the application is a direct, interpretable consequence rather than an add-on.

Referee Report

0 major / 3 minor

Summary. The manuscript rigorously characterizes the infinite-layer limit of pure-attention hardmax transformers equipped with normalization sublayers by modeling them as discrete-time dynamical systems on Euclidean space. A geometric argument based on hyperplane separation is used to prove that token representations asymptotically converge to a clustered equilibrium whose attractors are special points termed leaders. The derived clustering property is then applied to construct a fully interpretable transformer for sentiment-analysis tasks, in which semantically meaningless tokens cluster around leader tokens that carry the primary meaning; remaining challenges for closing the gap to practical implementations are outlined.

Significance. If the convergence theorem holds, the work supplies a parameter-free dynamical-systems explanation for clustering phenomena in a precisely defined subclass of attention models and demonstrates how the resulting leaders can be exploited for interpretable NLP. The geometric hyperplane-separation technique and the explicit scoping to hardmax-plus-normalization dynamics constitute clear strengths; the sentiment-analysis application supplies a concrete, falsifiable use case.

minor comments (3)

The definition and selection rule for the 'leaders' should be stated explicitly in the introduction or in a dedicated preliminary section rather than introduced only in the convergence theorem statement.
Notation for the normalization sublayers and the precise form of the hardmax operator should be unified across the dynamical-system formulation and the sentiment-analysis experiments.
The discussion of the gap between the infinite-layer analysis and finite practical transformers would benefit from a short paragraph quantifying how many layers are typically required for the clustering to become observable in the reported experiments.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive and constructive review. The recommendation for minor revision is appreciated, and we will make the necessary adjustments in the revised version.

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

The central result is a mathematical characterization of the infinite-layer limit for the specific dynamical system of hardmax self-attention plus normalization, obtained via geometric hyperplane separation. No load-bearing step reduces by construction to a fitted parameter, self-definition, or self-citation chain; the clustering equilibrium follows from the stated model equations and geometric property without circular reduction. The sentiment-analysis application is scoped as a downstream illustration, not part of the convergence derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

Analysis rests on viewing the transformer as a dynamical system and on the geometric property of hardmax attention; 'leaders' are introduced as the attractors. Full details unavailable from abstract.

axioms (2)

domain assumption: Transformers with hardmax self-attention and normalization can be modeled as discrete-time dynamical systems on Euclidean space.
Stated in the abstract as the basis for the asymptotic analysis.
domain assumption: Self-attention admits a geometric interpretation based on hyperplane separation.
Invoked to prove convergence to clustered equilibria.

invented entities (1)

leaders (no independent evidence)
purpose: Special points that determine the clustered equilibrium
New concept introduced to describe the attractors of the dynamical system

pith-pipeline@v0.9.0 · 5678 in / 1296 out tokens · 21976 ms · 2026-05-23T23:38:06.569452+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Control, Optimal Transport and Neural Differential Equations in Supervised Learning
math.NA 2025-03 unverdicted novelty 6.0

A novel framework approximates unbalanced optimal transport using Neural ODEs via a generalized discrete problem, a Sinkhorn-inspired scheme with proven convergence and error estimates, and derived transport dynamics.

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · cited by 1 Pith paper · 3 internal anchors

[1] F. A. Acheampong, H. Nunoo-Mensah, and W. Chen. Transformer models for text-based emotion detection: a review of bert-based approaches. Artificial Intelligence Review, 54(8):5789–5829, 2021.

[2] S. Alberti, N. Dern, L. Thesing, and G. Kutyniok. Sumformer: Universal approximation for efficient transformers. In Topological, Algebraic and Geometric Learning Workshops 2023, pages 72–86. PMLR, 2023.

[3] V. Borovikov. On the intersection of a sequence of simplexes. Uspekhi Matematicheskikh Nauk, 7:179–180, 1952.

[4] T. Brown et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901, 2020.

[5] F. Charton, A. Hayat, and G. Lample. Learning advanced mathematical computations from examples. In 9th International Conference on Learning Representations (ICLR 2021), 2021.

[6] F. Charton, A. Hayat, S. T. McQuade, N. J. Merrill, and B. Piccoli. A deep language model to predict metabolic network equilibria. arXiv:2112.03588 [cs.LG], 2021.

[7] R. T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, volume 31, 2018.

[8] A. Dosovitskiy et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.

[9] M. Dwarampudi and N. Reddy. Effects of padding on lstms and cnns. arXiv:1903.07288 [cs.LG], 2019.

[10] W. E. A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics, 5(1):1–11, 2017.

[11] I. M. Elfadel and J. L. Wyatt Jr. The ‘softmax’ nonlinearity: Derivation using statistical mechanics and useful properties as a multiterminal analog circuit element. Advances in Neural Information Processing Systems, 6, 1993.

[12] B. Geshkovski and E. Zuazua. Turnpike in optimal control of pdes, resnets, and beyond. Acta Numerica, 31:135–263, 2022.

[13] B. Geshkovski, C. Letrouit, Y. Polyanskiy, and P. Rigollet. The emergence of clusters in self-attention dynamics. arXiv:2305.05465 [cs.LG], 2023.

[14] B. Geshkovski, C. Letrouit, Y. Polyanskiy, and P. Rigollet. A mathematical perspective on Transformers. arXiv:2312.10794 [cs.LG], 2023.

[15] F. Gloeckle, B. Rozière, A. Hayat, and G. Synnaeve. Temperature-scaled large language models for lean proofstep prediction. In 37th Conference on Neural Information Processing Systems (NeurIPS 2023), 2023.

[16] S. Hayou. On the infinite-depth limit of finite-width neural networks. arXiv:2210.00688 [stat.ML], 2023.

[17] S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, 11 1997. ISSN 0899-7667.

[18] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, volume 37, pages 448–456, 2015.

[19] J. M. Jumper et al. Highly accurate protein structure prediction with alphafold. Nature, 596:583–589, 2021.

[20] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv:1412.6980 [cs.LG], 2014.

[21] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut. Albert: A lite bert for self-supervised learning of language representations. In International Conference on Learning Representations, 2020.

[22] Q. Li, T. Lin, and Z. Shen. Deep learning via dynamical systems: An approximation perspective. Journal of the European Mathematical Society, 25(5):1671–1709, 2022.

[23] Y. Lu et al. Understanding and improving transformer from a multi-particle dynamic system point of view. In ICLR 2020 Workshop on Integration of Deep Neural Models and Differential Equations, 2019.

[24] A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, 2011.

[25] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In International conference on machine learning, pages 1310–1318, 2013.

[26] A. Paszke et al. Pytorch: An imperative style, high-performance deep learning library. arXiv:1912.01703 [cs.LG], 2019.

[27] S. Peluchetti and S. Favaro. Infinitely deep neural networks as diffusion processes. In International Conference on Artificial Intelligence and Statistics, pages 1126–1136. PMLR, 2020.

[28] S. Peluchetti and S. Favaro. Doubly infinite residual neural networks: a diffusion process approach. Journal of Machine Learning Research, 22:175/1–48, 2021.

[29] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever. Improving language understanding by generative pre-training. Technical report, OpenAI, 2018. Available from: https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf

[30] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. Mcleavey, and I. Sutskever. Robust speech recognition via large-scale weak supervision. In Proceedings of the 40th International Conference on Machine Learning, volume 202, pages 28492–28518, 2023.

[31] D. Ruiz-Balet and E. Zuazua. Neural ode control for classification, approximation, and transport. SIAM Review, 65(3):735–773, 2023.

[32] M. E. Sander, P. Ablin, M. Blondel, and G. Peyré. Sinkformers: Transformers with Doubly Stochastic Attention. In Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, pages 3515–3530, 2022.

[33] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is All you Need. In Advances in Neural Information Processing Systems, volume 30, 2017.

[34] C. Yun, S. Bhojanapalli, A. S. Rawat, S. J. Reddi, and S. Kumar. Are transformers universal approximators of sequence-to-sequence functions? In International Conference on Learning Representations, 2020.

Email addresses: albert.alcalde@fau.de, giovanni.fantuzzi@fau.de, enrique.zuazua@fau.de
