pith. machine review for the scientific record.

arxiv: 2604.26898 · v1 · submitted 2026-04-29 · 🧮 math.PR · cs.LG · stat.ML

Recognition: unknown

Stochastic Scaling Limits and Synchronization by Noise in Deep Transformer Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 10:49 UTC · model grok-4.3

classification 🧮 math.PR · cs.LG · stat.ML
keywords transformer models · stochastic scaling limits · synchronization by noise · interacting particle systems · propagation of chaos · stochastic partial differential equations · activation functions

The pith

Finite transformer token evolution converges pathwise to a stochastic particle system with noise-driven synchronization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proves that the discrete layerwise updates of tokens in a transformer with MLP blocks converge pathwise to a system of particles evolving in continuous time under stochastic dynamics. It identifies the stochastic partial differential equation governing the particle distribution and shows that the empirical distribution of tokens converges to it as the token count grows. The limiting system exhibits synchronization by noise: the particles align under a shared noise source that dominates the self-attention drift, and the interaction energy decays exponentially on average. This requires the common noise to be sufficiently coercive relative to the self-attention drift, and it holds for activation functions satisfying an explicitly characterized condition.
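
To fix intuition, here is a minimal Python sketch of the discrete picture: tokens on the unit sphere updated layer by layer with a softmax self-attention term, a small MLP block, and one Gaussian increment per layer shared by every token, scaled like sqrt(dt). The sphere normalization, the tanh activation, the identity key/query/value maps, and the constants (beta, sigma, dt, layer count) are illustrative choices, not the paper's parametrization; the sketch only makes visible how a shared noise source can drive the pairwise interaction energy down across layers.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, dim, n_layers, dt = 32, 16, 400, 0.01
beta = 1.0     # attention inverse temperature (illustrative)
sigma = 1.5    # strength of the shared ("common") noise (illustrative)

# Random MLP weights with a tanh nonlinearity, standing in for the MLP block.
W1 = rng.normal(size=(4 * dim, dim)) / np.sqrt(dim)
W2 = rng.normal(size=(dim, 4 * dim)) / np.sqrt(4 * dim)

def mlp(x):
    return np.tanh(x @ W1.T) @ W2.T

def attention(x):
    # Softmax self-attention with identity key/query/value maps (illustrative).
    logits = beta * x @ x.T
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ x

def normalize(x):
    # Crude stand-in for layer normalization: project rows back to the unit sphere.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def interaction_energy(x):
    # Mean squared pairwise distance between tokens.
    d = x[:, None, :] - x[None, :, :]
    return (d ** 2).sum(axis=-1).mean()

x = normalize(rng.normal(size=(n_tokens, dim)))
energies = []
for _ in range(n_layers):
    drift = attention(x) + mlp(x)
    common = sigma * np.sqrt(dt) * rng.normal(size=dim)  # one increment shared by all tokens
    x = normalize(x + dt * drift + common)                # layerwise Euler-type update
    energies.append(interaction_energy(x))

print(f"interaction energy: layer 1 = {energies[0]:.3f}, layer {n_layers} = {energies[-1]:.3f}")
```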

Core claim

We prove pathwise convergence of the layerwise evolution of tokens in a finite-depth, finite-width transformer model with MultiLayer Perceptron (MLP) blocks to a continuous-time stochastic interacting particle system. We also identify the stochastic partial differential equation describing the evolution of the tokens' distribution in this limit and prove propagation of chaos when the number of such tokens is large. The bounds we establish are quantitative and the limits we consider commute. We further prove that the limiting stochastic model displays synchronization by noise and establish exponential dissipation of the interaction energy on average, provided that the common noise is sufficiently coercive relative to the deterministic self-attention drift.
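
In symbols, and only as a schematic of the claimed mode of convergence (the precise norm, constants, and their dependence on depth and width are the paper's), the pathwise statement has the shape of a quantitative strong-approximation bound between the layerwise token X^i_ℓ and the continuous-time particle evaluated at the matching time:

```latex
% Schematic only: \bar{X}^i denotes the limiting continuous-time particle and \Delta t = T/L
% the layer step; the exact exponent and constant are not reproduced here.
\mathbb{E}\Big[\,\sup_{0 \le \ell \le L} \big|\, X^i_\ell - \bar{X}^i_{\ell \Delta t} \,\big|^2 \Big]
\;\le\; C\,\Delta t .
```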

What carries the argument

The common noise term in the limiting stochastic interacting particle system, which drives synchronization by overpowering the deterministic self-attention drift when sufficiently coercive.
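
A schematic form of such a system, written only to fix ideas (the paper's precise drift, noise coefficients, and sphere-valued state space are not reproduced here), is an interacting particle SDE in which every particle is driven by the same Brownian motions:

```latex
% Schematic: a(.,.) stands for the self-attention-plus-MLP drift acting through the empirical
% measure, the sigma_k(.) for the common-noise coefficients, and the B^k for Brownian motions
% shared by all particles i = 1, ..., N.
dX^i_t \;=\; a\!\big(X^i_t,\,\mu^N_t\big)\,dt \;+\; \sum_{k\ge 1}\sigma_k\!\big(X^i_t\big)\,dB^k_t,
\qquad
\mu^N_t \;=\; \frac{1}{N}\sum_{j=1}^{N}\delta_{X^j_t}.
```

Because the increments dB^k_t are common to all particles, a pairwise difference X^i_t - X^j_t feels only the difference of drifts and of noise coefficients, which is the channel through which a sufficiently coercive noise can dominate the deterministic self-attention drift.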

If this is right

  • The layerwise token dynamics admit quantitative approximation by the continuous-time stochastic particle system.
  • The distribution of tokens obeys a specific stochastic partial differential equation in the scaling limit.
  • Propagation of chaos holds, so the tokens behave as independent copies of the limiting distribution for large token counts (see the schematic after this list).
  • Synchronization by noise occurs with exponential average dissipation of interaction energy when the coercivity condition holds.
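
Schematically, and conditional on the common noise (the paper's exact operator, metric, and rate are not visible from the abstract), the second and third bullets combine into a stochastic Fokker-Planck equation for the limiting token law together with a propagation-of-chaos statement for the empirical measure:

```latex
% Schematic only: Sigma = sum_k sigma_k sigma_k^T; the drift a, the coefficients sigma_k, and
% the Wasserstein rate are placeholders for the paper's objects.
d\mu_t \;=\; \Big[-\nabla\!\cdot\!\big(a(x,\mu_t)\,\mu_t\big)
  \;+\; \tfrac12\,\nabla^2\!:\!\big(\Sigma(x)\,\mu_t\big)\Big]\,dt
  \;-\; \sum_{k\ge 1}\nabla\!\cdot\!\big(\sigma_k(x)\,\mu_t\big)\,dB^k_t,
\qquad
\mathbb{E}\big[\mathcal{W}_2^2\big(\mu^N_t,\,\mu_t\big)\big] \;\xrightarrow[\;N\to\infty\;]{}\; 0 .
```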

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The scaling limit opens the possibility of analyzing very deep transformers using tools from stochastic analysis instead of discrete recursion.
  • The synchronization phenomenon might motivate adding controlled shared noise to finite-width transformers to encourage alignment of token representations during training.
  • The characterization of suitable activation functions supplies a concrete design criterion for nonlinearities that promote dissipative behavior in the continuous limit.

Load-bearing premise

The common noise must be sufficiently coercive relative to the deterministic self-attention drift, together with the specific conditions on activation functions; if this coercivity fails, the synchronization and exponential energy dissipation claims do not hold.
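
As a prototype of what the premise buys (with schematic constants: λ_σ for the contraction supplied by the common noise and L_a for a Lipschitz-type bound on the self-attention drift; the paper's actual condition involves the activation functions and is more specific), a Gronwall comparison yields exponential decay of the expected interaction energy exactly when the noise term wins:

```latex
% Prototype bound, not the paper's theorem: lambda_sigma and L_a are schematic constants.
\frac{d}{dt}\,\mathbb{E}\big|X^i_t - X^j_t\big|^2
  \;\le\; -2\,(\lambda_\sigma - L_a)\,\mathbb{E}\big|X^i_t - X^j_t\big|^2
\;\;\Longrightarrow\;\;
\mathbb{E}\big|X^i_t - X^j_t\big|^2
  \;\le\; e^{-2(\lambda_\sigma - L_a)t}\,\mathbb{E}\big|X^i_0 - X^j_0\big|^2,
\qquad \lambda_\sigma > L_a .
```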

What would settle it

Numerical integration of the limiting stochastic particle system with noise intensity below the coercivity threshold set by the self-attention drift, showing that the expected interaction energy fails to decay exponentially.
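
A minimal version of that check, sketched in Python under stated assumptions: an Euler-Maruyama discretization of sphere-valued particles with a softmax-attention drift and a single set of Brownian increments shared by all particles, sweeping the noise strength and fitting the exponential rate of the interaction energy. All constants, the drift form, and the crude renormalization step are illustrative; the actual threshold would have to be read off the paper's coercivity condition.

```python
import numpy as np

rng = np.random.default_rng(1)
n_particles, dim, t_end, dt = 64, 8, 20.0, 0.01
beta = 1.0                      # attention inverse temperature (illustrative)
steps = int(t_end / dt)

def attention_drift(x):
    # Drift toward the softmax-attention average (illustrative form of the self-attention drift).
    logits = beta * x @ x.T
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ x - x

def normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def energy(x):
    d = x[:, None, :] - x[None, :, :]
    return (d ** 2).sum(axis=-1).mean()

def fitted_rate(sigma):
    # Euler-Maruyama with one Brownian increment shared by every particle.
    x = normalize(rng.normal(size=(n_particles, dim)))
    energies = []
    for _ in range(steps):
        common = sigma * np.sqrt(dt) * rng.normal(size=dim)
        x = normalize(x + dt * attention_drift(x) + common)
        energies.append(energy(x))
    t = dt * np.arange(1, steps + 1)
    return np.polyfit(t, np.log(np.maximum(energies, 1e-12)), 1)[0]

for sigma in (0.1, 0.5, 1.0, 2.0):
    print(f"sigma = {sigma:4.1f}  fitted log-energy slope = {fitted_rate(sigma):+.3f}")
```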

Figures

Figures reproduced from arXiv: 2604.26898 by Andrea Agazzi, Eloy Mosig García, Giuseppe Bruno, Marco Romito, Samuele Saviozzi.

Figure 1. Noise-induced synchronization without attention. Evolution of the first two compo…
Figure 2. Numerical simulations of the discrete transformer dynamics with ReLU activation, …
Figure 3. Numerical simulations of the discrete transformer dynamics with SiLU activation, …
Figure 4. Numerical simulations of the discrete transformer dynamics with Sigmoid activation, …
Figure 5. Numerical simulations of the discrete transformer dynamics with Tanh activation, …
Figure 6. Numerical simulations of the discrete transformer dynamics without MLPs, …
Original abstract

We prove pathwise convergence of the layerwise evolution of tokens in a finite-depth, finite-width transformer model with MultiLayer Perceptron (MLP) blocks to a continuous-time stochastic interacting particle system. We also identify the stochastic partial differential equation describing the evolution of the tokens' distribution in this limit and prove propagation of chaos when the number of such tokens is large. The bounds we establish are quantitative and the limits we consider commute. We further prove that the limiting stochastic model displays synchronization by noise and establish exponential dissipation of the interaction energy on average, provided that the common noise is sufficiently coercive relative to the deterministic self-attention drift. We finally characterize the activation functions satisfying the former condition.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper proves pathwise convergence of the layerwise evolution of tokens in a finite-depth, finite-width transformer model with MLP blocks to a continuous-time stochastic interacting particle system. It identifies the SPDE for the limiting token distribution, establishes quantitative bounds with commuting limits, proves propagation of chaos for large token counts, and shows that the limiting model exhibits synchronization by noise with exponential dissipation of interaction energy on average, provided the common noise is sufficiently coercive relative to the deterministic self-attention drift. It also characterizes the activation functions satisfying this coercivity condition.

Significance. If the results hold, this work supplies a rigorous mathematical link between discrete transformer dynamics and continuous stochastic particle systems, with potential to explain synchronization phenomena in large models. The quantitative bounds, commuting limits, and explicit characterization of admissible activation functions are strengths that could support further analysis of scaling and emergent behavior in neural networks. The conditional nature of the synchronization result (tied to coercivity) is appropriately stated.

minor comments (3)
  1. The introduction could more explicitly state the precise scaling regime (e.g., how the layer step size and width enter the quantitative error bounds) to make the limit passage clearer to readers outside the immediate subfield.
  2. Notation for the interacting particle system and the common noise term should be introduced with a dedicated table or diagram in §2 to improve readability when tracking the passage from the discrete transformer to the SPDE.
  3. A brief remark on the well-posedness of the limiting SPDE under the stated coercivity assumption would help readers verify that the synchronization result applies to a unique solution.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of our results on pathwise convergence of transformer token dynamics to stochastic particle systems and SPDEs, as well as the recognition of the quantitative bounds, commuting limits, propagation of chaos, and the conditional synchronization-by-noise result under coercivity. We appreciate the recommendation for minor revision and the acknowledgment that the conditional nature of the synchronization is appropriately stated. No major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity; the derivation is a self-contained mathematical proof.

full rationale

The paper establishes pathwise convergence of finite-depth finite-width transformer token dynamics (with MLP blocks) to a continuous-time stochastic interacting particle system, identifies the limiting SPDE, proves propagation of chaos, and shows conditional synchronization by noise with exponential energy dissipation. All steps are quantitative limit passages under explicitly stated assumptions on activation functions and a coercivity condition on common noise relative to self-attention drift. The final characterization of admissible activation functions is derived as part of the proof rather than presupposed. No parameter fitting, self-definitional reductions, load-bearing self-citations, or imported uniqueness theorems appear in the claimed chain; the argument remains conditional and externally falsifiable via the stated coercivity requirement.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard assumptions from stochastic analysis and the transformer architecture; no free parameters are fitted and no new entities are postulated.

axioms (2)
  • domain assumption The transformer has finite depth and width with standard self-attention and MLP blocks whose evolution can be tracked layerwise.
    Invoked as the starting point for taking the scaling limit.
  • standard math Existence, uniqueness, and well-posedness of solutions to the limiting stochastic particle system and SPDE.
    Required for the convergence statements and propagation of chaos.

pith-pipeline@v0.9.0 · 5425 in / 1486 out tokens · 63989 ms · 2026-05-07T10:49:54.334224+00:00 · methodology

discussion (0)


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Uniform Scaling Limits in AdamW-Trained Transformers

    stat.ML · 2026-05 · unverdicted · novelty 7.0

    AdamW-trained transformer hidden states and backpropagated variables converge uniformly in L2 to a forward-backward ODE system (McKean-Vlasov when non-causal) at rate O(L^{-1}+L^{-1/3}H^{-1/2}) as depth L and heads H ...

  2. Quantifying Concentration Phenomena of Mean-Field Transformers in the Low-Temperature Regime

    math.AP · 2026-05 · unverdicted · novelty 6.0

    In the low-temperature regime, the token distribution in mean-field transformers concentrates onto the push-forward under a key-query-value projection with Wasserstein distance scaling as √(log(β+1)/β) exp(Ct) + exp(-ct).

Reference graph

Works this paper leans on

68 extracted references · 19 canonical work pages · cited by 2 Pith papers · 3 internal anchors
