pith. machine review for the scientific record.

arxiv: 2604.26898 · v1 · submitted 2026-04-29 · 🧮 math.PR · cs.LG · stat.ML

Recognition: unknown

Stochastic Scaling Limits and Synchronization by Noise in Deep Transformer Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 10:49 UTC · model grok-4.3

classification 🧮 math.PR · cs.LG · stat.ML
keywords transformer models · stochastic scaling limits · synchronization by noise · interacting particle systems · propagation of chaos · stochastic partial differential equations · activation functions

The pith

Finite transformer token evolution converges pathwise to a stochastic particle system with noise-driven synchronization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proves that the discrete layerwise updates of tokens in a transformer with MLP blocks converge pathwise to a system of particles evolving in continuous time under stochastic dynamics. It identifies the stochastic partial differential equation governing the particle distribution and shows that the empirical distribution of tokens converges to it as the token count grows. The limiting system exhibits synchronization by noise: the particles align under a shared noise source that dominates the self-attention drift, and the interaction energy decays exponentially on average. This requires the common noise to be sufficiently coercive relative to the self-attention drift, and it holds for activation functions satisfying an explicitly characterized condition.
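
To fix intuition, here is a minimal Python sketch of the discrete picture: tokens on the unit sphere updated layer by layer with a softmax self-attention term, a small MLP block, and one Gaussian increment per layer shared by every token, scaled like sqrt(dt). The sphere normalization, the tanh activation, the identity key/query/value maps, and the constants (beta, sigma, dt, layer count) are illustrative choices, not the paper's parametrization; the sketch only makes visible how a shared noise source can drive the pairwise interaction energy down across layers.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, dim, n_layers, dt = 32, 16, 400, 0.01
beta = 1.0     # attention inverse temperature (illustrative)
sigma = 1.5    # strength of the shared ("common") noise (illustrative)

# Random MLP weights with a tanh nonlinearity, standing in for the MLP block.
W1 = rng.normal(size=(4 * dim, dim)) / np.sqrt(dim)
W2 = rng.normal(size=(dim, 4 * dim)) / np.sqrt(4 * dim)

def mlp(x):
    return np.tanh(x @ W1.T) @ W2.T

def attention(x):
    # Softmax self-attention with identity key/query/value maps (illustrative).
    logits = beta * x @ x.T
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ x

def normalize(x):
    # Crude stand-in for layer normalization: project rows back to the unit sphere.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def interaction_energy(x):
    # Mean squared pairwise distance between tokens.
    d = x[:, None, :] - x[None, :, :]
    return (d ** 2).sum(axis=-1).mean()

x = normalize(rng.normal(size=(n_tokens, dim)))
energies = []
for _ in range(n_layers):
    drift = attention(x) + mlp(x)
    common = sigma * np.sqrt(dt) * rng.normal(size=dim)  # one increment shared by all tokens
    x = normalize(x + dt * drift + common)                # layerwise Euler-type update
    energies.append(interaction_energy(x))

print(f"interaction energy: layer 1 = {energies[0]:.3f}, layer {n_layers} = {energies[-1]:.3f}")
```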

Core claim

We prove pathwise convergence of the layerwise evolution of tokens in a finite-depth, finite-width transformer model with MultiLayer Perceptron (MLP) blocks to a continuous-time stochastic interacting particle system. We also identify the stochastic partial differential equation describing the evolution of the tokens' distribution in this limit and prove propagation of chaos when the number of such tokens is large. The bounds we establish are quantitative and the limits we consider commute. We further prove that the limiting stochastic model displays synchronization by noise and establish exponential dissipation of the interaction energy on average, provided that the common noise is sufficiently coercive relative to the deterministic self-attention drift.
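
In symbols, and only as a schematic of the claimed mode of convergence (the precise norm, constants, and their dependence on depth and width are the paper's), the pathwise statement has the shape of a quantitative strong-approximation bound between the layerwise token X^i_ℓ and the continuous-time particle evaluated at the matching time:

```latex
% Schematic only: \bar{X}^i denotes the limiting continuous-time particle and \Delta t = T/L
% the layer step; the exact exponent and constant are not reproduced here.
\mathbb{E}\Big[\,\sup_{0 \le \ell \le L} \big|\, X^i_\ell - \bar{X}^i_{\ell \Delta t} \,\big|^2 \Big]
\;\le\; C\,\Delta t .
```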

What carries the argument

The common noise term in the limiting stochastic interacting particle system, which drives synchronization by overpowering the deterministic self-attention drift when sufficiently coercive.
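
A schematic form of such a system, written only to fix ideas (the paper's precise drift, noise coefficients, and sphere-valued state space are not reproduced here), is an interacting particle SDE in which every particle is driven by the same Brownian motions:

```latex
% Schematic: a(.,.) stands for the self-attention-plus-MLP drift acting through the empirical
% measure, the sigma_k(.) for the common-noise coefficients, and the B^k for Brownian motions
% shared by all particles i = 1, ..., N.
dX^i_t \;=\; a\!\big(X^i_t,\,\mu^N_t\big)\,dt \;+\; \sum_{k\ge 1}\sigma_k\!\big(X^i_t\big)\,dB^k_t,
\qquad
\mu^N_t \;=\; \frac{1}{N}\sum_{j=1}^{N}\delta_{X^j_t}.
```

Because the increments dB^k_t are common to all particles, a pairwise difference X^i_t - X^j_t feels only the difference of drifts and of noise coefficients, which is the channel through which a sufficiently coercive noise can dominate the deterministic self-attention drift.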

If this is right

  • The layerwise token dynamics admit quantitative approximation by the continuous-time stochastic particle system.
  • The distribution of tokens obeys a specific stochastic partial differential equation in the scaling limit.
  • Propagation of chaos holds, so the tokens behave as independent copies of the limiting distribution for large token counts (see the schematic after this list).
  • Synchronization by noise occurs with exponential average dissipation of interaction energy when the coercivity condition holds.
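
Schematically, and conditional on the common noise (the paper's exact operator, metric, and rate are not visible from the abstract), the second and third bullets combine into a stochastic Fokker-Planck equation for the limiting token law together with a propagation-of-chaos statement for the empirical measure:

```latex
% Schematic only: Sigma = sum_k sigma_k sigma_k^T; the drift a, the coefficients sigma_k, and
% the Wasserstein rate are placeholders for the paper's objects.
d\mu_t \;=\; \Big[-\nabla\!\cdot\!\big(a(x,\mu_t)\,\mu_t\big)
  \;+\; \tfrac12\,\nabla^2\!:\!\big(\Sigma(x)\,\mu_t\big)\Big]\,dt
  \;-\; \sum_{k\ge 1}\nabla\!\cdot\!\big(\sigma_k(x)\,\mu_t\big)\,dB^k_t,
\qquad
\mathbb{E}\big[\mathcal{W}_2^2\big(\mu^N_t,\,\mu_t\big)\big] \;\xrightarrow[\;N\to\infty\;]{}\; 0 .
```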

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The scaling limit opens the possibility of analyzing very deep transformers using tools from stochastic analysis instead of discrete recursion.
  • The synchronization phenomenon might motivate adding controlled shared noise to finite-width transformers to encourage alignment of token representations during training.
  • The characterization of suitable activation functions supplies a concrete design criterion for nonlinearities that promote dissipative behavior in the continuous limit.

Load-bearing premise

The common noise must be sufficiently coercive relative to the deterministic self-attention drift, together with the specific conditions on activation functions; if this coercivity fails, the synchronization and exponential energy dissipation claims do not hold.
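
As a prototype of what the premise buys (with schematic constants: λ_σ for the contraction supplied by the common noise and L_a for a Lipschitz-type bound on the self-attention drift; the paper's actual condition involves the activation functions and is more specific), a Gronwall comparison yields exponential decay of the expected interaction energy exactly when the noise term wins:

```latex
% Prototype bound, not the paper's theorem: lambda_sigma and L_a are schematic constants.
\frac{d}{dt}\,\mathbb{E}\big|X^i_t - X^j_t\big|^2
  \;\le\; -2\,(\lambda_\sigma - L_a)\,\mathbb{E}\big|X^i_t - X^j_t\big|^2
\;\;\Longrightarrow\;\;
\mathbb{E}\big|X^i_t - X^j_t\big|^2
  \;\le\; e^{-2(\lambda_\sigma - L_a)t}\,\mathbb{E}\big|X^i_0 - X^j_0\big|^2,
\qquad \lambda_\sigma > L_a .
```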

What would settle it

Numerical integration of the limiting stochastic particle system with noise intensity below the coercivity threshold set by the self-attention drift, showing that the expected interaction energy fails to decay exponentially.
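
A minimal version of that check, sketched in Python under stated assumptions: an Euler-Maruyama discretization of sphere-valued particles with a softmax-attention drift and a single set of Brownian increments shared by all particles, sweeping the noise strength and fitting the exponential rate of the interaction energy. All constants, the drift form, and the crude renormalization step are illustrative; the actual threshold would have to be read off the paper's coercivity condition.

```python
import numpy as np

rng = np.random.default_rng(1)
n_particles, dim, t_end, dt = 64, 8, 20.0, 0.01
beta = 1.0                      # attention inverse temperature (illustrative)
steps = int(t_end / dt)

def attention_drift(x):
    # Drift toward the softmax-attention average (illustrative form of the self-attention drift).
    logits = beta * x @ x.T
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ x - x

def normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def energy(x):
    d = x[:, None, :] - x[None, :, :]
    return (d ** 2).sum(axis=-1).mean()

def fitted_rate(sigma):
    # Euler-Maruyama with one Brownian increment shared by every particle.
    x = normalize(rng.normal(size=(n_particles, dim)))
    energies = []
    for _ in range(steps):
        common = sigma * np.sqrt(dt) * rng.normal(size=dim)
        x = normalize(x + dt * attention_drift(x) + common)
        energies.append(energy(x))
    t = dt * np.arange(1, steps + 1)
    return np.polyfit(t, np.log(np.maximum(energies, 1e-12)), 1)[0]

for sigma in (0.1, 0.5, 1.0, 2.0):
    print(f"sigma = {sigma:4.1f}  fitted log-energy slope = {fitted_rate(sigma):+.3f}")
```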

Figures

Figures reproduced from arXiv: 2604.26898 by Andrea Agazzi, Eloy Mosig García, Giuseppe Bruno, Marco Romito, Samuele Saviozzi.

Figure 1. Noise-induced synchronization without attention. Evolution of the first two compo…
Figure 2. Numerical simulations of the discrete transformer dynamics with ReLU activation, …
Figure 3. Numerical simulations of the discrete transformer dynamics with SiLU activation, …
Figure 4. Numerical simulations of the discrete transformer dynamics with Sigmoid activation, …
Figure 5. Numerical simulations of the discrete transformer dynamics with Tanh activation, …
Figure 6. Numerical simulations of the discrete transformer dynamics without MLPs, …
Original abstract

We prove pathwise convergence of the layerwise evolution of tokens in a finite-depth, finite-width transformer model with MultiLayer Perceptron (MLP) blocks to a continuous-time stochastic interacting particle system. We also identify the stochastic partial differential equation describing the evolution of the tokens' distribution in this limit and prove propagation of chaos when the number of such tokens is large. The bounds we establish are quantitative and the limits we consider commute. We further prove that the limiting stochastic model displays synchronization by noise and establish exponential dissipation of the interaction energy on average, provided that the common noise is sufficiently coercive relative to the deterministic self-attention drift. We finally characterize the activation functions satisfying the former condition.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper proves pathwise convergence of the layerwise evolution of tokens in a finite-depth, finite-width transformer model with MLP blocks to a continuous-time stochastic interacting particle system. It identifies the SPDE for the limiting token distribution, establishes quantitative bounds with commuting limits, proves propagation of chaos for large token counts, and shows that the limiting model exhibits synchronization by noise with exponential dissipation of interaction energy on average, provided the common noise is sufficiently coercive relative to the deterministic self-attention drift. It also characterizes the activation functions satisfying this coercivity condition.

Significance. If the results hold, this work supplies a rigorous mathematical link between discrete transformer dynamics and continuous stochastic particle systems, with potential to explain synchronization phenomena in large models. The quantitative bounds, commuting limits, and explicit characterization of admissible activation functions are strengths that could support further analysis of scaling and emergent behavior in neural networks. The conditional nature of the synchronization result (tied to coercivity) is appropriately stated.

minor comments (3)
  1. The introduction could more explicitly state the precise scaling regime (e.g., how the layer step size and width enter the quantitative error bounds) to make the limit passage clearer to readers outside the immediate subfield.
  2. Notation for the interacting particle system and the common noise term should be introduced with a dedicated table or diagram in §2 to improve readability when tracking the passage from the discrete transformer to the SPDE.
  3. A brief remark on the well-posedness of the limiting SPDE under the stated coercivity assumption would help readers verify that the synchronization result applies to a unique solution.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of our results on pathwise convergence of transformer token dynamics to stochastic particle systems and SPDEs, as well as the recognition of the quantitative bounds, commuting limits, propagation of chaos, and the conditional synchronization-by-noise result under coercivity. We appreciate the recommendation for minor revision and the acknowledgment that the conditional nature of the synchronization is appropriately stated. No major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity; the derivation is a self-contained mathematical proof.

full rationale

The paper establishes pathwise convergence of finite-depth finite-width transformer token dynamics (with MLP blocks) to a continuous-time stochastic interacting particle system, identifies the limiting SPDE, proves propagation of chaos, and shows conditional synchronization by noise with exponential energy dissipation. All steps are quantitative limit passages under explicitly stated assumptions on activation functions and a coercivity condition on common noise relative to self-attention drift. The final characterization of admissible activation functions is derived as part of the proof rather than presupposed. No parameter fitting, self-definitional reductions, load-bearing self-citations, or imported uniqueness theorems appear in the claimed chain; the argument remains conditional and externally falsifiable via the stated coercivity requirement.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard assumptions from stochastic analysis and the transformer architecture; no free parameters are fitted and no new entities are postulated.

axioms (2)
  • domain assumption The transformer has finite depth and width with standard self-attention and MLP blocks whose evolution can be tracked layerwise.
    Invoked as the starting point for taking the scaling limit.
  • standard math Existence, uniqueness, and well-posedness of solutions to the limiting stochastic particle system and SPDE.
    Required for the convergence statements and propagation of chaos.

pith-pipeline@v0.9.0 · 5425 in / 1486 out tokens · 63989 ms · 2026-05-07T10:49:54.334224+00:00 · methodology

discussion (0)


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Uniform Scaling Limits in AdamW-Trained Transformers

    stat.ML · 2026-05 · unverdicted · novelty 7.0

    AdamW-trained transformer hidden states and backpropagated variables converge uniformly in L2 to a forward-backward ODE system (McKean-Vlasov when non-causal) at rate O(L^{-1}+L^{-1/3}H^{-1/2}) as depth L and heads H ...

  2. Quantifying Concentration Phenomena of Mean-Field Transformers in the Low-Temperature Regime

    math.AP · 2026-05 · unverdicted · novelty 6.0

    In the low-temperature regime, the token distribution in mean-field transformers concentrates onto the push-forward under a key-query-value projection with Wasserstein distance scaling as √(log(β+1)/β) exp(Ct) + exp(-ct).

Reference graph

Works this paper leans on

68 extracted references · 19 canonical work pages · cited by 2 Pith papers · 3 internal anchors
