
arxiv: 2605.18870 · v1 · pith:CD3HLOFMnew · submitted 2026-05-15 · 💻 cs.LG · math.AP · math.FA

Multi-Headed Transformer Architectures as Time-dependent Wasserstein Gradient Flows

Pith reviewed 2026-05-20 20:31 UTC · model grok-4.3

classification 💻 cs.LG · math.AP · math.FA
keywords transformer architectures · Wasserstein gradient flows · interaction energy · attention mechanism · multi-headed transformers · time-dependent flows · stability analysis · Gamma-convergence

The pith

Multi-headed transformers process data as time-dependent Wasserstein gradient flows of an attention interaction energy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper connects multi-headed transformer architectures to time-dependent gradient flows in the Wasserstein space of probability measures. The flows are driven by an interaction energy designed to encode the attention mechanism, and the explicit time dependence lets the weights vary across heads and layers without constraining the initialization. Under an integrability assumption on the evolution of the weights, every element of the flows' ω-limit set is a stationary point of the energy at a limiting weight distribution. The stability results establish continuous dependence on the initial data, uniqueness of the flows, and Γ-convergence of the perturbed energies to the unperturbed one. Numerical experiments confirm the energy-dissipation identity and illustrate the asymptotic behavior in the Ornstein–Uhlenbeck and oscillating-weight regimes.

Core claim

The data flow in multi-headed transformer architectures is modeled as time-dependent gradient flows for a suitable interaction energy capturing the design of the attention mechanism. Under a suitable integrability assumption on the evolution of the weights, each element of the ω-limit set of the gradient flows is a stationary point of the interaction energy at a limiting weight distribution. The flows are stable under perturbations of both the initial data and the weights: they depend continuously on the initial data, and Γ-convergence of the perturbed energies to the unperturbed one yields convergence of the corresponding flows.

What carries the argument

Time-dependent Wasserstein gradient flow driven by an interaction energy that replicates the attention mechanism.
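
As a rough orientation, a generic time-dependent interaction-energy flow can be written as follows. This is a formal sketch under stated assumptions, not the paper's exact functional: the symmetric kernel W_t and the attention-type choice mentioned in the comment are placeholders introduced here for illustration.

```latex
% Illustrative generic form. W_t is a placeholder symmetric kernel; one
% attention-type choice (an assumption, not taken from the paper) would be
% W_t(x,y) = -\frac{1}{H}\sum_{h=1}^{H} e^{\langle x, A_h(t)\, y\rangle}
% with symmetric head matrices A_h(t).
\begin{aligned}
\mathcal{E}_t[\mu] &= \frac{1}{2}\iint W_t(x,y)\,\mathrm{d}\mu(x)\,\mathrm{d}\mu(y),
  \qquad W_t(x,y) = W_t(y,x),\\
\partial_t \mu_t + \nabla\!\cdot(\mu_t v_t) &= 0,
  \qquad v_t(x) = -\nabla_x \int W_t(x,y)\,\mathrm{d}\mu_t(y),\\
\frac{\mathrm{d}}{\mathrm{d}t}\,\mathcal{E}_t[\mu_t]
  &= -\int |v_t|^2\,\mathrm{d}\mu_t
     + \frac{1}{2}\iint \partial_t W_t(x,y)\,\mathrm{d}\mu_t(x)\,\mathrm{d}\mu_t(y).
\end{aligned}
```

The last line is the formal energy balance. The first term is the usual dissipation, while the second measures how fast the moving weights change the energy itself, which suggests why some control on the time variation of the weights is needed before long-time limits can be identified.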

Load-bearing premise

The weights of the transformer evolve with time in a manner satisfying a suitable integrability condition.

What would settle it

A numerical simulation or theoretical construction where the long-time limit of the flow is not a stationary point for the interaction energy under the limiting weights.
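
One way such a check could be set up numerically is sketched below. It is a toy particle experiment, not the paper's code: the symmetric head matrices `A_h(t) = I + exp(-t) * B_h`, the unnormalized exponential kernel, the explicit Euler step with renormalization, and all names (`velocity`, `stationarity_residual`, `simulate`) are assumptions made for illustration. The residual is the mean squared tangential velocity evaluated with the limiting weights, so a residual that refuses to decay along a long trajectory would be worth a closer look.

```python
import numpy as np

def velocity(x, heads):
    """Tangential attention-type velocity on the unit sphere.

    x: (n, d) unit vectors; heads: (H, d, d) symmetric matrices A_h (assumed form).
    """
    ax = np.einsum("hde,ne->hnd", heads, x)      # A_h x_j for every head and token
    scores = np.einsum("id,hjd->hij", x, ax)     # <x_i, A_h x_j>
    v = np.einsum("hij,hjd->id", np.exp(scores), ax)
    v /= x.shape[0] * heads.shape[0]             # average over tokens and heads
    # Remove the radial part so that tokens stay on the sphere to first order.
    return v - np.sum(v * x, axis=1, keepdims=True) * x

def stationarity_residual(x, limit_heads):
    """Mean squared tangential velocity under the limiting weights."""
    v = velocity(x, limit_heads)
    return float(np.mean(np.sum(v**2, axis=1)))

def simulate(weights_fn, x0, t_end=10.0, dt=0.01):
    """Explicit Euler steps followed by projection back onto the sphere."""
    x = x0.copy()
    for k in range(int(round(t_end / dt))):
        x = x + dt * velocity(x, weights_fn(k * dt))
        x /= np.linalg.norm(x, axis=1, keepdims=True)
    return x

rng = np.random.default_rng(0)
n, d, H = 32, 3, 4

# Tokens scattered around the north pole of S^2 (toy initial condition).
x0 = np.array([0.0, 0.0, 1.0]) + 0.3 * rng.standard_normal((n, d))
x0 /= np.linalg.norm(x0, axis=1, keepdims=True)

# Symmetric perturbations with Frobenius norm 0.2 (hypothetical choice).
B = rng.standard_normal((H, d, d))
B = 0.5 * (B + B.transpose(0, 2, 1))
B *= 0.2 / np.linalg.norm(B, axis=(1, 2), keepdims=True)

limit_heads = np.stack([np.eye(d)] * H)

def relaxing(t):
    # Weights that settle exponentially fast onto the limiting weights.
    return limit_heads + np.exp(-t) * B

x_end = simulate(relaxing, x0)
r0 = stationarity_residual(x0, limit_heads)
r_end = stationarity_residual(x_end, limit_heads)
print(f"residual at t=0: {r0:.3e}, at t=10: {r_end:.3e}")
```

Swapping `relaxing` for another schedule, for instance one whose weights keep oscillating, only requires passing a different `weights_fn`; no outcome is claimed here for that case.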

Figures

Figures reproduced from arXiv: 2605.18870 by Alex Massucco, Carola-Bibiane Schönlieb, Christoph Brune, Leonardo Del Grande, Marcello Carioni.

Figure 1. Token cloud at t ∈ {0, 16, 18, 20} on S² (n = 150 tokens). Top row: OU weights (Monte Carlo mean over N_MC = 20 trajectories), showing progressive token clustering consistent with convergence to a stationary point of E as predicted by Theorem 4.16. Bottom row: oscillating weights, showing no persistent clustering, consistent with the absence of a stationary limit when Assumption (62) is violated. Ornstein–…
Figure 2. Strong upper-gradient analysis on S² for H ∈ {1, 10, 100}. Left column: G_t² versus time. Right column: time-averaged mean G² versus H with best-fit curve aH^b (dashed) and 95% uncertainty band. Top row: OU weights (Monte Carlo mean over N_MC = 20 trajectories), confirming the gradient-variance decay of (84) with O(H⁻¹) quantitative scaling. Bottom row: oscillating weights, exhibiting the non-decaying be…
Figure 3. Figure 3 (right) plots the identity in (81) along with its individual terms. As predicted by the theory, the energy balance is again preserved in time. Finally, …
Original abstract

In recent years, transformer architectures have revolutionized the field of language processing, opening the door to previously unforeseen possibilities. However, from a theoretical point of view, the mathematical models proposed in the literature often lack direct contact with the actual architectures and depend on strong simplifying assumptions. In this paper, we reduce this gap by modelling the data flow in multi-headed transformer architectures as time-dependent gradient flows for a suitable interaction energy capturing the design of the attention mechanism. The explicit dependence on time allows us to consider different weights for each head and for each layer, without imposing constraints on the initialization method. Moreover, we prove that, under a suitable integrability assumption on the evolution of the weights, each element of the $\omega$-limit set of the gradient flows is a stationary point of the interaction energy at a limiting weight distribution. Finally, we analyse the stability of the gradient flows considering perturbations of both the initial data and the weights. Specifically, on the one hand, we study the robustness of the proposed models with respect to noisy inputs, establishing a continuous dependence of the gradient flows on the initial data and uniqueness of the flows. On the other hand, we prove the $\Gamma$-convergence of the perturbed interaction energy to the unperturbed one, leading to the convergence of the corresponding gradient flows. We complement these theoretical results with numerical experiments that confirm the predicted energy-dissipation identity and clarify the asymptotic behavior of the dynamics in both the autonomous-like (Ornstein--Uhlenbeck) and the genuinely non-autonomous (oscillating-weights) regimes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper models the data flow through multi-headed transformer architectures as time-dependent gradient flows in the Wasserstein space, driven by an interaction energy constructed to encode the attention mechanism. This time-dependent formulation permits distinct weights per head and layer without initialization restrictions. Under a suitable integrability assumption on the weight trajectories, the authors prove that every element of the ω-limit set of the flows is a stationary point of the interaction energy evaluated at a limiting weight distribution. They further establish stability by proving continuous dependence on initial data, uniqueness of the flows, and Γ-convergence of perturbed interaction energies to the unperturbed one, with numerical experiments confirming the energy-dissipation identity in both autonomous-like and oscillating-weight regimes.

Significance. If the central claims hold, the work supplies a rigorous dynamical-systems perspective on transformer attention that directly incorporates the multi-head, multi-layer structure and avoids overly restrictive initialization assumptions. The combination of time-dependent gradient-flow analysis, ω-limit stationarity, and Γ-convergence results offers a potential route to theoretical guarantees on convergence and robustness. The numerical illustrations of energy dissipation in both autonomous and genuinely non-autonomous settings add concrete support. The principal limitation is that the key integrability hypothesis remains unverified against actual transformer weight trajectories.

major comments (1)
  1. [main convergence theorem / energy-dissipation identity] The theorem establishing ω-limit stationarity (the result stated after the modeling section and proved via the energy-dissipation identity): the integrability assumption on the evolution of the weights is invoked to obtain compactness in the space of measures and to pass to the limit, yet the manuscript neither derives this condition from the discrete or continuous transformer update rules nor supplies numerical checks confirming that the time-integral of weight variations remains finite along realistic trajectories. If the integral diverges, the claimed stationarity of ω-limit points need not hold.
minor comments (1)
  1. [numerical experiments] The numerical-experiments section would benefit from an explicit statement of the precise functional form chosen for the oscillating weights in the non-autonomous regime, together with the discretization scheme used to approximate the Wasserstein gradient flow.

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for their thorough review and valuable feedback on our manuscript. We appreciate the recognition of the potential of our dynamical-systems perspective on transformer attention. Below, we provide a point-by-point response to the major comment.

Point-by-point responses
  1. Referee: The theorem establishing ω-limit stationarity (the result stated after the modeling section and proved via the energy-dissipation identity): the integrability assumption on the evolution of the weights is invoked to obtain compactness in the space of measures and to pass to the limit, yet the manuscript neither derives this condition from the discrete or continuous transformer update rules nor supplies numerical checks confirming that the time-integral of weight variations remains finite along realistic trajectories. If the integral diverges, the claimed stationarity of ω-limit points need not hold.

    Authors: We agree with the referee that the integrability assumption plays a central role in establishing the compactness needed to identify the ω-limit points as stationary for the limiting interaction energy. Our modeling approach treats the weight trajectories as externally given time-dependent functions to accommodate the multi-head and multi-layer structure without restrictive initialization assumptions; consequently, the integrability condition is imposed at the level of the continuous model rather than derived from discrete update rules. This is a modeling choice that allows flexibility but leaves open the question of whether realistic transformer training satisfies the condition. Our numerical experiments illustrate the energy-dissipation identity under oscillating weights, which presupposes bounded variations in the tested regimes, but we did not explicitly compute or report the time-integral of weight changes for realistic trajectories. In the revised manuscript, we will add a dedicated paragraph in the discussion section clarifying the nature of this assumption, its necessity for the non-autonomous setting, and its relation to the convergence of training dynamics. We will also include a brief numerical illustration using a small-scale transformer simulation to check the finiteness of the integral in the oscillating regime. We believe these additions will strengthen the presentation without altering the core theoretical results. revision: yes
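
The finiteness check mentioned in the response could look like the sketch below. It assumes the hypothesis is read as finiteness of the total variation of a weight trajectory, that is, the integral of the Frobenius norm of dA/dt over [0, T] as T grows. This is one reading of the referee's phrase "time-integral of weight variations" and may differ from the paper's exact condition. The two schedules and the matrix `B` are toy choices, not taken from the paper.

```python
import numpy as np

def total_variation(weights_fn, t_end, dt=1e-2):
    """Approximate the integral of ||dA/dt||_F over [0, t_end].

    The integral is replaced by the sum of Frobenius norms of the
    increments of A(t) on a uniform time grid.
    """
    ts = np.arange(0.0, t_end + dt / 2, dt)
    traj = np.stack([weights_fn(t) for t in ts])   # shape (steps, d, d)
    increments = np.linalg.norm(np.diff(traj, axis=0), axis=(1, 2))
    return float(np.sum(increments))

# Hypothetical limiting weights and perturbation direction.
A_inf = np.eye(2)
B = np.array([[0.3, 0.1],
              [0.1, -0.2]])

def relaxing(t):
    # Weights converging exponentially to A_inf: finite total variation.
    return A_inf + np.exp(-t) * B

def oscillating(t):
    # Weights that keep oscillating: total variation grows with the horizon.
    return A_inf + np.sin(t) * B

relax_20 = total_variation(relaxing, 20.0)
relax_40 = total_variation(relaxing, 40.0)
osc_20 = total_variation(oscillating, 20.0)
osc_40 = total_variation(oscillating, 40.0)

print(f"relaxing:    T=20 -> {relax_20:.4f}, T=40 -> {relax_40:.4f}")
print(f"oscillating: T=20 -> {osc_20:.4f}, T=40 -> {osc_40:.4f}")
```

For the relaxing schedule the sum telescopes and saturates at the Frobenius norm of `B`, whereas for the sinusoidal schedule it keeps growing roughly linearly with the horizon. Applying the same function to weight trajectories recorded from an actual training run would be the analogous empirical check.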

standing simulated objections not resolved
  • Empirical verification of the integrability assumption using weight trajectories from large-scale, real-world transformer training runs

Circularity Check

1 step flagged

Interaction energy chosen to encode attention by construction; modeling step definitional

specific steps
  1. self-definitional [Abstract]
    "we reduce this gap by modelling the data flow in multi-headed transformer architectures as time-dependent gradient flows for a suitable interaction energy capturing the design of the attention mechanism"

    The energy is explicitly constructed ('suitable ... capturing the design') to match the attention mechanism; therefore the claim that the architecture 'is' the gradient flow of this energy holds by the choice of functional rather than by independent derivation or verification against transformer equations.

full rationale

The paper's core modeling step selects a 'suitable interaction energy capturing the design of the attention mechanism' and then represents transformer data flow as its gradient flow. This is self-definitional rather than derived from independent data or first principles. The subsequent omega-limit result and stability analysis rest on an external integrability assumption that is not shown to hold for actual transformer trajectories, but the proof itself does not reduce to a fit or self-citation chain. No other load-bearing circular steps (fitted predictions, uniqueness theorems, or renamed empirical patterns) are present. Overall circularity remains low.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central modeling step defines an interaction energy to capture attention and invokes an integrability condition on weight trajectories; no free parameters are explicitly fitted in the abstract.

axioms (1)
  • domain assumption: suitable integrability assumption on the evolution of the weights
    Invoked to guarantee that omega-limit points are stationary for the limiting energy.
invented entities (1)
  • interaction energy capturing the design of the attention mechanism (no independent evidence)
    purpose: To drive the Wasserstein gradient flow that reproduces transformer data flow
    Constructed specifically for the multi-head attention update rule

pith-pipeline@v0.9.0 · 5832 in / 1259 out tokens · 50704 ms · 2026-05-20T20:31:56.556929+00:00 · methodology

