Training Infinitely Deep and Wide Transformers

Gabriel Peyré; Maarten V. de Hoop; Raphaël Barboni; Takashi Furuya

arxiv: 2605.17660 · v1 · pith:6QDFC2Y4new · submitted 2026-05-17 · 🧮 math.OC · cs.AI · cs.LG · stat.ML

Pith reviewed 2026-05-19 22:08 UTC · model grok-4.3

classification 🧮 math.OC · cs.AI · cs.LG · stat.ML

keywords transformers · mean-field limit · gradient flow · Wasserstein metric · Neural Tangent Kernel · attention mechanisms · global convergence · optimization landscape

The pith

Gradient flow in the conditional Wasserstein metric converges to global minima for infinitely deep and wide transformers when the initial loss is small enough and the attention NTK is injective.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a mean-field model in which both the depth and width of a transformer tend to infinity. In this limit the forward pass becomes a well-posed evolution of token distributions governed by ODEs in function space, while the training risk is minimized by a gradient flow in the conditional Wasserstein metric. The central theorem states that this flow reaches a global minimizer whenever the starting loss is sufficiently small, once an injectivity condition on the Neural Tangent Kernel of the attention mechanism is satisfied. This injectivity is shown to be equivalent to the linear independence of log-sum-exp functions modulo affine functions and holds for a wide class of token distributions. The result therefore rules out spurious local minima in the optimization landscape of the infinite transformer under the stated assumptions.

Core claim

Under the Neural Tangent Kernel injectivity assumption—equivalent to linear independence of log-sum-exp functions modulo affine functions—gradient flow in the conditional Wasserstein metric converges to global minima of the training risk for mean-field transformers whenever the initial loss is sufficiently small.
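
The role of the two hypotheses (an injectivity-type condition and a small initial loss) can be illustrated in a finite-dimensional analogue. The sketch below is the standard local Polyak-Łojasiewicz argument for a Euclidean gradient flow. The constant \lambda, the radius r, and the reading of NTK injectivity as a PL-type lower bound are assumptions made for illustration; they are not taken from the paper's proof, which works in the conditional Wasserstein metric.

```latex
% Finite-dimensional analogue (illustrative; not the paper's statement).
% Gradient flow: \dot\theta(t) = -\nabla R(\theta(t)), with R \ge 0.

% Assumed local PL inequality on the ball B(\theta_0, r):
\[
  \|\nabla R(\theta)\|^2 \;\ge\; 2\lambda\, R(\theta),
  \qquad \theta \in B(\theta_0, r).
\]

% Decay of the risk while the trajectory stays in the ball:
\[
  \frac{d}{dt} R(\theta(t)) = -\|\nabla R(\theta(t))\|^2
  \le -2\lambda\, R(\theta(t))
  \;\Longrightarrow\;
  R(\theta(t)) \le R(\theta_0)\, e^{-2\lambda t}.
\]

% Length of the trajectory, using \|\dot\theta\| = \|\nabla R\|:
\[
  \frac{d}{dt}\sqrt{R(\theta(t))}
  = -\frac{\|\nabla R(\theta(t))\|^2}{2\sqrt{R(\theta(t))}}
  \le -\sqrt{\tfrac{\lambda}{2}}\;\|\dot\theta(t)\|
  \;\Longrightarrow\;
  \int_0^T \|\dot\theta(t)\|\,dt \le \sqrt{\frac{2\,R(\theta_0)}{\lambda}}.
\]

% If \sqrt{2 R(\theta_0)/\lambda} < r, the flow cannot leave B(\theta_0, r),
% so the PL inequality holds for all t and R(\theta(t)) \to 0.
```

In this analogue the smallness of the initial loss is what keeps the trajectory inside the region where the lower bound on the gradient is available.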

What carries the argument

The conditional Wasserstein gradient flow of the training risk, derived from adjoint sensitivity analysis on the forward and backward ODEs that govern the evolution of token distributions and attention parameters.
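
For readers unfamiliar with adjoint sensitivity, the following is the classical finite-dimensional version for a single state x_s driven by a depth-dependent control \theta_s. It is a sketch under simplifying assumptions: the velocity field V, the terminal loss \ell, and the Euclidean setting are placeholders, and the paper's formula involves token distributions and the conditional Wasserstein geometry rather than this reduced form.

```latex
% Finite-dimensional adjoint sensitivity (illustrative only).

% Forward ODE over depth s \in [0,1] with control \theta_s:
\[
  \dot x_s = V(x_s, \theta_s), \qquad x_0 = x_{\mathrm{in}},
  \qquad \mathcal{R}(\theta) = \ell(x_1).
\]

% Backward (adjoint) ODE, integrated from s = 1 down to s = 0:
\[
  \dot p_s = -\,\partial_x V(x_s, \theta_s)^{\top} p_s,
  \qquad p_1 = \nabla \ell(x_1).
\]

% Gradient of the risk with respect to the control at depth s:
\[
  \nabla_{\theta_s} \mathcal{R} = \partial_\theta V(x_s, \theta_s)^{\top} p_s .
\]
```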

If this is right

The optimization landscape of the infinite transformer contains no spurious local minima once the NTK injectivity condition holds.
Gradient-based training reaches a global minimum from any sufficiently low initial loss value.
Token distributions evolve according to unique flow maps that satisfy well-posed ODEs in an appropriate function space.
The injectivity condition is satisfied by discrete distributions, uniform distributions, and Gaussian mixtures.
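
As a rough illustration of a forward pass governed by an ODE over depth, the sketch below discretizes such an ODE with explicit Euler steps for a finite set of tokens and a finite number of heads. It assumes standard softmax attention with head parameters (Q, K, V) averaged over heads; the function names `attention_velocity`, `forward`, and `heads_at`, and the toy parameter schedule, are hypothetical and not taken from the paper.

```python
import numpy as np

def attention_velocity(X, heads):
    """Average over heads of softmax attention on the empirical token measure.

    X: (n, d) array of tokens; heads: list of (Q, K, V) triples of (d, d) arrays.
    """
    out = np.zeros_like(X)
    for Q, K, V in heads:
        scores = (X @ Q.T) @ (X @ K.T).T             # scores[i, j] = <Q x_i, K x_j>
        scores -= scores.max(axis=1, keepdims=True)  # stabilize the softmax
        weights = np.exp(scores)
        weights /= weights.sum(axis=1, keepdims=True)
        out += weights @ (X @ V.T)                   # row i: sum_j w_ij V x_j
    return out / len(heads)

def forward(X0, heads_at, depth):
    """Explicit Euler on s in [0, 1]: X <- X + velocity(s_k) / depth, s_k = k / depth."""
    X = X0.copy()
    for k in range(depth):
        X = X + attention_velocity(X, heads_at(k / depth)) / depth
    return X

# Toy setup (assumed for illustration): 5 tokens in R^3, 4 heads,
# value matrices that vary smoothly with the depth variable s.
rng = np.random.default_rng(0)
d, n, H = 3, 5, 4
base = [tuple(0.3 * rng.standard_normal((d, d)) for _ in range(3)) for _ in range(H)]

def heads_at(s):
    return [(Q, K, (1.0 + s) * V) for Q, K, V in base]

X0 = rng.standard_normal((n, d))
coarse = forward(X0, heads_at, 50)
fine = forward(X0, heads_at, 100)
finer = forward(X0, heads_at, 200)

# Successive refinements should get closer as the number of layers grows.
print(np.abs(coarse - fine).max(), np.abs(fine - finer).max())
```

Doubling the number of layers and comparing outputs gives an informal view of how a deep but finite network approaches a depth-continuous limit; it says nothing quantitative about the rates that the referee report asks for.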

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Large finite transformers may inherit approximate global convergence when their width and depth are big enough for the mean-field description to be accurate.
The linear-independence condition on log-sum-exp functions could be checked numerically on real data sets to predict whether global convergence is likely.
The same conditional-Wasserstein-flow approach may be adaptable to other attention-based or coupling-based architectures.
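
A numerical check of the kind suggested above might look as follows. This is a minimal sketch under a hypothetical parameterization: the log-sum-exp functions are taken to be f_A(x) = log of the average over tokens y_j of exp(<x, A y_j>), indexed by matrices A, and independence modulo affine functions is tested by the column rank of sampled values. The paper's exact function family and hypotheses may differ, and the names `lse` and `independent_mod_affine` are illustrative.

```python
import numpy as np

def lse(samples, A, tokens):
    """f_A(x) = log(mean_j exp(<x, A y_j>)) evaluated at each row x of `samples`."""
    z = samples @ A @ tokens.T            # z[i, j] = <x_i, A y_j>
    m = z.max(axis=1, keepdims=True)      # subtract the row max for stability
    return m[:, 0] + np.log(np.exp(z - m).mean(axis=1))

def independent_mod_affine(samples, params, tokens, tol=1e-8):
    """True if the sampled columns [1, x, f_A1, ..., f_AK] have full column rank."""
    affine = np.hstack([np.ones((len(samples), 1)), samples])
    columns = np.column_stack([affine] + [lse(samples, A, tokens) for A in params])
    sv = np.linalg.svd(columns, compute_uv=False)   # singular values, descending
    return bool(sv[-1] > tol * sv[0])

# Toy data (assumed for illustration): 200 query points in R^2, 3 parameter matrices.
rng = np.random.default_rng(0)
d, K = 2, 3
samples = rng.standard_normal((200, d))
params = [rng.standard_normal((d, d)) for _ in range(K)]
five_tokens = rng.standard_normal((5, d))   # generic discrete distribution
one_token = rng.standard_normal((1, d))     # single Dirac mass

print(independent_mod_affine(samples, params, five_tokens))
print(independent_mod_affine(samples, params, one_token))
```

Full column rank on the samples certifies that the chosen functions are linearly independent modulo affine functions, whereas a rank deficiency only suggests dependence up to the tolerance. With a single token the function f_A reduces to the linear map x -> <x, A y>, so the check must fail in that case under this parameterization; this illustrates the test and is not a statement about the paper's hypotheses.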

Load-bearing premise

The mean-field limit, together with the NTK injectivity condition, continues to describe the behavior of the large but finite transformers that are actually trained.

What would settle it

A concrete token distribution for which the associated log-sum-exp functions are linearly dependent modulo affine functions, together with a numerical check that the resulting gradient flow reaches a local minimum rather than a global one even from small initial loss.

Original abstract

Transformers have become the dominant architecture in modern machine learning, yet the theoretical understanding of their training dynamics remains limited. This paper develops a rigorous mathematical framework for analyzing gradient-based training of transformers in the mean-field regime, where both the depth (number of layers) and width (number of attention heads) tend to infinity. While ResNet training can be understood as controlling a neural ODE, transformer training corresponds to controlling a neural PDE, due to the coupling of multiple token distributions through the attention mechanism. Our mean-field model features two types of measure representations: token distributions evolving through layers and attention parameters at each layer. We establish well-posedness of the forward pass through infinitely deep transformers, characterizing token evolution via flow maps that satisfy ODEs in function spaces. Using adjoint sensitivity analysis, we derive an explicit formula for the conditional Wasserstein gradient of the training risk, involving adjoint variables governed by backward ODEs. We prove the existence and uniqueness of gradient flow curves in the conditional Wasserstein metric space, establishing a rigorous foundation for gradient-based transformer training. A key technical contribution is providing necessary and sufficient conditions for injectivity of the Neural Tangent Kernel (NTK) for attention mechanisms: we show that NTK injectivity is equivalent to linear independence of log-sum-exp functions modulo affine functions, a condition satisfied by diverse token distributions, including discrete distributions, uniform distributions, and Gaussian mixtures. Under this NTK injectivity assumption, we prove that gradient flow converges to global minima when the initial loss is sufficiently small, eliminating spurious local minima from the optimization landscape.
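
The two measure representations mentioned in the abstract can be written schematically as follows. This is a hedged sketch using the standard softmax self-attention map extended to token measures; the symbols \Gamma_\theta, \mu_s, \nu_s and \mathcal{V}_s are notation assumed here and may not match the paper's definitions.

```latex
% Schematic mean-field transformer (assumed notation, for orientation only).

% Attention with parameters \theta = (Q, K, V) acting on a token measure \mu:
\[
  \Gamma_\theta[\mu](x)
  = \frac{\int e^{\langle Qx,\, Ky\rangle}\, V y \; d\mu(y)}
         {\int e^{\langle Qx,\, Ky\rangle}\; d\mu(y)} .
\]

% Infinite width: average over a distribution \nu_s of head parameters at depth s.
\[
  \mathcal{V}_s[\mu](x) = \int \Gamma_\theta[\mu](x)\; d\nu_s(\theta).
\]

% Infinite depth: each token follows an ODE driven by the current token measure,
\[
  \frac{d}{ds}\, x_s = \mathcal{V}_s[\mu_s](x_s), \qquad s \in [0,1],
\]
% and the token measure solves the associated continuity equation (the "neural PDE"):
\[
  \partial_s \mu_s + \nabla \cdot \big( \mu_s\, \mathcal{V}_s[\mu_s] \big) = 0 .
\]
```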

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper builds a mean-field PDE model for infinite-depth and infinite-width transformers, derives an explicit conditional Wasserstein gradient via adjoints, and proves global convergence under an NTK injectivity condition phrased as linear independence of log-sum-exp functions.

The core contribution is a function-space treatment of transformer training as a controlled neural PDE rather than an ODE. Tokens evolve as measures coupled by attention, attention parameters act as controls at each layer, and the authors obtain an explicit formula for the gradient in the conditional Wasserstein metric. They also give a necessary-and-sufficient criterion for NTK injectivity that reduces to linear independence of log-sum-exp functions modulo affine maps, and they verify that it holds for discrete distributions, uniform distributions, and Gaussian mixtures. Under that assumption they show gradient flow reaches global minima whenever the initial loss is small enough.

The well-posedness of the forward flow maps and the existence of the gradient-flow curves are established with standard tools from measure-valued ODEs and adjoint sensitivity. That part is cleanly executed and extends the ResNet-ODE literature in a natural direction. The technical step on the NTK condition is independent of the convergence argument and stands on its own.

The main gap is the missing link to finite models. The argument supplies no quantitative stability or approximation rates that would control how close a large but finite transformer stays to the infinite limit, so discretization errors in depth or width could reintroduce critical points or violate the independence condition before the loss gets small. The stress-test note correctly flags this.

Readers who already work with mean-field limits, Wasserstein geometry, or neural ODEs will find the setup useful and the proofs worth checking. The paper is formally grounded enough to merit a serious referee, even if the finite-to-infinite passage needs more work in revision.

Referee Report

2 major / 2 minor

Summary. The paper develops a mean-field analysis for training transformers with both depth and width tending to infinity. It models the system as a neural PDE coupling token distributions and attention parameters, proves well-posedness of the infinite-depth forward pass via flow maps and function-space ODEs, derives the training gradient explicitly using adjoint sensitivity in the conditional Wasserstein metric, establishes existence and uniqueness of gradient-flow trajectories, characterizes NTK injectivity as linear independence of log-sum-exp functions modulo affine maps (satisfied by discrete, uniform, and Gaussian-mixture token distributions), and proves that, under this injectivity assumption, gradient flow converges to global minima whenever the initial loss is sufficiently small.

Significance. If the derivations hold, the work supplies a rigorous PDE-theoretic foundation for transformer optimization that parallels neural-ODE analyses of ResNets while accounting for the attention-induced coupling of token measures. The explicit adjoint formula for the conditional Wasserstein gradient and the necessary-and-sufficient NTK-injectivity characterization are technically substantive contributions. The global-convergence result under a verifiable injectivity condition would be a notable advance for understanding the absence of spurious local minima in the infinite-width/depth regime.

major comments (2)

[§5 (global convergence theorem)] The global-convergence statement (presumably Theorem 5.3 or the main result in §5) assumes NTK injectivity persists along the entire gradient-flow trajectory. The manuscript must verify or prove that the linear-independence condition on log-sum-exp functions is preserved (or at least not violated) by the evolving token distributions under the forward PDE; without this invariance the assumption cannot be maintained from a small initial loss to the global minimum.
[§2–3 (mean-field limit and passage from finite to infinite)] No quantitative approximation rates or stability estimates are supplied that control the distance between the finite-width/depth transformer and its mean-field PDE limit. Because the central claim concerns practical gradient-based training, the absence of such bounds leaves open whether discretization or finite-head errors can reintroduce critical points before the loss becomes small (as noted in the stress-test concern).

minor comments (2)

[§2] Notation for the conditional Wasserstein metric and the precise coupling between token measures and attention parameters should be introduced with a short self-contained paragraph early in §2 to aid readers unfamiliar with Wasserstein geometry on product spaces.
[§3.1] The regularity assumptions (e.g., Lipschitz constants, moment bounds) required for well-posedness of the function-space ODEs and for the adjoint equations should be stated explicitly rather than left implicit in the existence proofs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading of our manuscript and the constructive comments. We address the major comments point by point below.

Referee: [§5 (global convergence theorem)] The global-convergence statement (presumably Theorem 5.3 or the main result in §5) assumes NTK injectivity persists along the entire gradient-flow trajectory. The manuscript must verify or prove that the linear-independence condition on log-sum-exp functions is preserved (or at least not violated) by the evolving token distributions under the forward PDE; without this invariance the assumption cannot be maintained from a small initial loss to the global minimum.

Authors: We thank the referee for highlighting this subtlety. The global convergence result in §5 is established under the assumption that the NTK injectivity condition (linear independence of log-sum-exp functions modulo affine maps) holds for the token distributions. In the revised manuscript we will add a remark immediately following the statement of the theorem clarifying that the assumption is required to hold along the entire trajectory. We will further include a short continuity argument showing that, because the forward evolution is realized by smooth flow maps that act continuously on the space of measures in the weak topology, and because the set of measures satisfying the independence condition is open, the property is preserved whenever the initial loss is sufficiently small.
revision: partial

Referee: [§2–3 (mean-field limit and passage from finite to infinite)] No quantitative approximation rates or stability estimates are supplied that control the distance between the finite-width/depth transformer and its mean-field PDE limit. Because the central claim concerns practical gradient-based training, the absence of such bounds leaves open whether discretization or finite-head errors can reintroduce critical points before the loss becomes small (as noted in the stress-test concern).

Authors: We agree that quantitative approximation rates between finite transformers and the mean-field PDE limit would strengthen the link to practical training. However, the present work concentrates on the rigorous analysis of the infinite-depth, infinite-width regime itself: well-posedness of the neural PDE, derivation of the conditional Wasserstein gradient via adjoint sensitivity, and global convergence under the NTK injectivity condition. Establishing explicit rates would require additional stability estimates for the coupled attention mechanism in the conditional Wasserstein metric and is a substantial undertaking that lies outside the current scope. In the revised manuscript we will insert a brief paragraph in the introduction and a corresponding note in the conclusion acknowledging this limitation and identifying the derivation of approximation rates as an important direction for future research. The core claims of the paper remain unaffected.
revision: no

Circularity Check

0 steps flagged

No circularity: derivations use standard adjoint and Wasserstein tools with independent NTK characterization

The paper's core chain proceeds from well-posedness of the infinite-depth forward pass via flow maps satisfying ODEs in function spaces, through adjoint sensitivity to obtain the explicit conditional Wasserstein gradient, to existence/uniqueness of gradient-flow curves in that metric, and finally to global convergence under the NTK-injectivity assumption when initial loss is small. The injectivity result is stated as an independent necessary-and-sufficient characterization (linear independence of log-sum-exp functions modulo affine maps) verified on concrete distributions; it is not obtained by fitting inside the paper nor by self-referential definition. All steps rely on classical PDE/OT techniques whose validity does not presuppose the target convergence statement. No load-bearing self-citation, ansatz smuggling, or renaming of known empirical patterns occurs. The derivation is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The framework rests on standard results from analysis and optimal transport rather than new free parameters or ad-hoc entities; the main modeling choice is the mean-field representation itself.

axioms (2)

[standard math] Well-posedness of ODEs in appropriate function spaces for the token flow maps
Invoked to characterize the forward pass of the infinitely deep transformer.
[domain assumption] Existence and uniqueness of gradient flows in the conditional Wasserstein metric
Used to guarantee that the training dynamics are well-defined.

invented entities (1)

[no independent evidence] Conditional Wasserstein metric on the space of attention parameters coupled to token distributions
purpose: To equip the mean-field training problem with a geometry in which gradient flows can be defined and analyzed
Introduced as the natural geometry for the coupled token-and-attention system; no independent empirical evidence supplied.

[1] Kwangjun Ahn, Xiang Cheng, Hadi Daneshmand, and Suvrit Sra. Transformers learn to implement preconditioned gradient descent for in-context learning. In Advances in Neural Information Processing Systems, volume 36, 2023.

[2] Ekin Akyürek, Dale Schuurmans, Jacob Andreas, Tengyu Ma, and Denny Zhou. What learning algorithm is in-context learning? Investigations with linear models. In International Conference on Learning Representations, 2023.

[3] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning, pages 242–252. PMLR, 2019.

[4] Luigi Ambrosio. Transport equation and Cauchy problem for non-smooth vector fields. In Calculus of variations and nonlinear partial differential equations, pages 1–41. Springer, 2008.

[5] Luigi Ambrosio and Nicola Gigli. A user's guide to optimal transport. In Benedetto Piccoli and Michel Rascle, editors, Modelling and Optimisation of Flows on Networks, volume 2062 of Lecture Notes in Mathematics, pages 1–155. Springer, Berlin, Heidelberg, 2012.

[6] Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient Flows: In Metric Spaces and in the Space of Probability Measures. Lectures in Mathematics ETH Zürich. Birkhäuser, Basel, second edition, 2008.

[7] Vladimir Arnold. Sur la géométrie différentielle des groupes de Lie de dimension infinie et ses applications à l'hydrodynamique des fluides parfaits. Annales de l'Institut Fourier, 16(1):319–361, 1966.

[8] Hedy Attouch, Giuseppe Buttazzo, and Gérard Michaille. Variational analysis in Sobolev and BV spaces: applications to PDEs and optimization. SIAM, 2014.

[9] Bernd Aulbach and Thomas Wanner. Integral manifolds for Carathéodory type differential equations in Banach spaces. In Bernd Aulbach and Fritz Colonius, editors, Six Lectures on Dynamical Systems, pages 45–119. World Scientific, Singapore, 1996.

[10] Raphaël Barboni, Gabriel Peyré, and François-Xavier Vialard. Understanding the training of infinitely deep and wide ResNets with conditional optimal transport. Communications on Pure and Applied Mathematics, 78(11):2149–2205, 2025.

[11] Valérie Castin, Pierre Ablin, and Gabriel Peyré. How smooth is attention? In International Conference on Machine Learning, pages 5817–5840. PMLR, 2024.

[12] Sourav Chatterjee. Convergence of gradient descent for deep neural networks. arXiv preprint arXiv:2203.16462, 2022.

[13] Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David Duvenaud. Neural ordinary differential equations. Advances in Neural Information Processing Systems, 31, 2018.

[14] Lénaïc Chizat and Francis Bach. On the global convergence of gradient descent for over-parameterized models using optimal transport. Advances in Neural Information Processing Systems, 31:3036–3046, 2018.

[15] Ennio De Giorgi. New problems on minimizing movements. In Claudio Baiocchi and Jacques-Louis Lions, editors, Boundary Value Problems for Partial Differential Equations and Applications, volume 29 of RMA Research Notes in Applied Mathematics, pages 81–98. Masson, Paris, 1993.

[16] Klaus Deimling. Ordinary Differential Equations in Banach Spaces, volume 596 of Lecture Notes in Mathematics. Springer, Berlin, Heidelberg, 1977.

[17] Lorenzo Dello Schiavo, Jan Maas, and Francesco Pedrotti. Local conditions for global convergence of gradient flows and proximal point sequences in metric spaces. Transactions of the American Mathematical Society, 377(06):3779–3804, 2024.

[18] Joe Diestel and Jerry J. Uhl, Jr. Vector Measures, volume 15 of Mathematical Surveys. American Mathematical Society, Providence, RI, 1977.

[19] Zhiyan Ding, Shi Chen, Qin Li, and Stephen Wright. On the global convergence of gradient descent for multi-layer ResNets in the mean-field regime. arXiv preprint arXiv:2110.02926, 2021.

[20] Simon Du, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. In International Conference on Machine Learning, pages 1675–

[21] Takashi Furuya, Maarten V. de Hoop, and Gabriel Peyré. Transformers are universal in-context learners. In International Conference on Learning Representations, 2025.

[22] Cheng Gao, Yuan Cao, Zihao Li, Yihan He, Mengdi Wang, Han Liu, Jason Klusowski, and Jianqing Fan. Global convergence in training large-scale Transformers. Advances in Neural Information Processing Systems, 37:29213–29284, 2024.

[23] Shivam Garg, Dimitris Tsipras, Percy S. Liang, and Gregory Valiant. What can Transformers learn in-context? A case study of simple function classes. Advances in Neural Information Processing Systems, 35:30583–30598, 2022.

[24] Borjan Geshkovski, Cyril Letrouit, Yury Polyanskiy, and Philippe Rigollet. The emergence of clusters in self-attention dynamics. Advances in Neural Information Processing Systems, 36:57026–57037, 2023.

[25] Daniel Hauer and José Mazón. Kurdyka–Łojasiewicz–Simon inequality for gradient flows in metric spaces. Transactions of the American Mathematical Society, 372(7):4917–4976, 2019.

[26] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[27] Noboru Isobe. A convergence result of a continuous model of deep learning via Łojasiewicz–Simon inequality. arXiv preprint arXiv:2311.15365, 2023.

[28] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. Advances in Neural Information Processing Systems, 31, 2018.

[29] Richard Jordan, David Kinderlehrer, and Felix Otto. The variational formulation of the Fokker–Planck equation. SIAM Journal on Mathematical Analysis, 29(1):1–17, 1998.

[30] Chaoyue Liu, Libin Zhu, and Mikhail Belkin. On the linearity of large non-linear models: When and why the tangent kernel is constant. Advances in Neural Information Processing Systems, 33, 2020.

[31] Yiping Lu, Chao Ma, Yulong Lu, Jianfeng Lu, and Lexing Ying. A mean field analysis of deep ResNet and beyond: Towards provably optimization via overparameterization from depth. In International Conference on Machine Learning, pages 6426–6436. PMLR, 2020.

[32] Arvind Mahankali, Tatsunori B. Hashimoto, and Tengyu Ma. One step of gradient descent is provably the optimal in-context learner with one layer of linear self-attention. In International Conference on Learning Representations, 2024.

[33] Pierre Marion, Adeline Fermanian, Gérard Biau, and Jean-Philippe Vert. Scaling ResNets in the large-depth regime. Journal of Machine Learning Research, 26(56):1–48, 2025.

[34] Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115(33):E7665–E7671, 2018.

[35] Charles A. Micchelli, Yuesheng Xu, and Haizhang Zhang. Universal kernels. Journal of Machine Learning Research, 7(95):2651–2667, 2006.

[36] Jan Peszek and David Poyato. Heterogeneous gradient flows in the topology of fibered optimal transport. Calculus of Variations and Partial Differential Equations, 62(9):258, 2023.

[37] Benedetto Piccoli, Francesco Rossi, and Emmanuel Trélat. Control to flocking of the kinetic Cucker–Smale model. SIAM Journal on Mathematical Analysis, 47(6):4685–4719, 2015.

[38] Zhen Qin, Jinxin Zhou, Jiachen Jiang, and Zhihui Zhu. On the convergence of gradient descent on learning Transformers with residual connections. IEEE Signal Processing Letters, pages 1–5, 2026.

[39] Grant Rotskoff and Eric Vanden-Eijnden. Parameters as interacting particles: long time convergence and asymptotic error scaling of neural networks. Advances in Neural Information Processing Systems, 31, 2018.

[40] Michael E. Sander, Pierre Ablin, Mathieu Blondel, and Gabriel Peyré. Sinkformers: Transformers with doubly stochastic attention. In International Conference on Artificial Intelligence and Statistics, pages 3515–3530. PMLR, 2022.

[41] Filippo Santambrogio. Optimal Transport for Applied Mathematicians: Calculus of Variations, PDEs, and Modeling, volume 87 of Progress in Nonlinear Differential Equations and Their Applications. Birkhäuser, Cham, 2015.

[42] Filippo Santambrogio. Euclidean, metric, and Wasserstein gradient flows: An overview. Bulletin of Mathematical Sciences, 7:87–154, 2017.

[43] Bingqing Song, Boran Han, Shuai Zhang, Jie Ding, and Mingyi Hong. Unraveling the gradient descent dynamics of Transformers. Advances in Neural Information Processing Systems, 37:92317–92351, 2024.

[44] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

[45] Cédric Villani. Optimal transport: old and new, volume 338. Springer, 2009.

[46] Yongtao Wu, Fanghui Liu, Grigorios Chrysos, and Volkan Cevher. On the convergence of encoder-only shallow Transformers. Advances in Neural Information Processing Systems, 36:52197–52237, 2023.

[47] Laurent Younes. Shapes and Diffeomorphisms, volume 171 of Applied Mathematical Sciences. Springer Berlin Heidelberg, Berlin, Heidelberg, 2010.

[48] Hongyi Zhang, Yann N. Dauphin, and Tengyu Ma. Fixup initialization: Residual learning without normalization. In International Conference on Learning Representations, 2019.

A Proofs of section 2

This section is devoted to proving the results in section 2. Recall that we are considering Attention layers of the form eq. (1), parameterized by triplets θ = (Q, ...

= 0, which leads to the desired result.

The above result relies on the following two lemmas, showing local Lipschitz regularity of the adjoint variables and of the gradient field w.r.t. the parameterization.

Lemma 9. Fix an input token distribution $\mu \in \mathcal{P}_c(\mathbb{R}^d)$ and an input token $x \in \mathbb{R}^d$. Let $S$ be some compact set s.t. $\mathrm{Supp}(\mu) \cup \{x\} \subset S$. Then the associated adjoint...

There exists a function $C_0 \in L^1([0,1])$ s.t. for every $s \in [0,1]$ it holds for every radius $R > 0$ and every $\mu \in \mathcal{P}_c(\mathbb{R}^d)$ supported in $B(0, R)$: $\|V_s[\mu]\|_{C^0} \le C_0(s)(1 + R)$.

For every radius $R > 0$ there exist functions $L_R, M_R \in L^1([0,1])$ s.t. for every $\mu \in \mathcal{P}_c(\mathbb{R}^d)$ supported in $B(0, R)$ it holds, for all $x, y \in \mathbb{R}^d$: $\|V_s[\mu](x) - V_s[\mu](y)\| \le L_R(s)\|x - y\|$, and for every $\mu, \nu \in \mathcal{P}_c(\mathbb{R}^d)$ supported on $B(0, R)$: $\|V_s[\mu] - V_s[\nu]\|_{C^0(B(0,R))} \le M_R(s) W_1(\mu, \nu)$.

Under those assumptions, one can show the existence and uniqueness of the transport equation with velocity-fi...

for every $x \in X$ the map $t \in I \mapsto f(t, x) \in X$ is measurable,

for a.e. $t \in I$, the map $x \in X \mapsto f(t, x) \in X$ is continuous.

• We say $f$ is locally $L^1$-Lipschitz if for every bounded subset $V \subset X$ there exists a function $L_V \in L^1_{\mathrm{loc}}(I)$ such that for a.e. $t \in I$ it holds: $\forall x, y \in V$, $\|f(t, x) - f(t, y)\| \le L_V(t)\|x - y\|$.

• We say $f$ has $L^1$-linear growth if there exists a function $C \in L^1_{\mathrm{loc}}(I)$ such that for a.e. $t \in I$ it holds: $\forall x \in X$, $\|f(t, x)\| \le C(t)(1 + \|x\|)$.

Theo...