Develops a mean-field neural PDE model for transformer training, proves forward-pass well-posedness via function-space ODEs, derives conditional Wasserstein gradients, and shows global convergence of gradient flow under an NTK injectivity condition equivalent to linear independence of log-sum-exp fu
• We say $f$ is locally $L^1$-Lipschitz if for every bounded subset $V \subset X$ there exists a function $L_V \in L^1_{\mathrm{loc}}(I)$ such that for a.e. $t \in I$: $\forall x, y \in V,\ \|f(t, x) - f(t, y)\| \le L_V(t)\,\|x - y\|$.
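As a quick illustration of this definition (an example chosen here, not taken from the paper), assume $X = \mathbb{R}$ and $f(t, x) = a(t)\,x^2$ with a hypothetical coefficient $a \in L^1_{\mathrm{loc}}(I)$. For a bounded set $V \subset [-R, R]$, the estimate below suggests the choice $L_V(t) = 2R\,|a(t)|$:

```latex
% Illustrative example only (assumed, not from the source):
% X = \mathbb{R}, f(t,x) = a(t) x^2, a \in L^1_{loc}(I), V \subset [-R, R].
\begin{align*}
|f(t,x) - f(t,y)| &= |a(t)|\,|x^2 - y^2| \\
                  &= |a(t)|\,|x + y|\,|x - y| \\
                  &\le 2R\,|a(t)|\,|x - y| \qquad \text{for all } x, y \in V,
\end{align*}
% so L_V(t) = 2R\,|a(t)| belongs to L^1_{loc}(I).
```

The constant grows with $R$, so this $f$ is not globally Lipschitz in $x$ unless $a = 0$ a.e., and $L_V$ need not be bounded in $t$: on $I = [0, 1]$, $a(t) = t^{-1/2}$ is integrable but unbounded near $0$.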
1 Pith paper cites this work. Polarity classification is still indexing.
fields: math.OC (1)
years: 2026 (1)
verdicts: UNVERDICTED (1)
representative citing papers
- Training Infinitely Deep and Wide Transformers