Convergence of gradient descent for deep neural networks
2 Pith papers cite this work.
citing papers explorer
- Training Infinitely Deep and Wide Transformers
  Develops a mean-field neural PDE model for transformer training, proves forward-pass well-posedness via function-space ODEs, derives conditional Wasserstein gradients, and shows global convergence of gradient flow under an NTK injectivity condition equivalent to linear independence of log-sum-exp fu
- Gradient flow dynamics of shallow ReLU networks for square loss and orthogonal inputs
  For orthogonal inputs, gradient flow on shallow ReLU nets with MSE loss at small init converges to zero loss, exhibits min-variation-norm bias, initial alignment, and saddle-to-saddle dynamics.