math.OC

Optimization and Control

Operations research, linear programming, control theory, systems theory, optimal control, game theory

math.OC 2026-05-18 2 theorems

Gradient flow reaches global minima for infinite-depth transformers

by Raphaël Barboni, Maarten V. de Hoop +2 more

Training Infinitely Deep and Wide Transformers

when initial loss is small and log-sum-exp functions remain linearly independent modulo affine functions

Transformers have become the dominant architecture in modern machine learning, yet the theoretical understanding of their training dynamics remains limited. This paper develops a rigorous mathematical framework for analyzing gradient-based training of transformers in the mean-field regime, where both the depth (number of layers) and width (number of attention heads) tend to infinity. While ResNet training can be understood as controlling a neural ODE, transformer training corresponds to controlling a neural PDE, due to the coupling of multiple token distributions through the attention mechanism. Our mean-field model features two types of measure representations: token distributions evolving through layers and attention parameters at each layer. We establish well-posedness of the forward pass through infinitely deep transformers, characterizing token evolution via flow maps that satisfy ODEs in function spaces. Using adjoint sensitivity analysis, we derive an explicit formula for the conditional Wasserstein gradient of the training risk, involving adjoint variables governed by backward ODEs. We prove the existence and uniqueness of gradient flow curves in the conditional Wasserstein metric space, establishing a rigorous foundation for gradient-based transformer training. A key technical contribution is providing necessary and sufficient conditions for injectivity of the Neural Tangent Kernel (NTK) for attention mechanisms: we show that NTK injectivity is equivalent to linear independence of log-sum-exp functions modulo affine functions, a condition satisfied by diverse token distributions, including discrete distributions, uniform distributions, and Gaussian mixtures. Under this NTK injectivity assumption, we prove that gradient flow converges to global minima when the initial loss is sufficiently small, eliminating spurious local minima from the optimization landscape.
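
To make the forward pass described in this abstract concrete, here is a minimal finite sketch in which depth is discretised by explicit Euler steps and the token distribution is an empirical measure with `n` atoms. The function names, the averaging over heads, and the choice to reuse the same heads at every layer are illustrative assumptions, not the paper's construction.

```python
import numpy as np

def attention_velocity(X, Q, K, V):
    """Velocity field of one softmax attention head, evaluated at each token.

    X: (n, d) array of tokens, read as an empirical measure with n atoms.
    Q, K, V: (d, d) query, key and value matrices (hypothetical parameters).
    """
    scores = (X @ Q.T) @ (X @ K.T).T               # (n, n) pairwise logits
    scores -= scores.max(axis=1, keepdims=True)    # stabilise the exponentials
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # each row sums to one
    return weights @ (X @ V.T)

def forward(X, heads, n_layers):
    """Explicit Euler scheme over depth s in [0, 1], averaging over heads.

    Assumption: the same list of heads is reused at every layer.
    """
    ds = 1.0 / n_layers
    for _ in range(n_layers):
        v = np.mean([attention_velocity(X, Q, K, V) for Q, K, V in heads], axis=0)
        X = X + ds * v
    return X
```

Increasing `n_layers` and the number of heads is the finite analogue of the infinite-depth, infinite-width limit the abstract refers to; the coupling between tokens through `weights` is what makes the limit a PDE rather than an ODE.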
math.OC 2026-05-19 2 theorems

Primal-dual trajectories converge without Lipschitz gradients for α ≥ 3

by Xin He, Nan-jing Huang +2 more

Trajectory convergence and o(t⁻²) rates for Nesterov accelerated primal-dual dynamics without Lipschitz gradient assumption

Finite-dimensional Bregman arguments establish convergence at critical damping and improved rates for stronger damping.

We consider the Nesterov accelerated primal-dual dynamical system \[ \begin{cases} \ddot{x}(t)+\dfrac{\alpha}{t}\dot{x}(t) +\nabla f(x(t)) +A^\top\bigl(\lambda(t)+\theta t\dot{\lambda}(t)\bigr)+\beta A^\top(Ax(t)-b)=0,\\[0.6em] \ddot{\lambda}(t)+\dfrac{\alpha}{t}\dot{\lambda}(t) -\bigl(A(x(t)+\theta t\dot{x}(t))-b\bigr)=0, \end{cases} \] which is linked to the linearly constrained optimization problem $\min_{x\in\mathbb{R}^n} f(x)\ \text{s.t.}\ Ax=b$, where $\alpha\ge 3$ and $f$ is convex and continuously differentiable. In a Hilbert framework, the weak convergence of its trajectory was established by Boț and Nguyen (J. Differential Equations, 303:369--406, 2021) under $\alpha>3$ and the Lipschitz continuity assumption on $\nabla f$. In this paper, we prove in finite-dimensional spaces that the trajectory converges to a primal-dual solution for $\alpha\ge3$, without assuming Lipschitz continuity of $\nabla f$. Moreover, when $\alpha>3$, we establish improved $o(t^{-2})$ convergence rates for both the objective residual and the feasibility violation. Our analysis relies on Bregman-distance arguments instead of the Lipschitz continuity of $\nabla f$. The same strategy can also be extended to time-scaled primal-dual dynamics to obtain analogous convergence results. To the best of our knowledge, these are the first results on this topic that do not assume a Lipschitz gradient. Our results also provide the first convergence analysis for the trajectory of the accelerated primal-dual dynamical system in the critical case $\alpha=3$.
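
The system in this abstract can be explored numerically by rewriting it as a first-order system in $(x,\lambda,\dot x,\dot\lambda)$ and integrating from some $t_0>0$. The sketch below does this with a classical Runge-Kutta scheme on a toy problem. The values $\theta=0.5$, $\beta=1$, the objective $f(x)=\tfrac12\|x\|^2$, the data `A`, `b`, the start time and the step size are all illustrative assumptions that do not come from the paper.

```python
import numpy as np

def make_rhs(grad_f, A, b, alpha=3.0, theta=0.5, beta=1.0):
    """First-order form of the dynamics; z = [x, lam, x_dot, lam_dot]."""
    m, n = A.shape

    def rhs(t, z):
        x, lam = z[:n], z[n:n + m]
        vx, vlam = z[n + m:2 * n + m], z[2 * n + m:]
        # x'' = -(alpha/t) x' - grad f(x) - A^T(lam + theta t lam') - beta A^T(Ax - b)
        ax = (-(alpha / t) * vx - grad_f(x)
              - A.T @ (lam + theta * t * vlam) - beta * A.T @ (A @ x - b))
        # lam'' = -(alpha/t) lam' + A(x + theta t x') - b
        alam = -(alpha / t) * vlam + (A @ (x + theta * t * vx) - b)
        return np.concatenate([vx, vlam, ax, alam])

    return rhs

def rk4(rhs, z0, t0, t1, dt):
    """Classical fourth-order Runge-Kutta integration from t0 to t1."""
    z, t = np.asarray(z0, dtype=float), t0
    for _ in range(int(round((t1 - t0) / dt))):
        k1 = rhs(t, z)
        k2 = rhs(t + dt / 2, z + dt / 2 * k1)
        k3 = rhs(t + dt / 2, z + dt / 2 * k2)
        k4 = rhs(t + dt, z + dt * k3)
        z = z + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        t += dt
    return z

# Toy problem (assumed): minimise 0.5*||x||^2 subject to x_1 + x_2 = 1.
A = np.array([[1.0, 1.0]])
b = np.array([1.0])
rhs = make_rhs(lambda x: x, A, b)
z_end = rk4(rhs, np.zeros(6), t0=1.0, t1=20.0, dt=0.01)
feasibility = np.linalg.norm(A @ z_end[:2] - b)  # ||Ax - b|| at the final time
```

For this toy problem the primal-dual solution is $x^\star=(0.5,0.5)$, $\lambda^\star=-0.5$, and it is a rest point of `rhs`; printing `feasibility` for increasing `t1` gives a rough empirical view of the decay that the abstract quantifies.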
math.OC 2026-05-15 2 theorems

Everywhere regularity in bilevel problems is non-prevalent

by Xiaotian Jiang, Chang He +2 more

On the Nature of Regularity Assumptions in Bilevel Optimization with Constrained Lower-level Problem

Structural invariants cannot be made consistent by small perturbations, yet the conditions hold almost everywhere after generic random ones.

In this paper, we study the regularity assumptions commonly adopted in bilevel optimization with constrained lower-level problems, including the linear independence constraint qualification, the strict complementary slackness condition, and the second-order sufficient condition. These conditions are typically required to hold for the lower-level problem at every upper-level variable $x$. We first show that the requirement that these conditions hold at every upper-level variable $x$ is strong, in the sense that it is non-prevalent: there exist problems for which no sufficiently small perturbation of the lower-level objective and constraints can make the conditions hold at every $x$. To establish the result, we prove rigidity theorems showing that certain structural quantities of the lower-level problem must remain invariant across all $x$ whenever these conditions hold everywhere. We then construct explicit counterexamples in which these invariants differ between two values of $x$. In contrast, we show that the weaker requirement, that these conditions hold at almost every $x$, is a weak assumption, in the sense that it is prevalent: with probability one over a random perturbation of the lower-level objective and constraints, each condition holds at almost every $x$. We further analyze the gap between the two requirements. Although the ``every $x$'' and ``almost every $x$'' versions differ only on a measure-zero set, we show that this difference introduces fundamental difficulties in both theory and computation for bilevel optimization.
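
The distinction between "every $x$" and "almost every $x$" can be seen on a one-dimensional toy lower-level problem, $\min_y \tfrac12 (y-x)^2$ subject to $y\ge 0$. This example is an illustrative assumption and is not taken from the paper. Its KKT pair is available in closed form, and strict complementarity fails only at the single point $x=0$, where the constraint is active with a zero multiplier.

```python
def lower_level_kkt(x):
    """Closed-form KKT pair (y, mu) of min_y 0.5*(y - x)**2 s.t. y >= 0.

    Stationarity reads y - x - mu = 0 with mu >= 0, y >= 0 and mu * y = 0.
    """
    y = max(x, 0.0)    # projection of x onto the feasible set
    mu = max(-x, 0.0)  # multiplier of the constraint y >= 0
    return y, mu

def strict_complementarity_holds(x, tol=1e-12):
    """True unless the constraint is active with a (numerically) zero multiplier."""
    y, mu = lower_level_kkt(x)
    active = y <= tol
    return (not active) or mu > tol
```

In this toy problem the constraint gradient is the constant 1 and the lower-level Hessian is 1, so the linear independence constraint qualification and the second-order sufficient condition hold for all $x$; only strict complementarity breaks, and only on a measure-zero set.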
math.OC 2026-05-13 2 theorems

Mixed scores in diffusion reduce to geometric potential

by Kang Liu, Enrique Zuazua

Geometric Asymptotics of Score Mixing and Guidance in Diffusion Models

Small-time dynamics governed by weighted squared distances to data supports, for both mixture and amplified guidance

Diffusion models are routinely guided in practice by combining multiple score fields, yet the mathematical structure of score mixing is still poorly understood. We study the small-time generation dynamics driven by mixed scores $$ s=\lambda\,\nabla\log u_1+(1-\lambda)\,\nabla\log u_2,\qquad \lambda\ge 0, $$ in the heat-flow framework, where $u_1,u_2$ are heat evolutions of two compactly supported probability measures. This single formulation covers both the mixture-of-experts regime $(0\leq \lambda\leq 1)$ and the classifier-free guidance regime $(\lambda>1)$. Exploiting a Laplace-Varadhan principle under a similarity-time rescaling, we show that the small-time generation dynamics is governed by the explicit geometric potential $$ \Phi_\lambda=\lambda d_1^2+(1-\lambda)d_2^2, $$ which depends only on the supports of the initial measures and on the mixing parameter. This gives a rigorous reduction from a singular, non-autonomous score-driven dynamics to autonomous Clarke-type subgradient inclusions. In the empirical setting of finite Dirac mixtures, the limiting potential is piecewise quadratic with a Voronoi-type structure; this rigidity yields convergence of all autonomous limiting trajectories to critical points and a conditional convergence criterion for the original generation flow toward local minimizers of the potential, with rate $\mathcal O(\sqrt t)$ in the smooth stable case.
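
For finite Dirac mixtures, the potential $\Phi_\lambda=\lambda d_1^2+(1-\lambda)d_2^2$ can be evaluated directly, reading $d_i$ as the Euclidean distance to the support of the $i$-th measure. The sketch below is a minimal illustration; the function names and the representation of each support as an array of points are assumptions.

```python
import numpy as np

def sq_dist_to_support(x, support):
    """Squared Euclidean distance from x to the nearest point of a finite support."""
    diffs = np.asarray(support, dtype=float) - np.asarray(x, dtype=float)
    return float(np.min(np.sum(diffs ** 2, axis=1)))

def mixed_potential(x, support1, support2, lam):
    """Phi_lam(x) = lam * d_1(x)^2 + (1 - lam) * d_2(x)^2."""
    return (lam * sq_dist_to_support(x, support1)
            + (1.0 - lam) * sq_dist_to_support(x, support2))
```

With $0\le\lambda\le 1$ both weights are nonnegative, whereas $\lambda>1$ makes the weight on $d_2^2$ negative, so the potential rewards moving away from the second support. The minimum over support points inside `sq_dist_to_support` is what produces the piecewise quadratic, Voronoi-type structure mentioned in the abstract.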