Mirror Descent Under Generalized Smoothness
read the original abstract
Smoothness is crucial for attaining fast rates in first-order optimization. However, many optimization problems in modern machine learning involve non-smooth objectives. Recent studies relax the smoothness assumption by allowing the Lipschitz constant of the gradient to grow with respect to the gradient norm, which accommodates a broad range of objectives in practice. Despite this progress, existing generalizations of smoothness are restricted to Euclidean geometry with $\ell_2$-norm and only have theoretical guarantees for optimization in the Euclidean space. In this paper, we address this limitation by introducing a new $\ell*$-smoothness concept that measures the norm of Hessians in terms of a general norm and its dual, and establish convergence for mirror-descent-type algorithms, matching the rates under the classic smoothness. Notably, we propose a generalized self-bounding property that facilitates bounding the gradients via controlling suboptimality gaps, serving as a principal component for convergence analysis. Beyond deterministic optimization, we establish sharp convergence for stochastic mirror descent, matching state-of-the-art under classic smoothness. Our theory also extends to non-convex and composite optimization, which may shed light on practical usages of mirror descent, including pre-training and post-training of LLMs.
This paper has not been read by Pith yet.
Forward citations
Cited by 1 Pith paper
-
Open Problem: Is AdamW Effective Under Heavy-Tailed Noise?
The paper poses whether AdamW converges under heavy-tailed stochastic gradient noise and supplies a weighted-metric benchmark plus a corridor lower-bound showing how denominator memory can obscure large gradients.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.