
arxiv: 2502.00753 · v4 · pith:KFWHDLGLnew · submitted 2025-02-02 · 🧮 math.OC · cs.LG

Mirror Descent Under Generalized Smoothness

keywords: smoothness, optimization, norm, convergence, descent, mirror, under, classic

Smoothness is crucial for attaining fast rates in first-order optimization. However, many optimization problems in modern machine learning involve non-smooth objectives. Recent studies relax the smoothness assumption by allowing the Lipschitz constant of the gradient to grow with respect to the gradient norm, which accommodates a broad range of objectives in practice. Despite this progress, existing generalizations of smoothness are restricted to Euclidean geometry with the $\ell_2$-norm and only have theoretical guarantees for optimization in Euclidean space. In this paper, we address this limitation by introducing a new $\ell*$-smoothness concept that measures the norm of Hessians in terms of a general norm and its dual, and we establish convergence for mirror-descent-type algorithms, matching the rates under classic smoothness. Notably, we propose a generalized self-bounding property that facilitates bounding the gradients via controlling suboptimality gaps, serving as a principal component of the convergence analysis. Beyond deterministic optimization, we establish sharp convergence for stochastic mirror descent, matching the state of the art under classic smoothness. Our theory also extends to non-convex and composite optimization, which may shed light on practical uses of mirror descent, including pre-training and post-training of LLMs.
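
For readers unfamiliar with mirror descent, the following is a minimal textbook sketch of the method in a non-Euclidean geometry: the probability simplex with the negative-entropy mirror map (often called exponentiated gradient). It is not the algorithm or step-size rule analyzed in the paper. The function name `entropic_mirror_descent`, the constant step size, and the linear test objective are all assumptions chosen for illustration.

```python
import numpy as np

def entropic_mirror_descent(grad, x0, step_size, num_steps):
    """Generic mirror descent on the simplex with the negative-entropy mirror map.

    Illustrative sketch only: uses a constant step size, which is an assumption
    and not the schedule studied under generalized smoothness.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(num_steps):
        g = grad(x)
        # Gradient step in the dual (log) space induced by the entropy mirror map.
        logits = np.log(x) - step_size * g
        # Shift by the maximum for numerical stability before exponentiating.
        logits -= logits.max()
        x = np.exp(logits)
        # Normalizing maps the iterate back onto the simplex.
        x /= x.sum()
    return x

# Hypothetical objective f(x) = <c, x>; its minimizer over the simplex
# puts all mass on the smallest entry of c (index 1 here).
c = np.array([3.0, 1.0, 2.0])
x_final = entropic_mirror_descent(lambda x: c, np.ones(3) / 3, step_size=0.5, num_steps=50)
```

In this sketch the iterate concentrates on the coordinate with the smallest cost, and every iterate stays strictly inside the simplex, which is the practical appeal of choosing a mirror map matched to the constraint geometry.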


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Open Problem: Is AdamW Effective Under Heavy-Tailed Noise?

    cs.LG 2026-06 unverdicted novelty 5.0

    The paper poses whether AdamW converges under heavy-tailed stochastic gradient noise and supplies a weighted-metric benchmark plus a corridor lower-bound showing how denominator memory can obscure large gradients.