Linear Convergence Rate in Convex Setup is Possible! Gradient Descent Method Variants under $(L_0,L_1)$-Smoothness

Aleksandr Lobanov; Alexander Gasnikov; Eduard Gorbunov; Martin Tak\'a\v{c}

arxiv: 2412.17050 · v2 · pith:3R3BOGUBnew · submitted 2024-12-22 · 🧮 math.OC

Linear Convergence Rate in Convex Setup is Possible! Gradient Descent Method Variants under (L₀,L₁)-Smoothness

Aleksandr Lobanov , Alexander Gasnikov , Eduard Gorbunov , Martin Tak\'a\v{c} This is my paper

classification 🧮 math.OC

keywords descentgradientnablawhenavailableconvergenceconvexsmoothness

0 comments

read the original abstract

The gradient descent (GD) method -- is a fundamental and likely the most popular optimization algorithm in machine learning (ML), with a history traced back to a paper in 1847 (Cauchy, 1847). It was studied under various assumptions, including so-called $(L_0,L_1)$-smoothness, which received noticeable attention in the ML community recently. In this paper, we provide a refined convergence analysis of gradient descent and its variants, assuming generalized smoothness. In particular, we show that $(L_0,L_1)$-GD has the following behavior in the convex setup: as long as $\|\nabla f(x^k)\| \geq \frac{L_0}{L_1}$ the algorithm has linear convergence in function suboptimality, and when $\|\nabla f(x^k)\| < \frac{L_0}{L_1}$ is satisfied, $(L_0,L_1)$-GD has standard sublinear rate. Moreover, we also show that this behavior is common for its variants with different types of oracle: Normalized Gradient Descent as well as Clipped Gradient Descent (the case when the full gradient $\nabla f(x)$ is available); Random Coordinate Descent (when the gradient component $\nabla_{i} f(x)$ is available); Random Coordinate Descent with Order Oracle (when only $\text{sign} [f(y) - f(x)]$ is available). In addition, we also extend our analysis of $(L_0,L_1)$-GD to the strongly convex case.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Avoiding Bias in Clipped SGD for Overparameterized Models under Generalized Smoothness
math.OC 2026-05 unverdicted novelty 7.0

Clipped and normalized SGD converge without bias in overparameterized interpolating models under (L0,L1)-smoothness, with improved rates and extensions to heavy-tailed noise and weaker smoothness.