pith. sign in

arxiv: 2412.17050 · v2 · pith:3R3BOGUBnew · submitted 2024-12-22 · 🧮 math.OC

Linear Convergence Rate in Convex Setup is Possible! Gradient Descent Method Variants under (L₀,L₁)-Smoothness

classification 🧮 math.OC
keywords descentgradientnablawhenavailableconvergenceconvexsmoothness
0
0 comments X
read the original abstract

The gradient descent (GD) method -- is a fundamental and likely the most popular optimization algorithm in machine learning (ML), with a history traced back to a paper in 1847 (Cauchy, 1847). It was studied under various assumptions, including so-called $(L_0,L_1)$-smoothness, which received noticeable attention in the ML community recently. In this paper, we provide a refined convergence analysis of gradient descent and its variants, assuming generalized smoothness. In particular, we show that $(L_0,L_1)$-GD has the following behavior in the convex setup: as long as $\|\nabla f(x^k)\| \geq \frac{L_0}{L_1}$ the algorithm has linear convergence in function suboptimality, and when $\|\nabla f(x^k)\| < \frac{L_0}{L_1}$ is satisfied, $(L_0,L_1)$-GD has standard sublinear rate. Moreover, we also show that this behavior is common for its variants with different types of oracle: Normalized Gradient Descent as well as Clipped Gradient Descent (the case when the full gradient $\nabla f(x)$ is available); Random Coordinate Descent (when the gradient component $\nabla_{i} f(x)$ is available); Random Coordinate Descent with Order Oracle (when only $\text{sign} [f(y) - f(x)]$ is available). In addition, we also extend our analysis of $(L_0,L_1)$-GD to the strongly convex case.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Avoiding Bias in Clipped SGD for Overparameterized Models under Generalized Smoothness

    math.OC 2026-05 unverdicted novelty 7.0

    Clipped and normalized SGD converge without bias in overparameterized interpolating models under (L0,L1)-smoothness, with improved rates and extensions to heavy-tailed noise and weaker smoothness.