Gradient Sliding for Composite Optimization

Guanghui Lan

classification math.OC cs.CC stat.ML

keywords component, smooth, algorithms, composite, epsilon, gradient, nonsmooth, class

We consider in this paper a class of composite optimization problems whose objective function is given by the summation of a general smooth and nonsmooth component, together with a relatively simple nonsmooth term. We present a new class of first-order methods, namely the gradient sliding algorithms, which can skip the computation of the gradient for the smooth component from time to time. As a consequence, these algorithms require only ${\cal O}(1/\sqrt{\epsilon})$ gradient evaluations for the smooth component in order to find an $\epsilon$-solution for the composite problem, while still maintaining the optimal ${\cal O}(1/\epsilon^2)$ bound on the total number of subgradient evaluations for the nonsmooth component. We then present a stochastic counterpart for these algorithms and establish similar complexity bounds for solving an important class of stochastic composite optimization problems. Moreover, if the smooth component in the composite function is strongly convex, the developed gradient sliding algorithms can significantly reduce the number of gradient and subgradient evaluations for the smooth and nonsmooth components to ${\cal O} (\log (1/\epsilon))$ and ${\cal O}(1/\epsilon)$, respectively. Finally, we generalize these algorithms to the case when the smooth component is replaced by a nonsmooth one possessing a certain bi-linear saddle point structure.
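To make the "skipping" idea concrete, the following is a minimal sketch of a deterministic gradient-sliding loop, not the paper's reference implementation. It assumes an unconstrained Euclidean setting (no simple term, squared-distance prox function), and the schedules for `beta`, `gamma`, `T`, `p`, and `theta` are illustrative assumptions that should be checked against the paper. The names `gradient_sliding`, `grad_f`, `subgrad_h`, and `D_tilde` are hypothetical. The point it illustrates is that the gradient of the smooth part is evaluated once per outer iteration and then reused across many inner subgradient steps.

```python
import math
import numpy as np


def gradient_sliding(grad_f, subgrad_h, x0, L, M, N, D_tilde=1.0):
    """Sketch of gradient sliding for min f(x) + h(x), unconstrained.

    f is smooth with L-Lipschitz gradient, h is nonsmooth with subgradients
    bounded by M. All parameter schedules below are assumptions.
    """
    x = x0.copy()       # prox center carried between outer iterations
    x_bar = x0.copy()   # averaged output sequence
    calls = {"grad_f": 0, "subgrad_h": 0}

    for k in range(1, N + 1):
        gamma = 2.0 / (k + 1)
        beta = 2.0 * L / k
        # Number of inner (subgradient-only) steps; at least one.
        T = max(1, math.ceil(M**2 * N * k**2 / (D_tilde * L**2)))

        # One gradient of the smooth part per outer iteration.
        x_md = (1.0 - gamma) * x_bar + gamma * x
        g = grad_f(x_md)
        calls["grad_f"] += 1

        # Inner loop: reuse g, query only subgradients of h.
        u = x.copy()
        u_avg = x.copy()
        for t in range(1, T + 1):
            p = t / 2.0
            theta = 2.0 * (t + 1) / (t * (t + 3))
            s = subgrad_h(u)
            calls["subgrad_h"] += 1
            # Closed-form minimizer of
            # <g + s, v> + beta/2 ||v - x||^2 + beta*p/2 ||v - u||^2.
            u = (beta * x + beta * p * u - (g + s)) / (beta * (1.0 + p))
            u_avg = (1.0 - theta) * u_avg + theta * u

        x = u
        x_bar = (1.0 - gamma) * x_bar + gamma * u_avg

    return x_bar, calls


# Toy problem (assumed for illustration): f(x) = 0.5||x - c||^2, h(x) = lam*||x||_1.
c = np.array([1.0, -2.0, 3.0])
lam = 0.1


def grad_f(x):
    return x - c


def subgrad_h(x):
    return lam * np.sign(x)


def objective(x):
    return 0.5 * np.sum((x - c) ** 2) + lam * np.sum(np.abs(x))
```

Calling `gradient_sliding(grad_f, subgrad_h, np.zeros(3), L=1.0, M=lam * np.sqrt(3), N=10)` returns the averaged iterate together with the call counters, which makes the gap between gradient and subgradient evaluations directly visible.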

Forward citations

Cited by 1 Pith paper

Rescaled Asynchronous SGD: Optimal Distributed Optimization under Data and System Heterogeneity
cs.LG 2026-05 unverdicted novelty 6.0

Rescaled ASGD recovers convergence to the true global objective by rescaling worker stepsizes proportional to computation times, matching the known time lower bound in the leading term under non-convex smoothness and ...