Negative Stepsizes Make Gradient-Descent-Ascent Converge
read the original abstract
Efficient computation of min-max problems is a central question in optimization, learning, games, and control. Arguably the most natural algorithm is gradient-descent-ascent (GDA). However, since the 1970s, conventional wisdom has argued that GDA fails to converge even on simple problems. This failure spurred an extensive literature on modifying GDA with additional building blocks such as extragradients, optimism, momentum, anchoring, etc. In contrast, we show that GDA converges in its original form by simply using a judicious choice of stepsizes. The key innovation is the proposal of unconventional stepsize schedules (dubbed slingshot stepsize schedules) that are time-varying, asymmetric, and periodically negative. We show that all three properties are necessary for convergence, and that altogether this enables GDA to converge on the classical counterexamples (e.g., unconstrained convex-concave problems). The core algorithmic intuition is that although negative stepsizes make backward progress, they de-synchronize the min and max variables (overcoming the cycling issue of GDA), and lead to a slingshot phenomenon in which the forward progress in the other iterations is overwhelmingly larger. This results in fast overall convergence. Geometrically, the slingshot dynamics leverage the non-reversibility of gradient flow: positive/negative steps cancel to first order, yielding a second-order net movement in a new direction that leads to convergence and is otherwise impossible for GDA to move in. We interpret this as a second-order finite-differencing algorithm and show that, intriguingly, it approximately implements consensus optimization, an empirically popular algorithm for min-max problems involving deep neural networks (e.g., training GANs).
This paper has not been read by Pith yet.
Forward citations
Cited by 3 Pith papers
-
Lower Bounds for Anytime Acceleration of Gradient Descent
Establishes that no positive stepsize schedule achieves better than o(n^{-1.334}) anytime convergence for function values or o(n^{-1}) for squared gradient norms in smooth convex optimization.
-
Negative Momentum for Convex-Concave Optimization
Negative momentum enables global convergence in convex-concave min-max optimization and accelerated rates in the strongly-convex-strongly-concave setting.
-
Stepsize Hedging: an Alternative Mechanism for Accelerating Gradient Descent
This expository article introduces stepsize hedging as a way to accelerate gradient descent without additional terms like momentum.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.