pith. sign in

arxiv: 2412.06731 · v4 · submitted 2024-12-09 · 🧮 math.OC

Beyond Minimax Optimality: A Subgame Perfect Gradient Method

Pith reviewed 2026-05-23 07:32 UTC · model grok-4.3

classification 🧮 math.OC
keywords convex optimizationfirst-order methodsdynamic optimalitysubgame perfect equilibriumoptimized gradient methodconvergence ratesworst-case analysis
0
0 comments X

The pith

The Subgame Perfect Gradient Method is dynamically optimal at every iteration given the first-order information observed so far.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops the Subgame Perfect Gradient Method as a refinement of the Optimized Gradient Method that incorporates the full history of gradients and function values at each step. It establishes that this method is dynamically optimal by showing it forms a subgame perfect equilibrium, so that no other first-order method can guarantee a strictly better convergence rate on every smooth convex function consistent with the data revealed up to that point. This moves past the classic minimax setting, where the worst-case rate is fixed in advance regardless of what the algorithm learns during execution. A sympathetic reader would care because the result suggests first-order methods can deliver stronger, information-dependent guarantees rather than always preparing for the single hardest function.

Core claim

SPGM refines OGM to make use of the full history of first-order information and is dynamically optimal in the sense that in each iteration, no other algorithm can offer a strictly better convergence rate on all functions which agree with the observed first-order information up to that iteration, with this notion formalized using the game-theoretic concept of a subgame perfect equilibrium.

What carries the argument

The Subgame Perfect Gradient Method (SPGM), a first-order iteration that at each step solves for the update guaranteeing the best possible remaining convergence rate against all functions consistent with the observed data.

If this is right

  • At every iteration SPGM's convergence bound is the best possible among all first-order methods given the history.
  • SPGM therefore improves on OGM once any nontrivial first-order information has been collected.
  • Numerical experiments confirm that SPGM strongly outperforms OGM on standard test problems.
  • The same dynamic-optimality argument applies to any problem class where first-order information can be revealed incrementally.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same equilibrium construction could be used to derive optimal methods for problems where only stochastic or inexact gradients are available.
  • It may be possible to replace the fixed minimax analysis of many existing methods with a sequence of subgame equilibria that tighten as data arrives.
  • The approach suggests designing new methods by recursively solving the equilibrium problem rather than a single static optimization over step coefficients.

Load-bearing premise

Modeling the choice of first-order methods as a game in which dynamic optimality is captured exactly by subgame perfect equilibrium identifies when one method is strictly better than others given the same information.

What would settle it

A concrete first-order method together with a smooth convex function that matches all observed first-order information up to iteration k but from that point achieves a strictly better convergence rate than SPGM on the remaining possible functions.

Figures

Figures reproduced from arXiv: 2412.06731 by Alex L. Wang, Benjamin Grimmer, Kevin Shu.

Figure 1
Figure 1. Figure 1: Left: The first five iterates of OGM on f(x) = Lx2/2 with x0 = 1. OGM produces x4 ≈ 0.304. Right: The first three iterates of SPGM on f(x) = Lx2/2 with x0 = 1. After seeing the history H = {(x0, f0, g0),(x1, f1, g1)}, SPGM determines that 0 is a minimizer for any L-smooth convex function agreeing with H. In effect, the history H completely determines the function f on the interval [x0, x1]. fhard so that f… view at source ↗
Figure 2
Figure 2. Figure 2: Two representative plots of convergence of [PITH_FULL_IMAGE:figures/full_fig_p022_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Two representative plots of 1/τn,N as a function of n as SPGM-10 and SPGM are run on Least Squares Regression problems of dimension d = 512, m = 2048. OGM displays its constant guarantee, 1/τ0,N , providing the baseline, non-adaptive guarantee. Left instance is of the form (18) with N = 300 and right instance is of the form (21) with N = 30. 6.2 A sample of randomly generated problem instances We consider … view at source ↗
Figure 4
Figure 4. Figure 4: Performance comparison over 42 randomly generated instances for the problems (18)–(23). From left to right, the target accuracy ranges as 10−3 , 10−6 , 10−9 . First, we consider (regularized) least squares regression problems defined in the following four ways: min x∈Rd f(x) = 1 m ∥Ax − b∥ 2 2 , (18) min x∈Rd f(x) = 1 m ∥Ax − b∥ 2 2 + 1 2 ∥x∥ 2 2 , (19) min x∈Rd f(x) = 1 m ∥Ax − b∥ 2 2 + h100(∥x∥2), (20) m… view at source ↗
Figure 5
Figure 5. Figure 5: Performance comparison over twelve instances derived from LIBSVM data for the regression [PITH_FULL_IMAGE:figures/full_fig_p024_5.png] view at source ↗
read the original abstract

The study of convex optimization has historically been concerned with worst-case convergence rates. The development of the Optimized Gradient Method (OGM), due to \citet{drori2012PerformanceOF,Kim2016optimal}, marked a major milestone in this study, as OGM achieves the optimal worst-case convergence rate among all first-order methods for unconstrained smooth convex optimization. In order to examine the possibility of obtaining stronger convergence guarantees, we will consider algorithms with \emph{dynamic} convergence rates, which may improve as additional first-order information is revealed. Our main contribution is the development of an algorithm, the Subgame Perfect Gradient Method (SPGM), which refines OGM to make use of the full history of first-order information. We show that SPGM is \emph{dynamically optimal}, in the sense that in each iteration, no other algorithm can offer a strictly better convergence rate on all functions which agree with the observed first-order information up to that iteration. We formalize this notion of dynamic optimality using the game-theoretic notion of a subgame perfect equilibrium. We conclude our study with preliminary numerical experiments showing that SPGM strongly outperforms OGM.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 1 minor

Summary. The paper introduces the Subgame Perfect Gradient Method (SPGM) as a refinement of the Optimized Gradient Method (OGM) that uses the full history of first-order information. It claims SPGM is dynamically optimal: at each iteration, no other first-order method can guarantee a strictly better convergence rate on all functions consistent with the observed gradients up to that point. This notion is formalized via subgame perfect equilibrium in an appropriate game. The work concludes with preliminary numerical experiments indicating that SPGM outperforms OGM.

Significance. If the dynamic optimality result holds, the paper advances convex optimization beyond static minimax rates by introducing an adaptive, information-dependent optimality concept with a rigorous game-theoretic foundation. The explicit construction of SPGM that realizes subgame-perfect optimality and the preliminary experiments showing practical gains are strengths. This framing could guide development of adaptive first-order methods that improve as more gradient information is revealed.

minor comments (1)
  1. [Abstract] Abstract and § on experiments: the description of the numerical results is labeled 'preliminary'; adding details on the specific test functions, problem dimensions, stopping criteria, and quantitative metrics (e.g., iteration counts or function-value gaps) would make the empirical support more reproducible and proportionate to the theoretical claim.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of the paper, recognition of the significance of the dynamic optimality framework, and recommendation for minor revision. We are glad that the game-theoretic formulation and the potential for adaptive first-order methods were viewed favorably.

Circularity Check

0 steps flagged

No significant circularity: dynamic optimality is independently defined

full rationale

The paper's core claim defines dynamic optimality via subgame-perfect equilibrium on the class of functions consistent with observed first-order information; this game-theoretic notion is introduced as a new formalization and does not reduce by construction to the OGM rates or any fitted parameters. SPGM is constructed by refining the known OGM (externally cited) to incorporate full history, but the optimality proof is tied directly to the subgame definition rather than renaming or self-referentially fitting prior results. No self-citation load-bearing steps, ansatz smuggling, or uniqueness theorems imported from the authors appear in the derivation chain. The contribution remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The paper rests on standard domain assumptions of smooth convex optimization and introduces a new algorithmic construction; no free parameters or invented physical entities are indicated in the abstract.

axioms (2)
  • domain assumption The objective function is smooth and convex
    Standard assumption required for first-order methods and OGM comparison.
  • domain assumption First-order information consists of function values and gradients at queried points
    Implicit in the description of history-based refinement of OGM.
invented entities (1)
  • Subgame Perfect Gradient Method (SPGM) no independent evidence
    purpose: To realize dynamic optimality using full first-order history
    New algorithm introduced to achieve the stated dynamic optimality property.

pith-pipeline@v0.9.0 · 5731 in / 1428 out tokens · 28827 ms · 2026-05-23T07:32:26.443914+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages

  1. [1]

    Altschuler and Pablo A

    Jason M. Altschuler and Pablo A. Parrilo. Acceleration by stepsize hedging I: Multi-step descent and the silver stepsize schedule. 2023

  2. [2]

    Altschuler and Pablo A

    Jason M. Altschuler and Pablo A. Parrilo. Acceleration by stepsize hedging II: Silver stepsize schedule for smooth convex optimization. 2023

  3. [3]

    URL https://docs.mosek.com/latest/ juliaapi/index.html

    MOSEK ApS.Optimizer API for Julia 10.2.10, 2019. URL https://docs.mosek.com/latest/ juliaapi/index.html

  4. [4]

    Taylor, and Francis Bach

    Mathieu Barré, Adrien B. Taylor, and Francis Bach. Principled analyses and design of first-order methods with inexact proximal operators.Math. Program., 201(1–2):185–230, December 2022. ISSN 0025-5610

  5. [5]

    SIAM, Philadelphia, PA, 2001

    Aharon Ben-Tal and Arkadi Nemirovski.Lectures on modern convex optimization: analysis, algorithms, and engineering applications. SIAM, Philadelphia, PA, 2001

  6. [6]

    C. G. Broyden. The convergence of a class of double-rank minimization algorithms.Journal of the Institute of Mathematics and Its Applications, 6(1):76–90, 1970

  7. [7]

    Méthode générale pour la résolution des systemes d’équations simultanées.Comp

    Augustin Cauchy et al. Méthode générale pour la résolution des systemes d’équations simultanées.Comp. Rend. Sci. Paris, 25(1847):536–538, 1847

  8. [8]

    LIBSVM: A library for support vector machines.ACM Trans

    Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines.ACM Trans. Intell. Syst. Technol., 2(3), May 2011

  9. [9]

    A robust accelerated optimization algorithm for strongly convex functions

    Saman Cyrus, Bin Hu, Bryan Van Scoy, and Laurent Lessard. A robust accelerated optimization algorithm for strongly convex functions. In2018 Annual American Control Conference (ACC), pages 1376–1381, 2018

  10. [10]

    Optimal convergence rates for the proximal bundle method.SIAM Journal on Optimization, 33(2):424–454, 2023

    Mateo Díaz and Benjamin Grimmer. Optimal convergence rates for the proximal bundle method.SIAM Journal on Optimization, 33(2):424–454, 2023

  11. [11]

    The exact information-based complexity of smooth convex minimization.J

    Yoel Drori. The exact information-based complexity of smooth convex minimization.J. Complex., 39: 1–16, 2017

  12. [12]

    Yoel Drori and Adrien B. Taylor. Efficient first-order methods for convex minimization: a constructive approach.Mathematical Programming, 184:183 – 220, 2018

  13. [13]

    Yoel Drori and Adrien B. Taylor. On the oracle complexity of smooth strongly convex minimization.J. Complex., 68:101590, 2021

  14. [14]

    Performance of first-order methods for smooth convex minimization: a novel approach.Math

    Yoel Drori and Marc Teboulle. Performance of first-order methods for smooth convex minimization: a novel approach.Math. Program., 145:451–482, 2012

  15. [15]

    An optimal variant of Kelley’s cutting-plane method.Mathematical Programming, 160:321–351, 2016

    Yoel Drori and Marc Teboulle. An optimal variant of Kelley’s cutting-plane method.Mathematical Programming, 160:321–351, 2016

  16. [16]

    Rate of Convergence of the Bundle Method.J

    Yu Du and Andrzej Ruszczyński. Rate of Convergence of the Bundle Method.J. Optim. Theory Appl., 173(3):908–922, June 2017. ISSN 0022-3239. 25

  17. [17]

    Acceleration methods.Foundations and Trends®in Optimization, 5(1-2):1–245, 2021

    Alexandre d’Aspremont, Damien Scieur, Adrien Taylor, et al. Acceleration methods.Foundations and Trends®in Optimization, 5(1-2):1–245, 2021

  18. [18]

    On the acceleration of proximal bundle methods

    David Fersztand and Xu Andy Sun. On the acceleration of proximal bundle methods. 2025

  19. [19]

    A new approach to variable metric algorithms.The Computer Journal, 13(3):317–322, 1970

    Roger Fletcher. A new approach to variable metric algorithms.The Computer Journal, 13(3):317–322, 1970

  20. [20]

    Mihai I. Florea. An efficient accelerated gradient method with memory applicable to composite problems. In2021 International Aegean Conference on Electrical Machines and Power Electronics (ACEMP) and 2021 International Conference on Optimization of Electrical and Electronic Equipment (OPTIM), pages 473–480, 2021

  21. [21]

    Mihai I. Florea. Exact gradient methods with memory.Optimization Methods and Software, 37(6): 2310–2337, 2022. PMID: 39906221

  22. [22]

    Mihai I. Florea. Adaptive first-order methods with enhanced worst-case rates. 2024

  23. [23]

    Florea and Yurii E

    Mihai I. Florea and Yurii E. Nesterov. An optimal lower bound for smooth convex functions.Foundations of Computational Mathematics, 2025. Published online 25 June 2025

  24. [24]

    A family of variable metric updates derived by variational means.Mathematics of Computation, 24(109):23–26, 1970

    Donald Goldfarb. A family of variable metric updates derived by variational means.Mathematics of Computation, 24(109):23–26, 1970

  25. [25]

    Benjamin Grimmer, Kevin Shu, and Alex L. Wang. Accelerated objective gap and gradient norm convergence for gradient descent via long steps. 2024

  26. [26]

    Benjamin Grimmer, Kevin Shu, and Alex L. Wang. Composing optimized stepsize schedules for gradient descent. 2024

  27. [27]

    Van Parys, and Ernest Ryu

    Shuvomoy Das Gupta, Bart P.G. Van Parys, and Ernest Ryu. Branch-and-bound performance estimation programming: A unified methodology for constructing optimal optimization methods.Math. Program., 2023

  28. [28]

    A Redistributed Proximal Bundle Method for Nonconvex Optimization.SIAM Journal on Optimization, 20(5):2442–2473, 2010

    Warren Hare and Claudia Sagastizábal. A Redistributed Proximal Bundle Method for Nonconvex Optimization.SIAM Journal on Optimization, 20(5):2442–2473, 2010. ISSN 1052-6234

  29. [29]

    Springer Berlin Heidelberg, Berlin, Heidelberg, 1993

    Jean-Baptiste Hiriart-Urruty and Claude Lemaréchal.Acceleration of the Cutting-Plane Algorithm: Primal Forms of Bundle Methods, pages 275–330. Springer Berlin Heidelberg, Berlin, Heidelberg, 1993

  30. [30]

    Uijeong Jang, Shuvomoy Das Gupta, and Ernest K. Ryu. Computer-assisted design of accelerated composite optimization methods: Optista. 2024

  31. [31]

    Accelerated proximal point method for maximally monotone operators.Math

    Donghwan Kim. Accelerated proximal point method for maximally monotone operators.Math. Program., 190(1–2):57–87, November 2021. ISSN 0025-5610

  32. [32]

    Donghwan Kim and Jeffrey A. Fessler. Optimized first-order methods for smooth convex minimization. Math. Program., 159(1–2):81–107, 2016

  33. [33]

    On the convergence analysis of the optimized gradient method.J

    Donghwan Kim and Jeffrey A Fessler. On the convergence analysis of the optimized gradient method.J. Optim. Theory Appl., 172(1):187–205, 2017

  34. [34]

    Donghwan Kim and Jeffrey A. Fessler. Another look at the fast iterative shrinkage/thresholding algorithm (fista).SIAM Journal on Optimization, 28(1):223–250, 2018

  35. [35]

    Donghwan Kim and Jeffrey A. Fessler. Adaptive restart of the optimized gradient method for convex optimization.J. Optim. Theory Appl., 178(1):240–263, July 2018. ISSN 0022-3239

  36. [36]

    Donghwan Kim and Jeffrey A. Fessler. Optimizing the efficiency of first-order methods for decreasing the gradient of smooth convex functions.J. Optim. Theory Appl., 188(1):192–219, 2021. 26

  37. [37]

    A proof of exact convergence rate of gradient descent

    Jungbin Kim. A proof of exact convergence rate of gradient descent. Part II. Performance criterion (f(xn)−f∗)/∥x0−x∗∥2. 2024

  38. [38]

    Krzysztof C. Kiwiel. An Aggregate Subgradient Method for Nonsmooth Convex Minimization.Math. Program., 27(3):320–341, October 1983. ISSN 0025-5610

  39. [39]

    Bundle-Level Type Methods Uniformly Optimal For Smooth And Nonsmooth Convex Optimization.Mathematical Programming, 149(1):1–45, Feb 2015

    Guanghui Lan. Bundle-Level Type Methods Uniformly Optimal For Smooth And Nonsmooth Convex Optimization.Mathematical Programming, 149(1):1–45, Feb 2015. ISSN 1436-4646

  40. [40]

    Springer Berlin Heidelberg, Berlin, Heidelberg, 1975

    Claude Lemaréchal.An Extension of Davidon Methods to Nondifferentiable Problems, pages 95–109. Springer Berlin Heidelberg, Berlin, Heidelberg, 1975. ISBN 978-3-642-00764-4

  41. [41]

    A simple uniformly optimal method without line search for convex optimization: T

    Tianjiao Li and Guanghui Lan. A simple uniformly optimal method without line search for convex optimization: T. li, g. lan.Mathematical Programming, pages 1–38, 2025

  42. [42]

    Jiaming Liang and Renato D. C. Monteiro. A proximal bundle variant with optimal iteration-complexity for a large range of prox stepsizes.SIAM Journal on Optimization, 31(4):2955–2986, 2021

  43. [43]

    On the limited memory BFGS method for large scale optimization

    Dong C Liu and Jorge Nocedal. On the limited memory BFGS method for large scale optimization. Math. Program., 45(1):503–528, 1989

  44. [44]

    JuMP 1.0: Recent improvements to a modeling language for mathematical optimization.Math

    Miles Lubin, Oscar Dowson, Joaquim Dias Garcia, Joey Huchette, Benoît Legat, and Juan Pablo Vielma. JuMP 1.0: Recent improvements to a modeling language for mathematical optimization.Math. Program. Comput., 2023

  45. [45]

    Adaptive gradient descent without descent

    Yura Malitsky and Konstantin Mishchenko. Adaptive gradient descent without descent. 2019

  46. [46]

    Optim: A mathematical optimization package for Julia.J

    Patrick Kofod Mogensen and Asbjørn Nilsen Riseth. Optim: A mathematical optimization package for Julia.J. Open Source Softw., 3(24):615, 2018

  47. [47]

    A. S. Nemirovsky and D. B. Yudin.Problem complexity and method efficiency in optimization. Wiley- Interscience, New York, 1983

  48. [48]

    Information-based complexity of linear operator equations.J

    Arkadi S Nemirovsky. Information-based complexity of linear operator equations.J. Complex., 8(2): 153–175, 1992

  49. [49]

    Universal gradient methods for convex optimization problems.Mathematical Programming, 152(1):381–404, 2015

    Yu Nesterov. Universal gradient methods for convex optimization problems.Mathematical Programming, 152(1):381–404, 2015

  50. [50]

    A method for solving the convex programming problem with convergence rateO(1/k2)

    Yurii Nesterov. A method for solving the convex programming problem with convergence rateO(1/k2). Proceedings of the USSR Academy of Sciences, 269:543–547, 1983

  51. [51]

    Yurii Nesterov and Mihai I. Florea. Gradient methods with memory.Optimization Methods and Software, 37(3):936–953, 2022

  52. [52]

    An introduction to game theory.Oxford University Press, 2:672–713, 2004

    Martin J Osborne. An introduction to game theory.Oxford University Press, 2:672–713, 2004

  53. [53]

    Chanwoo Park, Jisun Park, and Ernest K. Ryu. Factor- √ 2 acceleration of accelerated gradient methods. Applied Mathematics & Optimization, 88:1–38, 2021

  54. [54]

    Exact worst-case convergence rates of gradient descent: a complete analysis for all constant stepsizes over nonconvex and convex functions

    Teodor Rotaru, François Glineur, and Panagiotis Patrinos. Exact worst-case convergence rates of gradient descent: a complete analysis for all constant stepsizes over nonconvex and convex functions. 2024

  55. [55]

    Princeton University Press, Princeton, NJ, USA, 2006

    Andrzej Ruszczynski.Nonlinear Optimization. Princeton University Press, Princeton, NJ, USA, 2006. ISBN 0691119155, 9780691119151

  56. [56]

    David F. Shanno. Conditioning of quasi-newton methods for function minimization.Mathematics of Computation, 24(111):647–656, 1970. 27

  57. [57]

    Stochastic first-order methods: non-asymptotic and computer-aided analyses via potential functions

    Adrien Taylor and Francis Bach. Stochastic first-order methods: non-asymptotic and computer-aided analyses via potential functions. In Alina Beygelzimer and Daniel Hsu, editors,Proceedings of the Thirty-Second Conference on Learning Theory, volume 99 ofProceedings of Machine Learning Research, pages 2934–2992, Phoenix, AZ, USA, 25–28 Jun 2019. PMLR

  58. [58]

    An optimal gradient method for smooth strongly convex minimization

    Adrien Taylor and Yoel Drori. An optimal gradient method for smooth strongly convex minimization. Math. Program., 199(1–2):557–594, jun 2022. ISSN 0025-5610

  59. [59]

    Smooth strongly convex interpolation and exact worst-case performance of first-order methods.Math

    Adrien Taylor, Julien Hendrickx, and François Glineur. Smooth strongly convex interpolation and exact worst-case performance of first-order methods.Math. Program., 161:307–345, 2017

  60. [60]

    Exact worst-case performance of first-order methods for composite convex optimization.SIAM Journal on Optimization, 27(3):1283–1313, 2017

    Adrien Taylor, Julien Hendrickx, and François Glineur. Exact worst-case performance of first-order methods for composite convex optimization.SIAM Journal on Optimization, 27(3):1283–1313, 2017

  61. [61]

    Freeman, and Kevin M

    Bryan Van Scoy, Randy A. Freeman, and Kevin M. Lynch. The fastest known globally convergent first-order method for minimizing strongly convex functions.IEEE Control Systems Letters, 2(1):49–54, 2018

  62. [62]

    Springer Berlin Heidelberg, Berlin, Heidelberg, 1975

    Philip Wolfe.A Method of Conjugate Subgradients for Minimizing Nondifferentiable Functions, pages 145–173. Springer Berlin Heidelberg, Berlin, Heidelberg, 1975. ISBN 978-3-642-00764-4

  63. [63]

    Accelerated gradient descent by concatenation of stepsize schedules

    Zehao Zhang and Rujun Jiang. Accelerated gradient descent by concatenation of stepsize schedules. 2024

  64. [64]

    Anytime acceleration of gradient descent

    Zihan Zhang, Jason D Lee, Simon S Du, and Yuxin Chen. Anytime acceleration of gradient descent. 2024. A Deferred proofs Proof of Lemma 2.We expand f⋆−f+ 0 −L 4∥z1−x⋆∥2 + L 4∥x0−x⋆∥2 =f ⋆−f+ 0 −1 L∥g0∥2 +⟨x0−x⋆,g 0⟩ =f ⋆−f+ 0 − ⟨ g0,x⋆−x+ 0 ⟩ . The expression on the final line isQ⋆,0which is nonnegative by Lemma 1.■ Proof of Lemma 3.By assumption,zis an au...